Volume 5, Number 3
September 2001
Using Corpora in Language Teaching and Learning
Columns
From the Editors
Welcome to LLT
by Mark Warschauer, Dorothy Chun & Pamela DaGrossa
p. 1
From the Special Issue Editors
Introducing This Issue
by Chris Tribble & Michael Barlow
pp. 2-3
On the Net
Finding Song Lyrics Online
by Jean W. LeLoup & Robert Ponterio
pp. 4-6
Emerging Technologies
Tools and Trends in Corpora Use for Teaching and Learning
by Bob Godwin-Jones
pp. 7-12
Announcements
News from Sponsoring Organizations
pp. 13-18
Reviews
Edited by Jennifer Leeman
Multilingual Corpora in Teaching and Research
Simon P. Botley, Anthony M. McEnery, & Andrew Wilson (Eds.)
Reviewed by John M. Lawler
pp. 19-23
Patterns and Meanings: Using Corpora for English Language Research and Teaching
Alan Partington
Reviewed by József Horváth
pp. 24-27
Exploring Academic English: A Workbook for Student Essay Writing
Jennifer Thurstun & Christopher Candlin
Reviewed by Paul Thompson
pp. 28-31
MonoConc Pro and WordSmith Tools
Reviewed by Randi Reppen
pp. 32-36
Feature Articles
Genres, Registers, Text Types, Domain, and Styles: Clarifying the Concepts and Navigating a Path Through the BNC Jungle
David YW Lee
Lancaster University
pp. 37-72
Text Categories and Corpus Users: A Response to David Lee (Commentary)
Guy Aston
University of Bologna, Italy
pp. 73-76
An Evaluation of Intermediate Students' Approaches to Corpus Investigation
Claire Kennedy & Tiziana Miceli
Griffith University, Brisbane
pp. 77-90
Looking at Citations: Using Corpora in English for Academic Purposes
Paul Thompson
Reading University
Chris Tribble
King's College London University & Reading University
pp. 91-105
Lexical Behaviour in Academic and Technical Corpora: Implications for ESP Development
Alejandro Curado Fuentes
University of Extremadura, Spain
pp. 106-129
Teaching German Modal Particles: A Corpus-Based Approach
Martina Mollering
Macquarie University, Sydney
pp. 130-151
The Emergence of Texture: An Analysis of the Functions of the Nominal Demonstratives in an English Interlanguage Corpus
Terry Murphy
Yonsei University, Seoul
pp. 152-173
A Case for Using a Parallel Corpus and Concordancer for Beginners of a Foreign Language
Elke St.John
University of Sheffield
pp. 174-184
Exploring Parallel Concordancing in English and Chinese
Wang Lixun
The Open University of Hong Kong
pp. 185-203
Call for Papers
Theme: Distance Learning
Corpora Research Bibliography
Contact: Editors or Web Production Editor
Copyright © 2001 Language Learning & Technology, ISSN 1094-3501.
Articles are copyrighted by their respective authors.
About Language Learning & Technology
Language Learning & Technology is a refereed journal which began publication in July 1997. The journal
seeks to disseminate research to foreign and second language educators in the U.S. and around the world
on issues related to technology and language education.
• Language Learning & Technology is sponsored and funded by the University of Hawai'i National
Foreign Language Resource Center (NFLRC) and the Michigan State University Center for
Language Education And Research (CLEAR), and is co-sponsored by Apprentissage des Langues
et Systèmes d'Information et de Communication (ALSIC), the Australian Technology Enhanced
Language Learning Consortium (ATELL), the Center for Applied Linguistics (CAL), the
Computer Assisted Language Instruction Consortium (CALICO), the European Association for
Computer Assisted Language Learning (EUROCALL), the International Association for
Language Learning Technology (IALLT), and the University of Minnesota Center for Advanced
Research on Language Acquisition (CARLA).
• Language Learning & Technology is a fully-refereed journal with an editorial board of scholars in
the fields of second language acquisition and computer-assisted language learning. The focus of
the publication is not technology per se, but rather issues related to language learning and
language teaching, and how they are affected or enhanced by the use of technologies.
• Language Learning & Technology is published exclusively on the World Wide Web. In this way,
the journal seeks to (a) reach a broad audience in a timely manner, (b) provide a multimedia
format which can more fully illustrate the technologies under discussion, and (c) provide
hypermedia links to related background information.
• Language Learning & Technology is currently published three times per year (January, May,
September).
Sponsors, Board, Editors, and Designers
Sponsoring Organizations
Sponsors
University of Hawai`i National Foreign Language Resource Center (NFLRC)
Michigan State University Center for Language Education and Research (CLEAR)
Co-Sponsors
Apprentissage des Langues et Systèmes d'Information et de Communication (ALSIC)
Australian Technology Enhanced Language Learning Consortium (ATELL)
Center for Advanced Research on Language Acquisition, University of Minnesota (CARLA)
Center for Applied Linguistics, Washington, DC (CAL)
Computer Assisted Language Instruction Consortium (CALICO)
European Association for Computer Assisted Language Learning (EUROCALL)
International Association for Language Learning Technology (IALLT)
Advisory and Editorial Boards
Advisory Board
Susan Gass | Michigan State University | gass@msu.edu
Richard Schmidt | University of Hawai`i | schmidt@hawaii.edu
Editorial Board
James D. Brown | University of Hawai`i at Manoa | brownj@hawaii.edu
Anna Uhl Chamot | The George Washington Univ. | auchamot@gwu.edu
Thierry Chanier | Université de Franche-Comte | thierry.chanier@univ-fcomte.fr
Carol Chapelle | Iowa State University | carolc@iastate.edu
Graham Crookes | University of Hawai`i at Manoa | crookes@hawaii.edu
Martha E. Crosby | University of Hawai`i at Manoa | crosby@ics.hawaii.edu
Graham Davies | Thames Valley University | grahamdavies1@compuserve.com
Robert Debski | University of Melbourne | robert@genesis.language.unimelb.edu.au
Robert Godwin-Jones | Virginia Commonwealth Univ. | rgjones@atlas.vcu.edu
Lucinda Hart-González | Univ. of MD, University College | lhart@umuc.edu
Joan Jamieson | Northern Arizona University | joan.jamieson@nau.edu
Batia Laufer | University of Haifa | batialau@research.haifa.ac.il
Allan Luke | University of Queensland | a.luke@mailbox.uq.edu.au
Mary Ann Lyman-Hager | San Diego State University | mlymanha@mail.sdsu.edu
Alison Mackey | Georgetown University | mackeya@gusun.georgetown.edu
Carla Meskill | SUNY-Albany | cmeskill@uamail.albany.edu
Denise Murray | San Jose State University | denise.murray@mq.edu.au
Noriko Nagata | University of San Francisco | nagatan@usfca.edu
David G. Novick | University of Texas at El Paso | novick@cs.utep.edu
Patricia Paulsell | Michigan State University | paulsell@msu.edu
Jill Pellettieri | CA State Univ., San Marcos | pjill@csusm.edu
Joy Kreeft Peyton | Center for Applied Linguistics, Washington, DC | joy@cal.org
Jenise Rowekamp | University of Minnesota | rowek001@tc.umn.edu
Rafael Salaberry | Rice University | salaberry@rice.edu
Larry Selinker | University of London | l.selinker@app-ling.book.ac.uk
Maggie Sokolik | University of Cal., Berkeley | sokolik@socrates.berkeley.edu
Seppo Tella | University of Helsinki | seppo.tella@helsinki.fi
Leo van Lier | Monterey Institute of International Studies | lvanlier@miis.edu
Yong Zhao | Michigan State University | zhaoyo@msu.edu
Editorial Staff
Editors
Mark Warschauer | University of CA, Irvine | markw@uci.edu
Dorothy Chun | University of CA, Santa Barbara | dchun@humanitas.ucsb.edu
Associate Editors
Irene Thompson | The George Washington University (Emerita) | napooka@aloha.net
Richard Kern | Univ. of CA, Berkeley | kern@socrates.berkeley.edu
Managing Editor
Pamela DaGrossa | University of Hawai`i | dagrossa@hawaii.edu
Web Production Editor
Dennie Hoopingarner | Michigan State University | hooping4@msu.edu
Book & Software Review Editor
Jennifer Leeman | George Mason University | leemanj@georgetown.edu
On the Net Editors
Jean LeLoup | SUNY at Cortland | leloupj@cortland.edu
Robert Ponterio | SUNY at Cortland | ponterior@cortland.edu
Emerging Technologies Editor
Robert Godwin-Jones | Virginia Commonwealth University | rgjones@atlas.vcu.edu
Copyeditors
Scott Armstrong | Harvard University | scott9@mediaone.net
Jan McNeil | National University of Singapore | janamerican@yahoo.com
Scott Petersen | Meitoku Junior College | rv5s-ptrs@asahi-net.or.jp
John Rylander | University of Hawai`i | rylander@hawaii.edu
Anthony Silva | Chaminade University | a.silva@att.net
The contents of this publication were developed under a grant from the Department of Education (CFDA 84.229, P229A6001296 and P229A6007). However, the contents do not necessarily represent the policy of the Department of Education, and one
should not assume endorsement by the Federal Government.
Information for Contributors
Language Learning & Technology is seeking submissions of previously unpublished manuscripts on any
topic related to the area of language learning and technology. Articles should be written so that they are
accessible to a broad audience of language educators, including those individuals who may not be
familiar with the particular subject matter addressed in the article. General guidelines are available for
reporting on both quantitative and qualitative research.
Manuscripts are being solicited in the following categories:
Articles | Commentaries | Reviews
Articles
Articles should report on original research or present an original framework that links previous research,
educational theory, and teaching practices. Full-length articles should be no more than 8,500 words in
length and should include an abstract of no more than 200 words. We encourage articles that take
advantage of the electronic format by including hypermedia links to multimedia material both within and
outside the article.
All article manuscripts submitted to Language Learning & Technology go through a two-step review
process.
Step 1: Internal Review. The editors of the journal first review each manuscript to see if it meets the
basic requirements for articles published in the journal (i.e., that it reports on original research or
presents an original framework linking previous research, educational theory, and teaching practices),
and that it is of sufficient quality to merit external review. Manuscripts which do not meet these
requirements or are principally descriptions of classroom practices or software are not sent out for
further review, and authors of these manuscripts are encouraged to submit their work elsewhere. This
internal review takes about 1-2 weeks. Following the internal review, authors are notified by e-mail as to
whether their manuscript has been sent out for external review or, if not, why not.
Step 2: External Review. Submissions which meet the basic requirements are then sent out for blind
peer review from 2-3 experts in the field, either from the journal's editorial board or from our larger list
of reviewers. This second review process takes 2-3 months. Following the external review, the authors
are sent copies of the external reviewers' comments and are notified as to the decision (accept as is,
accept pending changes, revise and resubmit, or reject).
Commentaries
Commentaries are short articles, usually no more than 2,000 words, discussing material previously
published in Language Learning & Technology or otherwise offering interesting opinions on theoretical
and research issues related to language learning and technology. Commentaries which comment on
previous articles should do so in a constructive fashion. Hypermedia links to additional information may
be included. Commentaries go through the same two-step review process described above for articles.
Submission Guidelines for Articles and Commentaries
Please list the names, institutions, e-mail addresses, and if applicable, World Wide Web addresses
(URLs), of all authors. Also include a brief biographical statement (maximum 50 words, in sentence
format) for each author. (This information will be temporarily removed when the articles are distributed
for blind review.)
Articles and commentaries can be transmitted in either of the following ways:
(a) By electronic mail, send the main document and any accompanying files (images, etc.) to
llt-editors@hawaii.edu
(b) By mail, send the material on a Macintosh or IBM diskette to the following address:
LLT
NFLRC
University of Hawai'i at Manoa
1859 East-West Road, #106
Honolulu, HI 96822
USA
Please check the General Policies below for additional guidelines.
Reviews
Language Learning & Technology publishes reviews of professional books, classroom texts, and
technological resources related to the use of technology in language learning, teaching, and testing.
Reviews should normally include references to published theory and research in SLA, CALL, pedagogy,
or other relevant disciplines. Reviewers are encouraged to incorporate images (e.g., screen shots or book
covers) and hypermedia links that provide additional information, as well as specific ideas for classroom
or research-oriented implementations.
Reviews of individual books or software are generally 1,200-1,600 words long, while comparative
reviews of multiple products may be 2,000 words or longer. They can be submitted in ASCII, Rich Text
Format, Word, or HTML. Accompanying images should be sent separately as jpeg or gif files. Reviews
should include the name, institutional affiliation, e-mail address, URL (if applicable), and a short
biographical statement (maximum 50 words) of the reviewer(s). In addition, the following information
should be included in a table at the beginning of the review:
Books
Author(s)
Title
Series (if applicable)
Publisher
City and country
Year of publication
Number of pages
Price
ISBN
Software
Title (including previous titles, if applicable) and
version number
Platform
Minimum hardware requirements
Publisher (with contact information)
Support offered
Target language
Target audience (type of user, level, etc.)
Price
ISBN (if applicable)
LLT does not accept unsolicited reviews. Contact Jennifer Leeman if you are interested in having
material reviewed or in serving as a reviewer (leemanj@georgetown.edu).
Jennifer Leeman
Dept. of Modern and Classical Languages
Mail Stop #3E5
George Mason University
Fairfax, VA 22030
General Policies
The following policies apply to all articles, reviews, and commentaries:
1. All submissions should conform to the requirements of the Publication Manual of the American
Psychological Association (4th edition). Authors are responsible for the accuracy of references and
citations, which must be in APA format.
2. Manuscripts that have already been published elsewhere or are being considered for publication
elsewhere are not eligible to be considered for publication in Language Learning & Technology. It is
the responsibility of the author to inform the editor of any similar work that is already published or
under consideration for publication elsewhere.
3. Authors of accepted manuscripts will assign to Language Learning & Technology the permanent
right to electronically distribute their article, but authors will retain copyright and, after the article
has appeared in Language Learning & Technology, authors may republish their text (in print and/or
electronic form) as long as they clearly acknowledge Language Learning & Technology as the
original publisher.
4. The editors of Language Learning & Technology reserve the right to make editorial changes in any
manuscript accepted for publication for the sake of style or clarity. Authors will be consulted only if
the changes are major.
5. Authors of published articles, commentaries, and reviews will receive 10 free hard-copy offprints of
their articles upon publication.
6. Articles and reviews may be submitted in the following formats:
(a) HTML files
(b) Microsoft Word documents
(c) RTF documents
(d) ASCII text
If a different format is required in order to better handle foreign language fonts, please consult with the
editors.
Language Learning & Technology
http://llt.msu.edu/vol5num3/from_the_editors.html
September 2001, Vol. 5, Num. 3
p. 1
From the Editors
This is a special issue of Language Learning & Technology on using corpora in
language teaching and learning. The Guest Editors, Christopher Tribble and Michael
Barlow, have written an Introduction to the issue.
In addition to the fine collection of articles and reviews in this issue, we are delighted to
announce the addition to the LLT site of a bibliography focused on language corpora.
This site is maintained by LLT and your contributions to it are welcome.
Although the journal is free and available to anyone with Internet access, subscriptions
are important. The information obtained through subscriptions allows us to demonstrate
to our funders the primary reason to continue supporting the journal, namely, our broad
readership. If you have not already done so, please take a moment to subscribe to the
journal. If you are already a subscriber, we appreciate your continued support and
welcome your feedback.
Finally, we are pleased to announce an upcoming special issue on Distance Learning, to
be guest edited by Margo Glew of Michigan State University. With the current rate at
which distance learning is being embraced around the world, we anticipate an exciting
issue and look forward to your contributions.
Mark Warschauer & Dorothy Chun
Editors
Pamela DaGrossa
Managing Editor
Language Learning & Technology
http://llt.msu.edu/vol5num3/from_the_spec_issue_ed.html
September 2001, Vol. 5, Num. 3
pp. 2-3
From the Special Issue Editors
This Special Issue of Language Learning and Technology has been in the making for
many months. We feel it has been worth the effort, and hope that our readers do, too. If
you've never used corpus tools in your teaching or learning, we hope that the Special
Issue inspires you to investigate further (the research bibliography that has been
launched with this special edition should be helpful to this end). If you have been
working with this kind of resource for some time, we are sure that you will find articles
here that will help you extend and deepen your understanding of the potential of corpora
and corpus tools.
The Articles
There are nine major articles in this edition of LLT -- making it one of the largest that
the Journal has produced -- and they cover four broad areas of interest to language
teachers and students. These concern the kinds of corpus that are most helpful for
language learning and teaching; practical applications of corpus resources in special
purposes teaching; using corpora in grammar teaching and language awareness raising;
and finally the value of parallel aligned corpora (multi-lingual resources which are
receiving growing interest in teaching and translation studies) in language learning and
teaching.
In the first section, Lee's piece on problems that can arise for teachers and researchers
who want to use the British National Corpus (BNC) is of particular relevance as his
account of the problematic area of genre offers a comprehensive guide to the topic. The
article is not uncontentious, however, as is made clear by Aston's response in which,
while valuing Lee's contribution, he also points out reasons why the BNC has been
structured as it is, and gives insights into how teachers can make fuller use of what it
offers.
Following this account of issues associated with one of the most important English
language corpora, Kennedy and Miceli discuss some of the ways in which language
learners can benefit from the investigative approaches which corpus use encourages in
language education, and Thompson and Tribble outline a practical application of corpus
research methods in helping learners gain mastery of a central skill in academic writing
-- citation. These two articles are followed by a further practically oriented paper in
which Curado demonstrates the value of corpus informed teaching and learning in ESP,
in particular in relation to vocabulary development.
The third section of the Special Issue considers matters more closely related to the
research/language teaching interface. Mollering's article on German modal particles
provides a very clear account of ways in which a corpus can be used in language
description. Murphy's paper on "emergent texture" demonstrates how a corpus based
approach can provide significant information about interlanguage development. Finally,
in section four there are two papers dealing with applications of parallel aligned corpora.
Wang's innovative piece shows that what might be considered a purely academic
resource can offer learners very real benefits, and St.John's article provides a neat
demonstration of the practical relevance of parallel corpus informed teaching with
beginner students of German.
The Columns
In On the Net, Jean LeLoup and Robert Ponterio provide guidance for "Finding Song
Lyrics Online," a wonderful way to bring authentic language materials into the
classroom for use in learning vocabulary, grammar, and topical information. And in
keeping with our Special Issue topic, Robert Godwin-Jones brings us information on
"Tools and Trends in Corpora Use for Teaching and Learning" in his Emerging
Technologies column.
The Journal's sponsors are key in publicizing and otherwise supporting the journal.
Please take a moment to find out what these organizations do and what they are contributing
to the field of language learning and technology under Announcements.
Jennifer Leeman, the Reviews Editor, brings us reviews of three books and one software
program this issue. John Lawler reviews Botley, McEnery, & Wilson's Multilingual
Corpora in Teaching and Research; József Horváth comments on Patterns and
Meanings: Using Corpora for English Language Research and Teaching by Alan
Partington; and Paul Thompson reviews Exploring Academic English: A Workbook for
Student Essay Writing. Finally, Randi Reppen appraises MonoConc Pro and WordSmith
Tools, software programs which are mentioned throughout this issue.
As editors, we have had the difficult task of selecting from a large number of
contributions -- an indication in itself of the growing interest in this area. However, we
have had wonderful support from the LLT team -- in particular Pamela DaGrossa,
Managing Editor, and, of course, the Journal's General Editor Mark Warschauer, so
many thanks to them. Also, we wish to thank the anonymous reviewers who have so
generously given their time and professional insight. We hope that they (and you) feel
that this special edition justifies their support.
Christopher Tribble
King's College London University (UK)
School of Linguistics and Applied Language Studies, Reading University (UK)
Michael Barlow
Rice University, Texas (USA)
Language Learning & Technology
http://llt.msu.edu/vol5num3/onthenet
September 2001, Vol. 5, Num. 3
pp. 4-6
ON THE NET
Finding Song Lyrics Online
Jean W. LeLoup
SUNY Cortland
Robert Ponterio
SUNY Cortland
Most foreign language teachers enjoy studying song lyrics as authentic text in their classes. Songs can be
used at all levels and for a wide variety of activities and purposes such as comprehension, vocabulary
introduction, illustration or recognition of grammar structures, and reinforcement of topics. Traditional or
new children's songs, musical classics, or the latest pop hits are all fair game. The rhythm and melody of
songs can make the words and expressions easier to remember and more enjoyable for students than other
sorts of texts. But providing written support for the lyrics can sometimes be a problem. Photocopying the
lyrics from the album cover might not meet the needs of a specific activity if some modification, such as
blanking out some words or adding definitions, is required. Retyping or transcribing the lyrics takes time
that the teacher might not be able to spare, though, of course, transcribing lyrics is a good listening
activity for us teachers as well as for our students. The Internet has become a useful source of song lyrics
that can be copied into a word processor and transformed into an activity for class use.
Sometimes these lyrics can be easy to find, but teachers often ask us for help locating songs that they have
searched for in vain. We will explore some of the kinds of sites where song lyrics may be found and
describe some techniques that can help teachers use WWW search engines to locate the lyrics to a
particular song more quickly.
When searching for song lyrics, one needs to think a bit differently from the way one might approach
searching for other kinds of information online. Many teachers begin by looking for a good Web site for
song lyrics. Although there are some sites that do present a selection of lyrics as a corpus, in most cases
this is not a productive search strategy because the songs are generally not collected in one place but
rather distributed around the Internet in millions of different sites.
Where can one find these songs? Record labels often have official Web sites for their artists that provide a
variety of information about their activities and usually add a "discography" and/or "lyrics" section that
might include song lyrics. This site for Patricia Kaas is managed by Sony Music:
http://www.sonymusic.fr/kaas/
Some companies seem to be very protective of their control of the lyrics and have even closed down
private sites that put lyrics online. Official and unofficial fan club sites sometimes duplicate or replace the
function of the record label in promoting the artist. For example, discography and lyrics pages for Mecano
can be found at the MecanoWeb site:
http://www.geocities.com/~mecanoweb/LETRAS.html
http://www.geocities.com/mecanoweb/DISCOGRAFIAmecano.html
Other private sites by individual music fans are another option, and these might be located anywhere in
the world.
http://members.es.tripod.de/Ananta/letras/mecano.htm
Many individuals might have a reason to include the text of a particular song on a Web page. If you need
the lyrics for all of the songs on an album, the most efficient search strategy will likely be different than if
you simply need to find a particular song.
There are many search engines for the Web whose results will be similar, so it is not necessary to use any
particular site. Some people have a preference for a certain search engine, and this is fine. A few favorites
are Altavista.com, Google.com, Snap.com, Yahoo.com, and Lycos.com. Many search engines allow the user
to specify sites in a particular language, but this is generally not useful as few Web sites bother to label
their language. So including a language in the search might even prevent finding the pages you need.
The most important technique when searching for songs is to use quotation marks to identify a string
of words that go together. "Twinkle, twinkle little star" should locate the title that we intend to find, but
without the quotation marks we might also find "The little star will twinkle brightly." Careful use of
quotation marks will eliminate false hits -- pages that match the search criteria even though they are not
what we want. The more false hits we get, the harder it is and the longer it takes to track down what we
really need. But quoting strings that are too long can have the opposite result if some small difference in
the text makes the string in the Web page slightly different from the search string. For example, if the title
in the Web page appears on two separate lines:
Twinkle, Twinkle
Little Star
our search might miss the very page we are looking for.
Every search is a matter of narrowing or widening the search parameters depending on whether we are
getting too many false hits or not enough good hits. Quoting strings tends to narrow the search, so use
fewer quotes if the search results seem too narrow, more quotes if the results seem too wide.
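For readers who like to see the mechanics, the short Python sketch below imitates this behavior in miniature. It is only an illustration under our own simplifying assumptions, not a description of how any real search engine works: the function names parse_query and matches are hypothetical, and matching is a plain case-insensitive substring test. Quoted strings are kept together as one term, while unquoted words are checked one at a time.

```python
import re

def parse_query(query):
    """Split a query into terms; text inside double quotes stays together."""
    terms = []
    # Each match is a (quoted phrase, bare word) pair; exactly one is non-empty.
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        terms.append((phrase or word).lower())
    return terms

def matches(page_text, query):
    """Toy matcher (an assumption): every term must occur somewhere in the page."""
    text = page_text.lower()
    return all(term in text for term in parse_query(query))

false_hit = "The little star will twinkle brightly."
two_line_title = "Twinkle, Twinkle\nLittle Star"

# Without quotation marks, the unrelated sentence also matches.
print(matches(false_hit, "Twinkle twinkle little star"))          # True
# With quotation marks, the false hit is eliminated.
print(matches(false_hit, '"Twinkle twinkle little star"'))        # False
# A long quoted string misses the title that is split over two lines...
print(matches(two_line_title, '"Twinkle, twinkle little star"'))  # False
# ...while two shorter quoted strings still find it.
print(matches(two_line_title, '"Twinkle, twinkle" "little star"'))  # True
```

The last two lines show the trade-off in practice: shortening the quoted strings widens the search just enough to recover the page with the two-line title.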
But just what should we be searching for? A problem for many novice Web searchers is that they begin
by searching for words that identify the topic rather than words that will appear on the pages they hope to
find. For example, very few pages of song lyrics include the word "lyrics," so do not use the word "lyrics"
in the search for the words of a particular song. However, the word "lyrics" might be effective in looking
for a collection of lyrics of many songs. Of course, a page in Spanish will probably use the word "letras"
rather than "lyrics," so don't forget to consider the various possibilities in the languages that you use.
Most song lyrics pages include the name of the artist and the song title, but not all of them do. In addition,
the artist name and song title are the elements most likely to be present in some fancy format that might
prevent the search engine from seeing them correctly.
Clearly, the words that will always be on any page containing the lyrics of a song are the words of the
song itself, and these are invariably the most effective search parameters. The words "Twinkle, twinkle
little star" are in the song but are also the title, so that search will bring up many pages that include only
the titles of songs and not the lyrics. The search string "how I wonder what you are" will be more likely to
find only pages with the lyrics of the song. Be sure to consider how common an expression is when
selecting search criteria. For instance, "what you are" is a string that we can expect to find in many
contexts other than this song. Less common expressions from the song will be more effective: "above the
world," "diamond in the sky." Some songs also have different versions whose lyrics may vary. This is
something to consider depending on whether one is looking for a particular version or all versions of a
song.
A search for
Twinkle "how I wonder" "above the world" diamond
is likely to locate the pages we want very effectively. The addition of words from other stanzas might
help us eliminate pages that only include the first stanza. In short, the best search strategy is to include
only words and short phrases that must appear in the pages we hope to find.
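A search line of this kind can also be assembled mechanically. The following Python sketch is offered as one possible approach rather than a required tool. It quotes only the multi-word fragments and leaves single words bare. The helper name build_query is our own hypothetical choice, while quote_plus comes from Python's standard library and prepares the string for use in a Web address.

```python
from urllib.parse import quote_plus

def build_query(fragments):
    """Join lyric fragments into one search string, quoting multi-word ones."""
    terms = []
    for fragment in fragments:
        fragment = fragment.strip()
        if not fragment:
            continue  # skip empty entries
        # A fragment containing a space is a phrase, so keep it together with quotes.
        terms.append(f'"{fragment}"' if " " in fragment else fragment)
    return " ".join(terms)

query = build_query(["Twinkle", "how I wonder", "above the world", "diamond"])
print(query)              # the search line as typed into a search box
print(quote_plus(query))  # the same line encoded for a URL parameter
```

Keeping the fragments in a list makes it easy to add or drop a phrase when the results turn out to be too wide or too narrow.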
To locate sites that provide the lyrics of many songs -- for example, all the songs on a particular album --
a different approach is required. In this case one might find either a page with a list of song titles and links
to the words of each song, or a long page with the lyrics of many songs. In the first case a search for a
couple of titles might work; in the second case, expressions from the lyrics of several songs will be more
effective. The problem with searching for titles is that far too many pages will be found that list titles
without providing the lyrics. In this case, adding the search term "lyrics" or an appropriate substitute in
the targeted language might help.
One example of a useful collection of lyrics is the "comptines" page of the "Premiers pas sur Internet" site
for French children: http://www.momes.net/comptines/index.html, including the words and often the
music for hundreds of children's songs.
There are times, though, when the list of titles can be of use. Some sites that sell CDs online also provide
audio excerpts of individual songs. This can be a useful tool for the language teacher in search of new
music in the target language, especially for teachers who do not often get to travel to countries where the
language is spoken.
A caveat: once you find the lyrics, check them out carefully before using them. Many Web pages contain
errors and misspellings. The lyrics on many pages will require corrections before they are shared with
students. Copy and paste them into your favorite word processor; read them carefully while listening to
the song, and use the spell check.
Now that we've discovered how to find these lyrics, what can we do with them in the FL classroom? As
was previously noted, many FL teachers like to use songs as authentic materials in their curriculum.
Songs can be used in a variety of ways for FL instruction. A search of the FLTEACH archives from
January 1, 1999 to the present using the keywords "song lyrics" yields at least 146 hits, ranging from
postings that are requesting aid in finding lyrics and using them to detailed messages describing grammar
and other language lessons that are enhanced by the use of songs and their lyrics. One example of a lesson
that uses lyrics for literacy in the L2 was profiled in a previous column: Literacy: Reading on the Net. A
sample message by Kathy White from the FLTEACH archives offers nearly 40 suggestions for activities
using music and songs in the FL classroom.
Another FLTEACH post by Claudia Irigoin offers song activities from a workshop presentation given in
Argentina. The purpose of the workshop was to help teachers motivate students in writing in English (an
L2 there). You might also wish to expand repeated portions of songs to make it easier for students to
follow along. For listening comprehension, some words or phrases may be replaced by underlining to
allow students to fill in the blanks: a cloze task. Definitions or translations of phrases may be added in the
margins or footers. Grammatical elements may be highlighted. The text, thus modified, can become a
useful tool for language study.
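The cloze task described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name make_cloze, the choice to blank every n-th word, and the blank marker are all hypothetical, and a teacher would more likely pick the deleted words by hand to target particular vocabulary or grammar.

```python
import re

def make_cloze(text, every=5, blank="_____"):
    """Replace every `every`-th word with a blank; return (cloze_text, answers)."""
    answers = []
    count = 0

    def blank_out(match):
        nonlocal count
        count += 1
        if count % every == 0:
            answers.append(match.group(0))  # keep the removed word for the answer key
            return blank
        return match.group(0)

    # \w+(?:'\w+)* matches a word and keeps internal apostrophes (e.g. "don't")
    cloze = re.sub(r"\w+(?:'\w+)*", blank_out, text)
    return cloze, answers

lyric = "Twinkle, twinkle, little star, how I wonder what you are"
cloze_text, answer_key = make_cloze(lyric, every=4)
print(cloze_text)   # Twinkle, twinkle, little _____, how I wonder _____ you are
print(answer_key)   # ['star', 'what']
```

Punctuation and line breaks are left untouched, so the modified lyrics keep the layout of the original text.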
Using songs is a wonderful way to make the target language accessible to language learners. It is a
universal medium, and speaks volumes about cultural origin, language patterns, and usage. The power
that songs contain is underscored by George Jellinek (WQXR-FM): "The history of a people is found in
its songs." On a more basic level, music and songs are simply the stuff that life is made of: "Give me a
laundry list and I'll set it to music" (Gioacchino Antonio Rossini).
Language Learning & Technology
http://llt.msu.edu/vol5num3/emerging/
September 2001, Vol. 5, Num. 3
pp. 7-12
EMERGING TECHNOLOGIES
Tools and Trends in Corpora Use for Teaching and Learning
Bob Godwin-Jones
Virginia Commonwealth University
INTRODUCTION
Language corpora have long been exploited for language instruction. Vocabulary lists for learners, for
example, have been generated from corpora, and word counts derived from corpus analysis have helped
in defining goals for vocabulary acquisition. Dictionary and textbook creators have used corpora
extensively. In recent years, the move to the use of authentic language materials in language pedagogy
has enhanced the role collections of spoken or written language can play in language learning. Corpora
are, after all, huge storehouses of real language use. The interest in languages for special purposes further
favors the use of corpora, as a means to identify the specific language components to be taught.
Technology enhancements have made corpora more widely available, as well as provided more powerful
tools for their use. In particular, the Internet is playing a steadily growing role in the dissemination of
corpora and corpus-based teaching materials. Corpora are no longer the exclusive domain of
lexicographers and computational linguists.
ACCESS TO CORPORA
Corpora are of interest today to professionals in a wide variety of fields, from ethnologists to
telecommunication conglomerates. Creating a language corpus is a major undertaking, both time-consuming and expensive. This is all the more the case for collections which include multiple languages
and/or audio/video recordings. Given the cost and the growing interest, it makes little sense for corpora
not to be made widely accessible. In fact, there have been a large number of corpora in many different
languages which have become available over the Internet in the last few years. Good starting points for
finding them are Michael Barlow's Corpus Linguistics page, the Linguistic Exploration page (at the LDC
- Linguistic Data Consortium) or the Tractor page (the "Telri Research Archive of Computational Tools
and Resources"). These pages in many cases link to direct corpus access, including a number of parallel
corpora of particular interest in translation studies and language learning for specific purposes. There are
as well a substantial number of text collections of literary works in a variety of languages. Some include
comprehension aids and annotations for use in language learning.
As the number of language archives grows, locating the specific resources needed for a project will
become more problematic. One can only go so far with lists of Web links (even when annotated) or
traditional Web searching. There is a recently launched international project, the Open Language
Archives Community (OLAC), to build an infrastructure linking language archives of all types together.
OLAC builds on the Open Archives Initiative and on the Dublin Core Metadata Initiative. The Dublin
Core project began in 1995 to develop conventions for resource searching on the Web. OLAC uses the
core 15 elements of the Dublin Core and extends them through the use of qualifiers to fit the needs of the
language community. The use of a controlled vocabulary of descriptors should allow more efficient
searching of archives.
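As a rough illustration of how a fixed set of descriptors supports searching, the sketch below builds resource descriptions from the 15 Dublin Core elements and filters them by one element. The "record" wrapper element, the helper names make_record and find_by, and the two sample resources are hypothetical, and OLAC's own qualifiers are not modeled here.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

def make_record(**fields):
    """Build a <record> (hypothetical wrapper) holding Dublin Core elements."""
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    record = ET.Element("record")
    for name in DC_ELEMENTS:
        if name in fields:
            ET.SubElement(record, f"{{{DC_NS}}}{name}").text = fields[name]
    return record

def find_by(records, element, value):
    """Return the records whose given Dublin Core element equals `value`."""
    tag = f"{{{DC_NS}}}{element}"
    return [r for r in records if any(e.text == value for e in r.findall(tag))]

# Invented sample descriptions, for illustration only.
records = [
    make_record(title="Sample German newspaper corpus", language="de", type="Text"),
    make_record(title="Sample Spanish interview recordings", language="es", type="Sound"),
]

german = find_by(records, "language", "de")
print([r.find(f"{{{DC_NS}}}title").text for r in german])
```

Because every archive would describe its holdings with the same element names, one query function can search all of them, which is the benefit the controlled vocabulary is meant to deliver.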
The consistent use of meta-data in language resources is likely to become of growing importance in the
language community. There has not been a standard way to include information about a resource, such as
the participants in an interview included in a corpus (e.g., age, nationality, first language, and education).
Such information is typically included in a "header" which is either part of the resource file itself or stored
separately. Because of the different ways such meta-information has been stored, there has been a
proliferation of tools and approaches for the user to access that information. It would be very helpful for
both researchers and users to have a common approach to resource description, not only for corpora, but
for all language resources such as text collections, lexicons, grammar tutorials, multimedia files, Web
lessons, and so forth. This would in turn facilitate the development of universal tools.
ENCODING AND ANNOTATION
Standardization, or at least inter-operability, is needed not only in resource description but, of course, also
in the encoding and annotation of language resources. Increasingly, corpus creators have moved from
proprietary systems to standards-based approaches. Given the effort behind corpus creation and the
longevity of most corpora, the challenge is to design an environment which is adaptable over time as
technologies evolve. It also needs to be flexible enough to be extensible to include classification
categories for which a need may arise in the future.
In recent years the Text Encoding Initiative (TEI) has provided a standard used by a large number of
language and literature resources. TEI uses SGML ("standard generalized markup language") and provides
for an extensive header containing meta-data. The header information is included within the annotation
files. While the most widespread use has been in the encoding of literary texts, there is also an extensive
list of projects using TEI encoding in corpora in a variety of languages. The TEI standard is part of the
Corpus Encoding Standard (CES) proposed by the EAGLES group ("Expert Advisory Group on
Language Engineering Standards"). CES specifies a minimal encoding level to be "standard" and
provides encoding specifications for linguistic annotation. TEI is also used in the MATE project
("Multilevel Annotation Tools Engineering"), designed for the encoding and annotation of spoken
dialogue corpora. The TEI standard, however, has some drawbacks as well. SGML is highly complex (as
experienced by anyone having tried to decipher the intricacies of the TEI header), and SGML documents
are not directly accessible from standard Web browsers. While the TEI is extensible, customizing it for an
individual project is a daunting enterprise. Some projects have done so, such as the BBAW digital
dictionary of German, adding custom headers in separate files.
In fact, there are a number of advantages to a "stand-off" data architecture in which the annotations and
meta-data are stored in separate files from the data itself. This allows for considerable flexibility in adding
and changing the annotation categories and information as needed, without having to revise the data files
themselves. The encoding system that lends itself the best to doing that is XML ("extensible markup
language"), the widely acclaimed successor to HTML and slimmed-down version of SGML. Recent Web
browsers have native support for XML documents, but more importantly, there are standards and methods
for transforming XML documents on the fly into a variety of formats. For a corpus, annotation can be
stored in separate XML documents from the data itself, which are linked in hypertext to the documents.
XML enables such linking to be one-way or two-way, useful for parallel corpora. A number of the most
recent corpus projects are beginning to use XML, which in fact is being supported by the EAGLES group
in an XML version of CES (XCES). There is also an XML version of TEI forthcoming.
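The stand-off idea can be shown with a small sketch. It assumes annotations that point into an unchanging primary text by character offsets; the article describes separate XML files linked to the data, for which the in-memory Python objects here are only a stand-in. The Annotation class, the layer names, and the tags are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    start: int   # offset of the first character of the span
    end: int     # offset one past the last character
    layer: str   # name of the annotation layer, e.g. "pos"
    label: str   # the tag assigned to the span

# Primary data: never modified once the offsets have been assigned.
text = "Corpora are storehouses of real language use."

# A part-of-speech layer, stored apart from the text.
annotations = [
    Annotation(0, 7, "pos", "NOUN"),
    Annotation(8, 11, "pos", "VERB"),
    Annotation(12, 23, "pos", "NOUN"),
]

# A second layer added later; neither the text nor the first layer changes.
annotations.append(Annotation(27, 44, "phrase", "NP"))

def resolve(text, annotations, layer):
    """Pair each annotation in one layer with the stretch of text it covers."""
    return [(text[a.start:a.end], a.label) for a in annotations if a.layer == layer]

print(resolve(text, annotations, "pos"))
print(resolve(text, annotations, "phrase"))
```

Adding or revising a layer only touches the annotation list, which is the flexibility the stand-off architecture is meant to provide.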
One of the advantages of XML is that there need not be uniformity in the precise tags used, as long as
there is an available description of each tag. Through XSLT ("extensible stylesheet language
transformations"), information from XML documents can be retrieved and reformatted in a variety of
ways, providing a powerful means for delivering data to a variety of users and browsers. Of course, a
common data model for language resources would make it much easier to standardize access. Points
(discrete objects) and spans (strings of objects) must be identified and tagged, with a common level of
granularity (i.e., detail), and a means provided of identifying structure, class membership, and inheritance.
There have been several large-scale projects, such as Tipster, to provide such a data model. The Atlas
project also aims to provide an extensible architecture for linguistic annotation, through use of an
"annotation graph model".
RETRIEVAL TOOLS
A common (or at least exchangeable) data model would facilitate the use and development of tools for
corpus extraction. In the past, new tools were often developed for the processing of each new corpus
created. New projects needed to budget time and money not only for data collection but also for creating an
encoding/annotation system as well as a set of tools for accessing the data. Many of these tailor-made
systems replicated functionality available elsewhere but not useable due to differences in software,
platform or data architecture. Fortunately, there are tool projects underway which are designed with
reusability as a major goal. They tend to use a modular, building blocks approach, rather than a
monolithic all-or-nothing design, allowing for more flexible use as well as future extensibility. Among
such projects is GATE ("General Architecture for Text Engineering"), which sets as its goal a set of
infrastructure tools for natural language processing which can accommodate models written for a variety
of programming and scripting languages. The Multext project, similarly, encompasses a series of projects
whose goals are to develop standards and specifications for the encoding of corpora and to develop tools
and resources using these standards. Multext projects are underway in at least 18 different languages.
One of the other positive developments in the area of tools is the Natural Language Software Registry
(NLSR), which collects and makes available over the Web detailed information on a wide variety of
natural language processing software, including annotation tools (taggers, parsers), speech analysis,
machine learning, evaluation tools, corpus analysis, translation, etc. The fourth edition of the NLSR
provides for both browsing and searching, using a taxonomy based on the "State of the Art in Language
Technology", edited by G.B. Varile and A. Zampolli. Many of the tools listed in the Registry are Internet-based, which is increasingly the case in tool creation. Most use Web forms to provide an access interface,
as in sample collections in French, German, Spanish, Chinese, or Japanese. An interesting approach is
provided by a service from the University of Leeds, which accepts email to amalgamtagger@comp.leeds.ac.uk containing English text, which is then parsed and tagged for parts of speech and
sent back by email.
OUTLOOK ON LANGUAGE LEARNING
One of the more frequently used tools in working with corpora for language learning is the concordance. A
concordance is an alphabetical listing of words in a text or collection of texts, together with the contexts
in which they appear. Typically concordances are in KWIC format ("key word in context") in which each
word is centered in a fixed field, and each occurrence of the word is listed on a separate line. Good
concordancers do more than simply index words to lines; they can sort in a variety of ways, search for
collocations, and produce extensive statistics. Concordances have been used extensively in literary studies
and stylistic analysis, but less frequently in language learning. An extensive linguistic corpus is a gold
mine of authentic language use and mining that through KWIC concordances can provide students with
multiple contexts from which to learn new vocabulary. An interesting example of this use of
concordances is providing contextual help in the reading of second-language texts. This approach seems
to work best when students try computer-aided contextual inferences first (through the concordance)
which can then be confirmed through on-line dictionary access. Concordances can also be very useful in
providing assessment items. Cloze exercises, for example, can easily be generated from KWIC
concordances.
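A minimal KWIC concordancer along the lines described above might look like the following sketch. It assumes plain, untagged text and a fixed context window measured in characters; the function names kwic and format_kwic and the sample sentence are hypothetical, and the concordancers discussed in this column do considerably more (collocation search, statistics).

```python
import re

def kwic(text, keyword, width=30):
    """Return (left context, key word, right context) for each occurrence."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits

def format_kwic(hits, width=30):
    """Right-align the left context so the key word starts in the same column."""
    return [f"{left:>{width}}{key}{right}" for left, key, right in hits]

sample = ("The star is bright. How I wonder what you are, little star. "
          "Star light, star bright.")

hits = kwic(sample, "star")
# Sort by what follows the key word, one of the orderings a concordancer offers.
for line in format_kwic(sorted(hits, key=lambda hit: hit[2].lower())):
    print(line)
```

Each printed line shows one occurrence with the key word in a fixed column, and blanking that column out would turn the listing into a set of cloze items.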
Corpora, of course, can provide much more than just lexical information; they are invaluable in supplying
syntactical examples. One of the caveats in using corpora in this way is that for the most part corpora
have been created for research purposes, rather than for language learners and as a consequence may not
supply the needed information. Not all corpora, for example, are annotated for syntactic functions. Most
of the parallel corpora available are restricted to narrow, often technical, language uses, thus making them
less useful for contrastive analysis or translation studies. Such corpora can on the other hand be
invaluable in language learning for special purposes. There have been experiments using syntactically
annotated corpora in providing grammar help for learners. The Cytor project at the University of
Lancaster showed interesting results in providing students access to concordances, which led to
improvement in their categorization of part-of-speech distinctions. This kind of activity provides a means
of putting research tools into the hands of students, and working towards shifting some of the
responsibility for learning on to their shoulders.
An area of significant interest to language educators is collections of recorded speech preserved as audio
or video. This adds an entirely new dimension to corpora, with the addition of gestures, intonation, and
facial expressions, but also adds a challenge in terms of encoding and annotation. There are several
projects underway to help in establishing standards for such resources. The ISLE Meta Data Initiative is
seeking to create a standard for meta-data description of multimedia language resources. The Talkbank
project is an interdisciplinary project hosted by Carnegie Mellon University to provide standards and
tools for human (and animal) communication. EUDICO ("European Distributed Corpora Project"), from
the Max Planck Institute, is looking at ways to categorize and search collections of annotations on digital
video and audio recordings.
One of the corpus needs for developers of CALL applications is for collections of non-native speech.
Large corpora of transcribed speech data from language learners, for example, could be very useful in
efforts to improve the understanding of the speech patterns of language learners necessary for interactive
voice applications. There are databases of telephone speech available (from LDC) in a variety of
languages. The European Science Foundation Second Language Data Bank consists of data obtained over
a 3-year period for adult migrant workers in five European countries with a focus on language learning in
the absence of formal instruction. Clearly, creating such non-native language collections is a huge task,
complicated by the fact that there should be separate databases for different kinds of non-native speakers
(according to country of origin, amount and nature of language exposure, nature of need for language
ability, etc.). The needs of the telecommunication industry for reliable voice-based applications might be
helpful in finding funding for such large-scale projects. It would be useful as well to have a corpus of
email messages, from both native and non-native speakers, to provide a basis for evaluating the transformation of
language through technology, and how that might affect language teaching and learning.
A significant impediment in the use of corpora in teaching and learning is the form in which most corpora
are stored. Most are annotated in SGML and housed on large Unix servers. In most cases, it is not practical
to store such large amounts of data locally. Thus access is provided remotely, which may present
performance issues. The other barrier, of course, is the proliferation of different formats for accessing
corpora and the bewildering array of tools available. The growth in Web access to corpora and tools is
helpful, but often the interfaces are poorly designed. The corpus linguistics community has recognized
this issue, as well as the need for greater consideration of teaching needs in corpus design, and the
situation looks likely to improve in the future.
RESOURCE LIST
General Corpus Information
• Language Software Helpdesk from the Language Technology Group (Edinburgh)
• Corpora List archive in Hypermail: excellent source of up-to-date info on corpora
• Multilingual Theory & Technology from Xerox
• Corpus Linguistics: Michael Barlow's extensive listing
Corpora Access
• projects using the TEI
• English Language Corpora and Corpus resources from the British National Corpus
• Corpora, Text Resources: good list from Kita Lab (Japan)
• CobuildDirect Corpus Access Information: commercial site with trial access available
• TRACTOR Network of multilingual resources: corpora in multiple languages listed
• COMPARA: Portuguese-English parallel translation corpus
• COSMAS: access to the Mannheim corpus of German
• LAPT&DA: access to special vocabulary lexica in German (Erlangen)
• Digital Dictionary of the 20th Century German: BBAW project
• Archives for Language Documentation and Description from the University of Pennsylvania
• Linguistic Exploration: list of resources from the University of Pennsylvania
• Web EuroWordNet Interface: access to multilingual lexical knowledge bases
• European Literature - Electronic Texts: comprehensive listing
Standards and Projects
• Open Language Archives Community
• TEI: Text Encoding Initiative
• MATE: Multilevel Annotation Tools Engineering
• The GATE project: ambitious project for building an NLP infrastructure (Sheffield)
• XML from the W3C (World Wide Web Consortium)
• XSLT from the W3C (World Wide Web Consortium)
• EAGLES: Expert Advisory Group on Language Engineering Standards
• The XML Cover Pages - Home Page: excellent resource list by Robin Cover
• Multext: large-scale corpora and tools project from the Centre National de la Recherche Scientifique (France)
• Talkbank: multimedia database project from Carnegie Mellon University
• Synchronized Multimedia Integration Language from the W3C
• Survey of the State of the Art in Human Language Technology
• EAGLES/ISLE Meta Data Initiative
• Corpus Encoding Standard: part of the EAGLES initiative
• XCES: XML version of CES
• Tipster: main site
• Tipster Architecture: info
• ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation
• EUDICO: European Distributed Corpora Project
Corpus Retrieval Tools
• Concordancers: FTP downloads
• TACT (Text Analysis Computing Tools): DOS Concordancer from the University of Toronto
• LTG Software: tools for text processing (including XML) from Edinburgh
• On-line corpus analysis: Web-based concordance generator (in German) for texts in French, Italian and Spanish
• Software Tools for NLP: list from Kita Lab (Japan)
• NLSR: Natural Language Software Registry
• CRATER: tools and resources for multilingual corpus work
Teaching and Learning
• Teaching and Language Corpora: article in ReCALL by T. McEnery and A. Wilson (PDF)
• Tutorial: Concordances and Corpora, a Web-based introduction by Catherine Ball (Georgetown)
• Corpora in the Teaching of Languages and Linguistics
• Can the rate of lexical acquisition from reading be increased? A case study in concordance use in reading
• Pruebas de PHP-KWIC: Web-based concordance generator for Spanish texts (in Spanish)
• Corpus of Historical and Modern Spanish: Web-based access to large Spanish corpus (from Mark Davies)
• VLC Web Concordancer: search options in Chinese, English, French, Japanese, as well as parallel texts
Language Learning & Technology
http://llt.msu.edu/vol5num3/announcements/
September 2001, Vol. 5, Num. 3
pp. 13-18
NEWS FROM SPONSORING ORGANIZATIONS
This page includes announcements from the organizations sponsoring LLT.
University of Hawai'i National Foreign Language Resource
Center (NFLRC)
Less commonly taught languages, particularly those of Asia and the Pacific, are the focus of the
University of Hawai'i National Foreign Language Resource Center, which engages in research and
materials development projects and conducts Summer Institutes for language professionals among its
many activities.
PACIFIC SECOND LANGUAGE RESEARCH FORUM (PacSLRF) 2001
The NFLRC is pleased to co-sponsor the upcoming PacSLRF 2001
Conference, which will be held at the Imin Conference Center on the
University of Hawai'i at Manoa campus October 4-7, 2001. This
international conference will focus on the acquisition of second languages
in instructed and naturalistic settings, particularly in East Asian, Southeast
Asian, and Pacific languages. Questions? Contact us at pacslrf@hawaii.edu.
NEW PUBLICATIONS FROM THE UH NFLRC
• A Focus on Language Test Development: Expanding the Language Proficiency Construct Across a
Variety of Tests by T. Hudson & J. D. Brown (Eds.). This volume presents eight research studies
which introduce a variety of novel, non-traditional forms of second and foreign language assessment.
To the extent possible, the studies also show the entire test development process, warts and all. These
language testing projects not only demonstrate many of the types of problems that test developers run
into in the real world but also afford the reader unique insights into the language test development
process.
• Motivation and Second Language Learning by Z. Dörnyei & R. Schmidt (Eds.). This volume, the
second in this series concerned with motivation and foreign language learning, includes papers
presented in a state-of-the-art colloquium on L2 motivation at the American Association for Applied
Linguistics Conference (Vancouver, 2000) and a number of specially commissioned studies. The 20
chapters, written by some of the best known researchers in the field, cover a wide range of theoretical
and research methodological issues, and also offer empirical results (both qualitative and quantitative)
concerning the learning of many different languages (Arabic, Chinese, English, Filipino, French,
German, Hindi, Italian, Japanese, Russian, and Spanish) in a broad range of learning contexts
(Bahrain, Brazil, Canada, Egypt, Finland, Hungary, Ireland, Israel, Japan, Spain, and the US).
Additions have also been made to the NFLRC NetWorks collection of online publications. Check out our
other publications at http://www.LLL.hawaii.edu/nflrc/publication.html.
Copyright 2001, ISSN 1094-3501
Michigan State University Center for Language Education
and Research (CLEAR)
CLEAR’s mission is to promote foreign language education in the United States. To meet its goals,
CLEAR’s projects focus on foreign language research, materials development, and teacher training.
FOREIGN LANGUAGE RESEARCH
• Acquisition of Prosody by English-Speaking Learners of French
• Feedback and Interaction
• Longitudinal Analysis of Foreign Language Writing Development
MATERIALS DEVELOPMENT
Products
• Business Chinese (CD-ROM)
• Modules for Assessing Socio-Cultural Competence for German (CD-ROM)
• Pronunciación y fonética (CD-ROM)
• African Language Tutorial Guide (guide and video)
• Foreign Languages: Doors to Opportunity (video and discussion guide)
• Task-based Communicative Grammar Activities for Japanese and Thai (workbook)
• Test Development (workbook and video)
• The Internet Sourcebook for Business German
• Business Language Packets for High School Classrooms (French, German, and Spanish)
Coming Soon!
• Portuguese Pronunciation and Phonetics CD-ROM
• Modules for Assessing Socio-Cultural Competence for Russian (CD-ROM)
• Thai Tutorial Guide
• The Internet Sourcebook for Business Spanish
Game-O-Matic
The Game-O-Matic is a suite of wizards that create Web-based activities for language learning and
practice. Teachers can make original Game-O-Matic games by visiting http://clear.msu.edu/dennie/matic/.
Have a new idea for a Game-O-Matic activity? Contact Dennie Hoopingarner at hooping4@msu.edu.
Newsletter
CLEAR News is a biyearly publication covering FL teaching techniques, research, and materials. Contact
the CLEAR office to join the mailing list or see it on the Web at http://clear.msu.edu/clearnews/.
TEACHER TRAINING
Summer Workshops
Every summer, CLEAR offers teacher development workshops for foreign language educators to help
strengthen and expand their teaching skills. CLEAR offers stipends to help defray the workshop fees and
travel/accommodation expenses. For more information, see CLEAR’s Web site at http://clear.msu.edu.
For more information, contact
Center for Language Education And Research (CLEAR)
A712 Wells Hall
Michigan State University
East Lansing, MI 48824-1027
Phone: 517/432-2286
Fax: 517/432-0473
Email: clear@msu.edu
Apprentissage des Langues et Systèmes d'Information et
de Communication (ALSIC)
ALSIC (Language Learning and Information and Communication Systems) is an electronic journal in
French for researchers and practitioners in fields related to applied linguistics, didactics,
psycholinguistics, educational sciences, computational linguistics, and computer science. The journal
gives priority to papers from the French-speaking community and/or in French, but it also regularly
invites papers in other languages so as to strengthen scientific and technical exchanges between linguistic
communities that too often remain separate. The editorial board of ALSIC invites you to contact them for
any prospective contributions at the following electronic address: alsic@lifc.univ-fcomte.fr.
The Australian Technology Enhanced Language Learning
Consortium (ATELL)
Contacts:
Dr. Mike Levy, The University of Queensland (mlevy@lingua.arts.uq.edu.au)
Prof. Roly Sussex, The University of Queensland (sussex@lingua.arts.uq.edu.au)
ATELL is an informal collaboration of Australian language teachers involved in technology-enhanced
language learning and teaching. It has recently been moved to The University of Queensland, where Dr.
Mike Levy and Professor Roly Sussex are developing the concept in collaboration with Mr. Greg
Dabelstein, coordinator of the CALL special interest group of the Australian Federation of Modern Language
Teachers Associations (AFMLTA). We intend to establish a network of complementary and
collaborating resources for teachers and learners in the TELL domain in schools and tertiary institutions.
There will be a Web site, which will include information, collaboration, and resources such as
• a register of Australian TELL experts
• links to other sites with TELL-related information and materials
• links to reviews of hardware, software, courseware
• a section for FAQs (Frequently Asked Questions)
• what's new -- ideas, research, materials
• a register of projects, current and past, in TELL research, development, implementation
• software modules, libraries, and related resources for developers
• audio and video files for language learning support
• policies and discussion
• special interest groups
In addition, we are reviving the ATELL mailing list, whose e-mail location is atell@lingua.arts.uq.edu.au.
ATELL is supported by the Language Laboratory at the University of Queensland.
Center for Advanced Research on Language Acquisition,
University of Minnesota (CARLA)
CARLA is one of nine National Language Resource Centers whose role is to improve the nation's
capacity to teach and learn foreign languages effectively. Launched in 1993 with funding from the
national Title VI Language Resource Center program of the U.S. Department of Education, CARLA's
mission is to study multilingualism and multiculturalism, to develop knowledge of second language
acquisition, and to advance the quality of second language teaching, learning, and assessment by
conducting research and action projects; sharing research-based and other forms of knowledge across
disciplines and education systems; and extending, exchanging, and applying this knowledge in the wider
society.
CARLA's research and action initiatives include a focus on the articulation of language instruction,
content-based language teaching through technology, culture and language studies, less commonly taught
languages, language immersion education, second language assessment, second language learning
strategies, and technology and second language learning.
To share its latest research and program opportunities with language teachers around the country,
CARLA offers the following resources: a summer institute program for teachers; a database which lists
where less commonly taught languages are taught throughout the country; listservs for teachers of less
commonly taught languages and immersion educators; a working paper series; conferences and
workshops; and a battery of instruments in French, German, and Spanish for assessing learners'
proficiency in reading, writing, speaking, and listening at the intermediate-low level on the ACTFL scale.
Check out these and other CARLA resources on the CARLA Web site at http://carla.acad.umn.edu.
The Center for Applied Linguistics (CAL)
The Center for Applied Linguistics is a private, nonprofit organization that promotes and improves the
teaching and learning of languages, identifies and solves problems related to language and culture, and
serves as a resource for information about language and culture. CAL carries out a wide range of
activities in the fields of English as a second language, foreign languages, cultural education, and
linguistics. These activities include research, teacher education, information dissemination, instructional
design, conference planning, technical assistance, program evaluation, and policy analysis. Publications
include books on language education, online databases of language programs and assessments, curricula,
research reports, teacher training materials, and print and online newsletters.
Major CAL projects include the following:
• ERIC Clearinghouse on Languages and Linguistics
• National Clearinghouse for ESL Literacy Education
• Refugee Service Center
• Pre-K-12 School Services
CAL collaborates with other language education organizations on the following projects:
• Center for Research on Education, Diversity & Excellence
• Improving Foreign Languages in the Schools Project of the Northeast and Island Regional Laboratory
at Brown University
• National Capital Language Resource Center
• National K-12 Foreign Language Resource Center
• National Network for Early Language Learning
News from the ERIC Clearinghouse on Languages and Linguistics
• ERIC/CLL’s quarterly online newsletter, ERIC/CLL Language Link, covers current topics in
language education. Recent articles in Language Link include a review of the 2000 US Census and
its implications for language educators, CoBaLTT (computer-assisted language learning), profiles of
effective Early Foreign Language Programs, and a Language Policy update.
• Recent ERIC/CLL Digests cover a range of topics in ESL, foreign language, and bilingual education
including our newest Digest, Lexical Approach to Second Language Teaching.
News from the National Center for ESL Literacy Education
Facts and Statistics Related to Adult ESL provides links to resources that NCLE most often consults for
statistics on adult ESL and the populations served by adult ESL programs.
The latest NCLE Digest, Reflective Teaching Practice in Adult ESL Settings, offers the adult ESL
practitioner background information and step-by-step suggestions for using reflective processes as a tool
for professional development.
Computer Assisted Language Instruction Consortium
(CALICO)
Since its inception in 1983, CALICO has served as an international forum for language teachers who
want to develop and utilize the potential of advanced technology to support their teaching and research
needs. Through its Annual Symposia, Special Interest Groups (SIGs), CALICO Journal, CALICO
Monograph Series, CALICO Resource Guide, and numerous other publications, CALICO provides both
leadership and perspective in the ever-changing field of computer-assisted instruction. The strength of
CALICO derives from the enthusiasm, creativity, and diversity of its members. It comprises language
teachers and researchers from universities, military academies, community colleges, K-12 schools,
government agencies, and commercial enterprises. To learn more about CALICO activities and how to
participate in them, visit the CALICO homepage at http://www.calico.org.
European Association for Computer Assisted Language
Learning (EUROCALL)
EUROCALL is an association of language teaching professionals from Europe and worldwide aiming to
• Promote the use of foreign languages within Europe
• Provide a European focus for all aspects of the use of technology for language learning
• Enhance the quality, dissemination, and efficiency of CALL materials
EUROCALL's journal, ReCALL, published by Cambridge University Press, is one of the leading
academic journals covering research into computer-assisted and technology-enhanced language learning.
The association organises special interest meetings and annual conferences, and works towards the
exploitation of electronic communications systems for language learning. For those involved in education
and training, EUROCALL provides information and advice on all aspects of the use of technology for
language learning.
Forthcoming EUROCALL conferences
• EUROCALL 2001 will be at the University of Nijmegen, The Netherlands, 30 August to 1 September
2001.
• EUROCALL 2002 will be at the University of Jyväskylä, Finland, 14 - 17 August 2002.
For full details, contact us at http://www.eurocall.org.
International Association for Language Learning Technology
(IALLT)
Established in 1965, IALLT (formerly IALL) is a professional organization whose members provide
leadership in the development, integration, evaluation, and management of instructional technology for
the teaching and learning of language, literature, and culture. Its strong sense of community promotes the
sharing of expertise in a variety of educational contexts. Members include directors and staff of language
labs, resource or media centers, language teachers at all levels, developers and vendors of hardware and
software, grant project developers and others. IALLT offers biennial conferences, regional groups and
meetings, the LLTI listserv (Language Learning Technology International), and key publications such as
the IALL Journal, the IALL Language Center Design Kit, and the IALL Lab Management Manual. The
2003 IALLT conference will be held at the University of Michigan, June 17 - 21. For information, visit
the IALLT Web site at www.iallt.org/.
Language Learning & Technology
http://llt.msu.edu/vol5num3/review1/
September 2001, Vol. 5, Num. 3
pp. 19-23
REVIEW OF
MULTILINGUAL CORPORA IN TEACHING AND RESEARCH
Multilingual Corpora in Teaching and Research
(From the series Language and Computers: Studies in Practical Linguistics, No. 22)
Simon P. Botley, Anthony M. McEnery, and Andrew Wilson, Eds.
2000
ISBN: 90-420-0541-6
US $19.00 (Paperback)
208 + vi
Editions Rodopi B.V.
Amsterdam (Netherlands) and Atlanta, GA (USA)
Reviewed by John M. Lawler, University of Michigan.
Multilingual corpora are those consisting of texts in more than one language, often a monolingual original
and a translation. These translations vary greatly in their faithfulness, accuracy, style, and order of
presentation, as well as in granularity of translation, that is, the size of the chunks being translated (e.g.,
word-to-word, sentence-to-sentence, paragraph-to-paragraph, or idea-to-idea). Since the reasons for
constructing multilingual corpora include being able to correlate individual pieces of one text with
corresponding parts of another, their use immediately raises the problem of text alignment, or computing
which chunk of a text in one language corresponds to a given chunk of the parallel text in another
language.
This is the major focus of Multilingual Corpora in Teaching and Research. Indeed, this book could more
accurately have been titled Text Alignment in Multilingual Corpora: Overview and Case Studies. Text
alignment, it quickly becomes clear, is the outstanding problem in research on multilingual corpora, and
thus -- to the extent that progress has been made in its solution -- its outstanding success story. The
problems that arise in alignment research reprise practically every issue in Natural Language Processing
(NLP) and Automatic Translation (e.g., sentence division, anaphor tracking, ambiguity resolution), and
the peculiar limitations of the alignment task make the application of alignment strategies to these broader
problems surprisingly productive, as is discussed in detail in this volume.
Multilingual Corpora consists of two introductory chapters, covering theoretical and methodological
issues, the literature, and the state of the art (up to early 1998), as well as 10 individual case studies, each
describing an existing corpus project, 2 in the US and the rest in Europe. All the case studies except the
last (on problems aligning English and Chinese texts) deal strictly with Indo-European languages
(Danish, English, French, German, Greek, Italian, Norwegian, Spanish, and Swedish) and most of the
corpora discussed contain texts in just two languages.
Chapter 1, "Bilingual Text Alignment -- An Overview," by Michael Oakes and Tony McEnery (one of the
editors) of Lancaster University, is typical of recent work in CL/NLP in that it distinguishes sharply
between statistical and linguistic methods of text alignment. As these authors put it (p. 4) "Statistical
methods tend to work better for large corpora, since they are relatively rapid, while linguistic methods can
be better for small corpora." The vast majority of the article is a survey of the statistical methods used in
various alignment projects, including formulae and discussion of results, although three varieties of
linguistic techniques are also covered. This disparity reflects the simple fact that statistically-based NLP
has been far more successful overall than linguistically-based approaches, especially in tasks involving
corpora (see Bayer, Aberdeen, Burger, Hirschman, Palmer, and Vilain [1998] and Hoard [1998] for
discussion).
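To give a feel for what a purely statistical aligner does, the following is a minimal sketch in Python of sentence alignment that looks only at sentence lengths. It is not taken from the book or from any of the systems the chapter surveys: the function name `align_by_length`, the relative-length cost, and the flat `skip_penalty` are all assumptions made for illustration, and real aligners use more carefully motivated probability models and allow more move types (such as 2-1 and 1-2).

```python
def align_by_length(src, tgt, skip_penalty=1.0):
    """Align two lists of sentences using only their character lengths.

    Allowed moves: 1-1 (pair two sentences), 1-0 and 0-1 (leave one
    sentence unpaired). Returns a list of (src_index, tgt_index) tuples,
    with None marking the missing side of an unpaired sentence.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of src[:i] with tgt[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Illustrative cost (an assumption): relative length difference.
                a, b = len(src[i - 1]), len(tgt[j - 1])
                c = cost[i - 1][j - 1] + abs(a - b) / max(a, b, 1)
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            if i > 0:
                # Leave a source sentence unpaired.
                c = cost[i - 1][j] + skip_penalty
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j)
            if j > 0:
                # Leave a target sentence unpaired.
                c = cost[i][j - 1] + skip_penalty
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i, j - 1)

    # Walk the back-pointers from the end to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((i - 1, j - 1))
        elif pi == i - 1:
            pairs.append((i - 1, None))
        else:
            pairs.append((None, j - 1))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Because the sketch never inspects the words themselves, it is fast and language-independent, which is the appeal of statistical methods for large corpora; it is also blind to content, which is where linguistic methods can help on smaller ones.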
Chapter 2, "Bilingual Text Alignment: Where Do We Draw the Line?" by Michel Simard, George Foster,
Marie-Louise Hannan, Elliott Macklovitch, and Pierre Plamondon of Canada's Centre d'Innovation en
Technologies de l'Information, takes up the question of granularity in the context of Isabelle's (1993)
concept of Translation Analysis (TA), that is, "the reconstruction of the correspondences between
segments of a source text and segments of its translation" (p. 39), a principled approach to alignment.
Before concluding on a generally sanguine note, they discuss three alignment programs at different
granularity levels: JACAL (Just Another Cognate ALignment program), a character-level program;
Salign, a sentence-level program that can be used in conjunction with JACAL (though it need not be); and
TMAlign, a lexical-level alignment program.
Chapter 3, "Corpus and Terminology: Software for the Translation Program at Göteborgs Universitet, or
Getting Students to Do the Work," by Pernilla Danielsson and Daniel Ridings, deals with a suite of
programs developed for training translators. This is one of the most obvious educational uses of
multilingual corpora; the software described here is designed to be used by future translators to pick out
"terminology" (i.e., technical terms that may be unfamiliar outside a particular specialty) in context, and
create their own personal terminology bank for future use, in the process learning a great deal about
translation. It is built from more or less off-the-shelf software (i.e., Microsoft Access) and is seen to be
robust, simple, and easy to use, as well as meeting the needs of students.
Chapter 4, "Parallel and Comparable Bilingual Corpora in Language Teaching and Learning," by Carol
Peters, Eugenio Picchi, and Lisa Biagini of Istituto di Linguistica Computazionale in Pisa, discusses the
interesting distinction between parallel corpora, or "translationally equivalent texts," and comparable
corpora, for which they adopt Laffling's (1992) description: "texts which, though composed
independently in their respective language communities, have the same communicative function."
PiSystem DBT, an Italian/English bilingual text query program implemented for language learners, is used
to highlight these issues in this chapter. A demo version is available on the Web at
http://www.ilc.pi.cnr.it/pisystem/demo/demo_dbt/demo_bilingui/index.htm (this is a different URL from
the one given in the book, which now returns an error message). As expected, analyses of comparable
corpora are more difficult and pose unique problems. Thus, the implementation discussed is still
experimental.
In chapter 5, "Using Authentic Corpora and Language Tools for Adult-Centred Learning," Renée Meyer,
Mary Ellen Okurowski, and Thérèse Hand of New Mexico State University explore an application,
OLEADA (not an acronym, but rather the Spanish word for "tidal wave"), developed at NMSU. OLEADA
is a complete learning environment, integrating "three language technologies: on-line text corpora,
information retrieval, and language analysis tools. A single user interface allows seamless access to the
texts and tools in ten languages" (p. 87). This short chapter doesn't go into design or performance
specifics, but rather concentrates on the varying uses of OLEADA's three customer groups: language
training developers, classroom developers, and independent students.
Chapter 6, "Teaching Terminology Using Electronic Resources," by Jennifer Pearson of Dublin City
University, is concerned, like Chapter 3, with an application designed to help future translators experience
and learn to handle real use of technical jargon and phrases of art in a realistic context. This is an
extremely interesting chapter, with many examples of terminological variation, and especially of culture-specific terms for which there are usually no good equivalents.
Chapter 7, "Parallel Texts in Language Teaching," by Michael Barlow of Rice University, shows how
even a simple concordance program (ParaConc, a simple parallel version of Barlow's MonoConc,
reviewed in this issue and by Lawler, 2000) can be of great use to teachers and students for exploring the
wide variety of ways in which a single word or phrase gets translated, especially as part of an idiomatic or
metaphoric expression. The result, as anyone who's spent enough time with a good bilingual dictionary
can attest, can be eye-opening.
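The core idea of a parallel concordance can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how ParaConc itself works: the function name `parallel_concordance` is hypothetical, the corpus is assumed to be already aligned into (source, translation) sentence pairs, and the English-French sentences in the usage note are invented.

```python
import re


def parallel_concordance(aligned_pairs, query):
    """Return the aligned pairs whose source side contains the query.

    aligned_pairs: iterable of (source_sentence, target_sentence) tuples.
    query: a word or phrase, matched case-insensitively as whole words.
    """
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [(s, t) for s, t in aligned_pairs if pattern.search(s)]
```

Given invented pairs such as ("She made up her mind.", "Elle s'est décidée.") and ("He made up a story.", "Il a inventé une histoire."), a search for "made up" returns both, and the learner sees at a glance that the same English phrase has two quite different translations.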
David Woolls of Birmingham University extends this concept in a different direction in Chapter 8, "From
Purity to Pragmatism; User-Driven Development of a Multilingual Parallel Concordancer." The software
involved, part of the European Union's LINGUA project, produces various types of concordances over
parallel texts in Danish, English, French, German, Greek, and Italian. Rather than focusing on its usage
and applications, the chapter is a developmental history of the program, from initial specifications through
iterative cycles of construction, testing, and revision of the corpus and the various software tools
associated with it, and the inevitable problems that arose at each stage, and how they were handled -- generally by downsizing expectations. This is an article that can be read with sympathy and profit by
anyone involved in large-scale distributed development schemes.
Chapter 9, "The English-Norwegian Parallel Corpus: Current Work and New Directions," by Stig
Johansson and Knut Hofland of the University of Oslo, is a progress report on an ongoing project, with
sections on its uses and recent multilingual extensions to French and German parallel corpora. Of
particular linguistic interest are the extensive discussions, with examples, of the occurrence of the
Norwegian modals skal (p. 135) and nok (p. 137); modals are often problematic, but examples like this
can help understand something of their vagaries. The section on multilingual extensions is highlighted by
an equally extensive and equally interesting discussion of cleft sentences ("That's what I meant," and its
ilk) and other clausal anaphora, and their translated equivalents; any syntactician reading this section
would yearn for such a tool. This is a good example of how corpus linguistics can inform theoretical
linguistics, as well as language learning.
Chapter 10, "Unlocking the power of the SMEMUC," by Raphael Salkie, of the University of Brighton,
coins what the author admits is an "ugly acronym" for Small and MEdium-sized MUltilingual Corpus. He
argues that such corpora are "a good way forward for those of us who want to take corpora out of the
computer laboratory and into the hands of teachers, students, and language researchers," (p. 148) and goes
on to describe the step-by-step development and subsequent pedagogic uses of INTERSECT, a French-English parallel corpus massaged to fit the needs of ParaConc (discussed in Chapter 7). His conclusion is
one that is easy to agree with: "Sometime in the future, when today's computers seem like little toys and
the Internet is fast and freely available, large multilingual corpora will be available for everyone. For
now, it is corpora like INTERSECT which can take a lead in convincing linguists, language teachers and
translators that multilingual corpora have a lot to offer them" (p. 156).
Chapter 11, "Corpus-Based Contrastive Lexicography: The Case of English with and its German
Translation Equivalents," by Josef Schmied and Barbara Fink of the University of Chemnitz, focuses on
the use of a bilingual parallel corpus to research the syntax and semantics of the preposition with, in all its
uses and collocations. The lexicographic results are the stars here, while the software plays a supporting
role; this is a good example of the kind of research that would have been impossible even to conceive of,
let alone carry out, before the advent of aligned multilingual corpora. It will be of interest not only to
computational linguists, but also to translators, semanticists, lexicographers, and language teachers.
Finally, Chapter 12, "Parallel Alignment in English and Chinese," by Tony McEnery, Scott Piao, and Xu
Xin of the University of Lancaster, addresses the challenges for multilingual parallel corpus research
posed by non-European and non-Indo-European languages. Many new methods are still needed, and so
far the work is largely experimental and the results rather sketchy. Nevertheless, the authors produce a
useful discussion of the problems they encountered and report on one alignment method, based on bivariate
distribution, that they tried out on a sample corpus. They conclude, "Aligning languages which are
not genetically related is a challenge for computational linguists, and may well stretch the 'language
independence' claim of some current alignment algorithms to the breaking point." The chapter includes an
appendix containing a short set of tags that were used in the alignment task.
Overall, this is a really interesting book for a linguist to read. All the articles are well-written and
accessible at any level of knowledge about corpora (although readers of chapters 1 and 12 might benefit
from familiarity with Oakes, 1998), and the problems encountered are diverse and challenging enough to
engage anyone with an interest in language. This would serve nicely as a source of additional readings for
courses in corpus linguistics, translation theory, or software design, as well as being a source of
good ideas and warnings about potential pitfalls for corpus and software designers themselves.
For such a useful book, though, it is a shame that the index is so sparse, consisting of only seven pages,
each of which is mostly white space, with one 12-character-wide column on either side. The index could
have been printed in three pages with more appropriate use of space, especially when one considers that
the entry for "standard error," a cross-reference to the immediately preceding entry for "standard
deviation" on page 207, takes up an entire quarter-page (see Figure 1).
Figure 1. Index entries
Indexes are hard to make, and good quality control is often outside the reach even of editors, but a well-made index repays an editor's labor in the form of usefulness for readers. There are a few other
infelicities; in addition to the ones remarked on in Dash (2001), such as the absence of Section 3.1.1
mentioned on page 179, I might add the running head for chapter 7, which renames the chapter to
"Parallel texts in English teaching."
But all these are very minor matters; this is a really good book, worth its price and bound to be useful for
a long time to come.
ABOUT THE REVIEWER
John Lawler, Associate Professor of Linguistics at the University of Michigan, Ann Arbor, former chair
of the LSA Computer Committee, and software author (MONOSYL, A World of Words, The
Chomskybot), has published on topics including metaphor, Acehnese syntax, generic reference, second-language learning, English syntax and semantics, negation and logic, sound symbolism, UNIX, and
popular English usage, and has consulted on software development for industry and academia.
E-mail: jlawler@umich.edu
REFERENCES
Bayer, S., Aberdeen, J., Burger, J., Hirschman, L., Palmer, D., & Vilain, M. (1998). Theoretical and
computational linguistics: Toward a mutual understanding. In J. Lawler & H. Dry (Eds.), Using
Computers in Linguistics (pp. 231-255). New York: Routledge. A chapter overview is available on the
Web: http://www.routledge.com/linguistics/introduction.html#chapter.8.
Dash, N. S. (2001). Review of Botley, McEnery, & Wilson (2000), Multilingual Corpora in Teaching
and Research. LINGUIST, 11(2537). Retrieved June 1, 2001 from the World Wide Web:
http://linguistlist.org/issues/11/11-2537.html.
Hoard, J. E. (1998). Language Understanding and the Emerging Alignment of Linguistics and Natural
Language Processing. In J. Lawler & H. Dry (Eds.), Using Computers in Linguistics (pp. 197-230). New
York: Routledge. A chapter overview is available on the Web:
http://www.routledge.com/linguistics/introduction.html#chapter.7.
Laffling, J. (1992). On Constructing a Transfer Dictionary for Man and Machine. Target 4(1), 17-31.
Lawler, J. M. (2000). Review of MonoConc Pro 2.0 Concordancing Software. LINGUIST, 11(1411).
Retrieved June 1, 2001 from the World Wide Web: http://linguistlist.org/issues/11/11-1411.html.
Oakes, M. (1998). Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.
Language Learning & Technology
http://llt.msu.edu/vol5num3/review2/
September 2001, Vol. 5, Num. 3
pp. 24-27
REVIEW OF PATTERNS AND MEANINGS: USING CORPORA FOR
ENGLISH LANGUAGE RESEARCH AND TEACHING
Patterns and Meanings: Using Corpora for English Language
Research and Teaching
Alan Partington
Studies in Corpus Linguistics
Elena Tognini-Bonelli, series editor
1998
ISBN 1 55619 396 3
US $ 27.95 (paperback)
163 + vii pp.
John Benjamins Publishing Company
Amsterdam, The Netherlands
Reviewed by József Horváth, University of Pécs
Patterns and Meanings: Using Corpora for English Language Research and Teaching, Partington's slim
but thorough monograph, is a welcome contribution to the field of corpus linguistics. It illustrates how
using computer corpora in the study of language phenomena can enhance the internal validity and
reliability of linguistic findings. The volume (the second in the Studies in Corpus Linguistics series)
represents an example of the uses of corpora for practical purposes, following, in part, the paradigm
established by Johns (1991) and Leech (1997): to exploit corpora for language teaching and learning.
Partington, together with colleagues, assembled an unannotated corpus of 5 million words of journalistic
texts for the case studies at the University of Bologna. The English component of the corpus was derived
from The Independent, The Telegraph and The Times, with what the author calls the "sister" subcorpus
accessed from an Italian broadsheet, Il Sole 24 Ore. Two heavy-weight concordancers, Microconcord
(Scott & Johns, 1993) and WordSmith Tools (Scott, 1996, reviewed in this issue), were used for the
analyses.
The eight chapters of Patterns and Meanings report on what the author labels case studies, addressing
different levels or aspects of language use. The coverage is broad: discussions of collocation, translation,
connotation, syntax, cohesion, metaphor, and phraseology signify the main stages of the effort, each of
which combines "language description with suggestions for pedagogical application" (p. 1). The style and
presentation are superb, with only a few slips and minor typos (such as the one on p. 63, "the number of
wholly reliable true friends … is probable fewer than is … imagined"). The author ties in his observations
with a useful and clearly presented review of the literature, which presents some contrasting views.
Further, the reader is given a concrete description of the methods, procedures, and techniques applied in
the analyses. In the concluding sections of most chapters, Partington charts directions for further study,
and frequently offers useful and original tips for both teachers and students in corpus linguistics courses.
As the intended audience of the volume includes newcomers to the field of corpus linguistics, the
Introduction defines most basic terms and issues, referring mainly to studies from the 1990s. Special
emphasis is given to areas where the application of corpus linguistics in language pedagogy can plug the
gap that some practitioners perceive between theory and practice, and between teaching and learning. It
includes a relevant description of the data-driven learning (DDL) approach as well as details about the
corpus for the studies and the methods applied. To ensure that the following chapters are accessible to all
readers, there is also a brief illustrative section on the keyword-in-context (KWIC) concordance output, a
description of the sorting features of concordance programs, and a technical how-to for dealing with
corpus files on a computer.
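For readers who have not seen KWIC output, the following Python sketch shows the basic mechanics of producing and sorting concordance lines. It is a simplified illustration, not the behaviour of MicroConcord or WordSmith Tools: the function names `kwic`, `sort_by_right_context`, and `format_line` are hypothetical, and the context is measured in characters rather than words as an assumption to keep the code short.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left_context, match, right_context) for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


def sort_by_right_context(lines):
    """Order the lines alphabetically by what follows the keyword."""
    return sorted(lines, key=lambda line: line[2].strip().lower())


def format_line(line, width=30):
    """Lay out one line with the keyword in a fixed central column."""
    left, match, right = line
    return f"{left:>{width}}  {match}  {right}"
```

Sorting on the right-hand context groups together the words that most often follow the keyword, which is what makes recurrent patterns visible when the lines are printed one under another.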
In the first three chapters ("Collocation and Phrase Patterns," "Collocation and Synonymy," and "True
and False Friends") the focus is on lexical issues. The author gives a splendid introduction to how
concordance samples can enhance our understanding of denotation and contrasts data from the corpus
with information from dictionaries to highlight the interaction of collocation, text types, and stylistic
variations. The definitions are solid, and the examples carefully selected. Furthermore, principles
underlying the phenomena are always explained clearly, assisting the reader in discovering the
significance of the findings. Especially revealing is the study of collocation and synonymy (chapter 2), in
which one of the problems that many EFL learners and professional translators face is tackled: choosing
among seemingly similar vocabulary items. Partington provides a detailed study of the collocates of the
adjectives sheer, pure, complete, and absolute, as he investigates the many different lexical choice
patterns, making this part of the book a commendable resource aiding translation theory and practice.
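The kind of collocate count that underlies such a comparison can be approximated with a short Python sketch. It is offered only as an illustration of the general technique, not as Partington's procedure: the function name `collocates`, the simple tokenizer, and the symmetric word window are assumptions, and the window here ignores sentence boundaries.

```python
import re
from collections import Counter


def collocates(text, node, window=1):
    """Count the words occurring within `window` tokens of `node`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts
```

Running the function once for each near-synonym over the same corpus and comparing the resulting `most_common()` lists gives a rough view of how far the collocational ranges of the adjectives overlap and where they diverge.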
Although the approach and the findings of chapter 2 are valid, I have difficulty seeing the relevance to
Partington's claim in the conclusion that a thesaurus "is positively dangerous for the non-native speaker."
For one thing, the use of the term "non-native speaker" is problematic. We all are native speakers of one
language or another. The author may be alluding to the dichotomy of the L1 and L2 speaker, and the point
he is making appears to be that, because of the intricacies of collocation and synonymy, the use of a
thesaurus may result in strange, non-native-like language. Does this suggest, then, that Partington would
not sanction the use of the thesaurus in any EFL course? Whether or not Partington would go this far, the
claim appears to be based on a limited view of both the foreign language learner and the pedagogical
context of using a thesaurus. Students can learn how to use a thesaurus for specific purposes, the same
way as they learn to use a dictionary -- either traditional or corpus-based. Also, suggesting that
thesauruses represent a danger raises the issue of equal linguistic rights, to which learners are entitled as
much as native speakers. It would be interesting to conduct an empirical comparison of the naturalness of
the writing of learners who used a thesaurus and those who did not for a given writing task. In addition,
learner use of a thesaurus during DDL work may result in improvement in range and appropriateness of
vocabulary, making this another area worthy of empirical research.
Chapter 4 continues the exploration of the corpus by examining connotation in terms of semantic
prosody: It investigates the connotational significance of lexis. Definitions and examples were extracted
from 10 dictionaries, both traditional and corpus-based, so that the corpus findings could be contrasted
with how the lexical and connotational features of set in, peddle, and dealings are presented in the two
groups of dictionaries. Partington notes that even current non-learner dictionaries have little place for
information of this kind (p. 72) and suggests that cross-linguistic prosodic differences require further
study, which will be especially beneficial in terms of raising translators' awareness of them.
Chapters 5 and 6 ("Syntax" and "Cohesion in Text") serve two purposes: first, to identify further features
of patterns and meanings; second, to demonstrate that with a corpus one can go beyond the lexical domain
and look at other chunks of text. It is also here that the author makes his theoretical position explicit: He
belongs to the school that investigates "the interface between lexis and syntax" (p. 79). Partington refers
to Francis's claim that the lexical and the syntactic domains are mutually dependent on each other: "It is
impossible to look at one independently of the other …. The interdependence of syntax and lexis is such
that they are ultimately inseparable" (1993, p. 147). Partington's analysis reveals that what is taught about
conditionals, for example, is not always what the corpus attests. At one point (p. 84), he suggests that
when students can review and analyze a large number of concordance citations for "If," they may realize
that what underlies the syntax of conditionals is best viewed as a model, rather than a constraint. By
analyzing the corpus, Partington reveals and groups conditional and non-conditional dependencies in if
constructions, and suggests that similar investigations could be carried out on other conditional markers
and subordinators. He also states that this DDL approach can help students "more clearly understand the
distinctions highlighted in grammars and textbooks" (p. 87). Unfortunately, however, there are no
concrete tips on the format and content of this procedure, although many readers would have found such a
practical element interesting.
"Metaphor" and "'Unusuality'" come toward the end of the book (chapters 7 and 8). The former applies
frequency and concordance data drawn from the business journalism section of the English corpus (about
800,000 words from The Independent and The Times); the latter undertakes to highlight not the typical,
but the figurative, in language. In chapter 7, the author first provides a succinct summary of three theories
of metaphor, and then analyzes dead and dying metaphors, metaphorical intent, collocation, and fossilized
collocations. In chapter 8, he presents unusual newspaper headlines from five sections of The
Independent: home news, international news, arts, business, and sports. Clearly, headline language is text
that many EFL students will skim. Here we get a scanning of this sample: The focus is on preconstructed
word strings (proverbs, quotations, expressions, and the like). The list of headlines assembled is a rich
resource: examples such as "Prints Charming," "Sail of the Quincentenary," "You Could Hear A
Superlative Drop," and "Industrial Resolution" are just four of the scores of examples that have been
classified and interpreted by the author (and his associates). In addition to presenting examples of
collocational patterns, Partington shares his view of the sociolinguistic and psycholinguistic nature of
these journalistic chunks.
The concluding chapter addresses some of the limitations of corpus-based studies, with Partington
synthesizing the most common critiques leveled against the approach. Issues discussed include the
difficulty of establishing external validity of corpus-based studies, which results from the fact that any
findings can be interpreted only within the context of the given corpus. It seems, however, that the
theoretical dilemmas of representativity would have been better placed, and more thoroughly analyzed, in
an earlier section, before the case studies. This would have helped readers new to the field keep the
limitation in mind. That Partington chose to feature this subject at the end of the book suggests that he
had found no easy answer to a question of corpus linguistics: Why bother analyzing any corpus, however
large, if what is found can be claimed to characterize only one single corpus? For the time being, one
needs to be content with exploring general and specialized corpora assembled using clear principles and
be cautious in drawing conclusions from such studies (Clear, 1992).
Overall, Partington's richly illustrated studies, the relevance of research questions to language pedagogy,
and the new knowledge that this volume offers, especially about collocation, synonymy, phraseology, and
unusual language, make Patterns and Meanings a very well focused and engaging read, one that has
already found its way into several corpus linguistics courses worldwide -- rightfully so.
ABOUT THE REVIEWER
József Horváth holds a PhD in Applied Linguistics from the University of Pécs (Hungary). He has
developed the JPU Corpus, a collection of over 400,000 words of Hungarian EFL students' writing. He
teaches Writing and Research Skills, Corpus Linguistics, and Translation Studies courses at the
Department of English Applied Linguistics, University of Pécs.
E-mail: joe@btk.pte.hu
REFERENCES
Clear, J. (1992). Corpus sampling. In G. Leitner (Ed.), New directions in English language corpora:
Methodology, results, software development (pp. 21-31). Berlin: Mouton de Gruyter.
Francis, G. (1993). A corpus-driven approach to grammar: Principles, methods and examples. In M.
Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology: In honour of John Sinclair (pp.
137-156). Amsterdam: John Benjamins.
Johns, T. (1991). Should you be persuaded: Two examples of data-driven learning. ELR Journal, 4, 1-16.
Leech, G. (1997). Teaching and language corpora: A convergence. In A. Wichmann, S. Fligelstone, T.
McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 2-23). London: Longman.
Scott, M. (1996). Wordsmith tools [Computer software]. Oxford, UK: Oxford University Press.
Scott, M., & Johns, T. (1993). Microconcord [Computer software]. Oxford, UK: Oxford University Press.
Language Learning & Technology
http://llt.msu.edu/vol5num3/review3/
September 2001, Vol. 5, Num. 3
pp. 28-31
REVIEW OF EXPLORING ACADEMIC ENGLISH
Exploring Academic English: A Workbook for Student
Essay Writing
Jennifer Thurstun and Christopher Candlin
1997
ISBN 1-864083-74-3
AU $23.95
144 pp.
NCELTR
Sydney, Australia
Reviewed by Paul Thompson, University of Reading
Exploring Academic English is an innovative concordance-based workbook for use either in an English
for Academic Purposes (EAP) writing class or for independent learning. What makes it innovative is that
it is the first workbook to utilize corpus study methods to systematically introduce and explore the use of
certain words to perform rhetorical functions in academic written English. Thus, it should be of interest to
both native and non-native speakers of English, who have at least intermediate proficiency in English, and
who are preparing to enter, or already have entered, tertiary education.
There are two ways that linguistic corpora can be exploited for pedagogical purposes (Partington, 1998):
teachers can either analyse corpora for material/syllabus design (Flowerdew, 1993), or they can train
students to use corpora directly. The latter use is designed to promote what Tim Johns has described as
data-driven learning, or DDL (see Note 1). Exploring Academic English offers an interesting combination of both
methods. The authors have used a specialized corpus of academic English which they first analyzed in
order to determine the syllabus of the book, and they have also presented selected output from the same
corpus as data for learner activities in which the learner acts as language researcher.
Exploring Academic English is a methodical and clearly presented workbook. Each of its six units deals
with a "rhetorical function," as follows:
• stating the topic
• referring to the literature
• reporting the research of others
• discussing processes undertaken in the study
• expressing opinions tentatively
• drawing conclusions
For each function, three or four lexical items are focussed on, with each unit following the same four-stage path. Firstly, in the "Look" stage, a set of concordance lines is presented, sorted by the first word to
the right of the search term. In the case of analysis, for example, this means that all the concordance lines
containing "analysis of" are placed together, and they appear after "…any analysis must …" (see Figure
1). As concordances can be difficult to read for first-time readers (key word in context, or KWIC,
concordances are incomplete sentences), the learner is advised not to try to understand every word, but
rather, to concentrate on the words around the search term.
Figure 1. KWIC concordance of analysis
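The "Look" stage layout described above can be illustrated with a minimal Python sketch. It is not the Microconcord procedure used by the authors; the function names (`kwic`, `sort_by_first_right`, `show`), the four-token context span, and the sample sentences are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, span=4):
    """Return (left, hit, right) token lists for every occurrence of keyword."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines

def sort_by_first_right(lines):
    """Sort concordance lines alphabetically by the first word to the right."""
    return sorted(lines, key=lambda ln: ln[2][0].lower() if ln[2] else "")

def show(lines, width=30):
    """Print the lines with the search term aligned in a central column."""
    for left, hit, right in lines:
        print(f"{' '.join(left):>{width}}  {hit}  {' '.join(right)}")

# Invented sample text, not taken from the workbook's corpus.
sample = ("Any analysis must begin with the data. An analysis of the results "
          "follows. The analysis of variance was repeated. "
          "Further analysis is needed.")
concordance = sort_by_first_right(kwic(sample, "analysis"))
show(concordance)
```

With this sorting, the lines containing "analysis of" are grouped together and come after the "analysis must" line, which is the arrangement the workbook relies on to make patterns visible.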
In the second, or "Familiarize," stage, students are given a set of tasks related to these concordance lines,
in which they identify lexical patterns around the key word: which prepositions follow the word, and in
what contexts; which words commonly precede the word; and so on. They are also asked to decide which
of a number of suggested senses the word can have based on the evidence available from the data, and
this often involves interpreting possible gradations of meaning. In the third stage, "Practise," students are
asked to do gap-fill and matching exercises without referring back to the concordances. Finally, in the
"Create" stage, they write a sentence or paragraph on a specified topic in which they practise the use of
the key word. In the chapter on "Drawing Conclusions and Summarising," for example, the learner is
asked to write a paragraph summarising the main differences between the terms conclusions and
summaries which they have studied in the Look, Familiarise, and Practise stages.
An important point to note is that students work throughout not on concocted examples, but on data
drawn from a corpus of authentic academic texts, whether these be concordance lines or the sentences for
the gap-fill exercises. Suggested answers to all the exercises are given at the end of the book with
commentary provided where appropriate.
The corpus used by Thurstun and Candlin is the Microconcord Corpus of Academic Texts (see Note 2), an electronic
collection of academic books and papers from a range of disciplines, with a total word count of over one
million words. The authors first identified words in the University Word List (see Nation, 1990) that
could be used in the performance of the specific rhetorical functions outlined above. Using the
Microconcord programme, the authors then produced sets of KWIC concordance lines in order to observe
frequencies of use as well as the lexical patterning around these words. Finally, they extracted those lines
that concisely represent the most common collocational features surrounding the word that had been
searched for (a full account of the procedures and the principles underlying them can be found in
Thurstun & Candlin, 1998).
Three points are worth making. Firstly, the corpus used is broadly representative of academic writing.
Against this, it can be argued convincingly that the texts chosen should have been closer to the types of
texts that students themselves will have to write, rather than a collection of texts written by expert writers
for a general academic audience, but such a corpus, of sufficient size, was not available at the time that
the book was developed. The Microconcord Corpus of Academic Texts is a far more relevant source of
data for EAP teaching than any of the large general corpora, and there are, to date, no large corpora of
native speaker student-generated academic text publicly available. Secondly, the words included are all
what can be termed "semi-technical vocabulary": lexical items that are more likely to appear in scientific
or academic than in more general texts, and also likely to appear in a wide range of academic texts.
Thirdly, the authors have sifted the concordance lines to reduce the number of lines that learners will have
to look through. One of the criticisms of Data-Driven Learning is that students can be overwhelmed by
the sheer quantity of information if they are asked to investigate corpora by themselves. This workbook
circumvents the problem by sorting through the raw data in advance and distilling the output to a
manageable level.
I found some of the so-called "Create" exercises, especially in the earlier parts of the book, mechanical,
and felt that they were neither creative nor a test of the learner's understanding. In Unit 3, for
example, to practise the use of the verb claim, the learner is asked to report the following statement (and
two others) using claim: "Even today, Canadians are not nearly so far away from the tradition of
Victorian gentility as we imagine (Waddington, 1989)."
There is no need for the learner to invest any original thought in this exercise in syntactic manipulation. In
the unit on "Expressing Ideas Tentatively," the manipulation required is less demanding; the learner is
asked to rewrite three sentences, using may in order to make the sentences tentative, for example, "This
alteration is excitatory or inhibitory: that is, it makes the receiving cell more or less likely to emit
impulses itself."
The rewriting involves changing the first "is" to "may be" and "makes" to "may make," which is a simple
task and does not require the writer to demonstrate an understanding of what tentativeness is, nor when it
is necessary to be tentative. A further problem is that if the learner does think about the sentence
analytically, he/she will note that the insertion of "may" does not actually make the statement tentative -- it remains a factive statement, explaining the two (known) types of alteration. In such cases, the teacher
might decide to leave out those exercises and devise their own activities in place of them.
Generally speaking, though, the book is an excellent implementation of corpus-informed (and informing)
insights. Because the concordances are already sifted and are available in a paper format, they are
immediately accessible. The repeated use of the four-stage approach also trains the learner in effective
corpus analysis skills.
As the authors themselves acknowledge in their article, a possible criticism of the approach is that a great
deal of time is invested in a relatively small number of words (19 in all). Learners may well feel that they
could invest their energies more profitably in acquiring a larger vocabulary in the same period of time,
with a little less depth. For example, the three exponents of one particular function dealt with in the book,
that of "Reporting the Research of Others," do not provide sufficient lexical resources for the developing
academic writer: "according to," "claim," and "suggest" are a beginning but will soon prove painfully
restricting unless the repertoire is supplemented. If this book is to be used as part of a writing course,
therefore, learners will need extension activities, so that they can explore other related key vocabulary
items for each function. Provided that they have access to appropriate corpora facilities (an academic text
corpus of adequate size, with good documentation, and concordancing software), they could be asked to
work in groups on different lexical items (hedging words, for example, as listed in the appendices of
Hyland, 2000, pp. 188-189) and present reports of their findings to the whole class. A wealth of ideas for
using concordancing in the classroom can be found in Tribble and Jones (1990) and in Partington (1998).
It should also be pointed out that a concordance-driven approach is primarily inductive: Learners are
invited to look for patterns in the data, and to form generalisations that can account for the patterns they
find. Not all learners like such an approach (Dudley-Evans & St John, 1998, p. 86), and teachers need to
consider whether such consciousness-raising activities are appropriate for their learners.
For those who are attracted to such an approach, however, using concordances in the classroom is a
stimulating and highly rewarding experience, both for teachers and learners. Exploring Academic English
makes the use of concordances in the EAP classroom much easier by following a highly systematic
approach and presenting sets of ready-made concordance lines, and is an impressive new departure both
in the field of EAP writing teaching materials, and of foreign language teaching materials writing in
general. It is reasonably priced and can either be used as a classroom textbook (I would see it as most
useful as a supplementary workbook), or for self-study, provided that learners are given some training in
working with concordance lines first.
NOTES
1. For the definitive bibliography on DDL, see Tim Johns' Web site
(http://web.bham.ac.uk/johnstf/biblio.htm).
2. Originally sold by Oxford University Press as an optional companion to the Microconcord
concordancing programme (Scott & Johns, 1993), but sadly now out of print.
ABOUT THE REVIEWER
Paul Thompson is a Research Fellow in the School of Linguistics and Applied Language Studies, at the
University of Reading, UK. His research interests are: second language writing pedagogy, the corpus-based analysis of academic discourse, and applications of Information Technology to language teaching.
E-mail: p.a.thompson@reading.ac.uk
REFERENCES
Dudley-Evans, A., & St John, M. (1998). Developments in English for specific purposes. Cambridge, UK:
Cambridge University Press.
Flowerdew, J. (1993). Concordancing as a tool in course design. System, 21(2), 231-244.
Hyland, K. (2000). Disciplinary discourses: Social interactions in academic writing. London: Longman.
Nation, P. (1990). Teaching and learning vocabulary. New York: Newbury House.
Partington, A. (1998). Patterns and meanings: Using corpora for English language research and
teaching. Amsterdam: John Benjamins.
Scott, M., & Johns, T. (1993). Microconcord. Oxford, UK: Oxford University Press.
Thurstun, J., & Candlin, C. (1998). Concordancing and the teaching of the vocabulary of academic
English. English for Specific Purposes, 17(3), 267-280.
Tribble, C., & Jones, G. (1990). Concordances in the classroom. London: Longman.
Language Learning & Technology
http://llt.msu.edu/vol5num3/review4/
September 2001, Vol. 5, Num. 3
pp. 32-36
REVIEW OF MONOCONC PRO AND WORDSMITH TOOLS
Title: MonoConc Pro Version 2.0
Developer:
Platform: PC
Hardware/System Requirements: Windows 95 or higher
Program Information: info@athel.com
Publisher: Athelstan
Support: On-line help and a small manual
Languages: Can be used with different languages
Audience: Beginning to advanced users
ISBN: not applicable
Price: US $85 single user; US $550 15 user site

Title: WordSmith Tools Version 3.0
Developer: Mike Scott
Platform: PC
Hardware/System Requirements: Minimum 80386 processor, VGA display or better, Windows 3.1x or Windows 95, minimum 4 MB RAM (8 MB if used with Windows 95).
Program Information: http://www.oup.com:8080/elt/global/catalogue/multimedia/wordsmithtools3/
Publisher: Oxford University Press
Support: On-line help and an extensive manual
Languages: Can be used with different languages
Audience: Beginning to advanced users
ISBN: 0-19-45-92863
Price: 51.95 British pounds

Reviewed by Randi Reppen, Northern Arizona University
The recent interest in corpus linguistics and the use of authentic materials has created a need for software
packages that allow teachers and researchers to carry out corpus-based investigations. These corpus-based
investigations can be used to augment classroom instruction so that ESL/EFL students are exposed to real
language rather than artificial texts and made-up examples. Teachers and researchers can also begin to
explore some of the more subtle areas of language use where our intuitions often lead us in the wrong
direction.
In this review, I will take a close look at WordSmith Tools (Version 3) and MonoConc Pro (Version 2),
two of the more readily available and reasonably priced packages for working with corpora, in order to
contrast the different options that they offer teachers and researchers. As with any software purchase, the
needs of the user should play a key role in deciding which program is most appropriate. Both programs
include many of the same features, such as the ability to create word lists (in both alphabetical order and
frequency order), generate concordance output, and give collocation information. Both programs easily
handle large corpora and work with either tagged or untagged texts. As with any software package, the
user needs to check the default settings (e.g., minimum or maximum number of hits displayed) to make
certain that they are set according to the user's needs. In the following paragraphs, I describe the major
features shared by the two programs as well as some of the more specialized features offered by only one
or the other.
One of the major innovations of these packages is that they allow users to analyze any collection of
ASCII texts. This is in marked contrast to earlier concordancing packages which required the user to build
a database of texts before using the program for analyses. This was usually an elaborate process, and
sometimes required sending texts to the software author or publisher before the concordancing tools could
be used. Further, the database normally could not be modified once it was constructed. Thus, the database
needed to be rebuilt any time additional texts were added. WordSmith and MonoConc Pro differ from
these earlier packages in that they allow the user to select any group of texts for analysis every time the
system is started. Better yet, additional texts can be added "on the fly," so that the corpus being analyzed
can be tailored to directly fit the immediate research questions.
The primary research use of both software packages is to generate concordances, or listings of all the
occurrences of any given word in a given text, with words shown in context. Concordance listings can be
useful for exploring the use and meanings of specific words. Often when looking at concordance lines,
users may want to expand the context so that they can get a better sense of the meaning or use. Here is
one area where the two programs differ quite a bit. Both programs allow the user to adjust the settings of
the concordance program to display more or less text on the concordance screen. However, MonoConc
Pro has an additional feature that is especially attractive for researchers: the split screen display allows
users to expand the context of an entry line simply by highlighting the line, which displays the fuller
context in the upper window (see Figure 1). In WordSmith, the entire display must be expanded or
reduced, so the context is expanded for all of the entries being viewed rather than for a single highlighted
entry.
Figure 1. MonoConc Pro screen display of concordance lines
Another nice feature of MonoConc Pro is that the total number of words in the corpus is always displayed
in the lower right hand corner (as shown in Figure 1). This information is vital for comparisons of texts of
unequal lengths, as the normalization of counts of linguistic features, a process that allows such
comparisons to be carried out accurately, relies on text length (for more information, see Methodology
Box 6 in Biber, Conrad, & Reppen, 1998, pp. 263-264).
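To make the role of text length concrete, here is a small Python sketch of normalization as a rate per fixed number of words. The function name `normalize`, the basis of 1,000 words, and the counts are hypothetical; neither program is claimed to compute it this way.

```python
def normalize(raw_count, text_length, basis=1000):
    """Convert a raw count into a rate per `basis` words of running text."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    return raw_count * basis / text_length

# Hypothetical figures: the raw counts alone suggest text B uses the
# feature more often, but the normalized rates show the opposite.
rate_a = normalize(30, 15_000)   # 30 hits in a 15,000-word text
rate_b = normalize(40, 40_000)   # 40 hits in a 40,000-word text
print(rate_a, rate_b)
```

The choice of basis is a matter of convenience; what matters is that both texts are reported on the same one.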
Both programs have sort functions that allow users to sort concordance lines in several ways (e.g., by
search word, then first word right; or by first word). Sorting words and seeing the collocation
immediately to the left or right of the target word can provide insights on word senses and uses. Another
feature found in both programs is the ability to "blank out" target words in the concordance output, which
can be useful to teachers for the development of vocabulary activities and cloze tests. By using corpora,
rather than teacher-made examples, teaching and testing materials reflect the language found in authentic
texts and thus provide learners with more exposure to real language. Concordance displays are quite
similar in both programs.
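The "blank out" idea can be sketched in a few lines of Python for teachers who want to see what such an activity involves. This is an illustration only, not either program's implementation; the function name `blank_out`, the blank string, and the example lines are assumptions.

```python
import re

def blank_out(line, target, blank="_____"):
    """Replace whole-word occurrences of target with a blank, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", flags=re.IGNORECASE)
    return pattern.sub(blank, line)

# Invented concordance lines for the verb/noun "claim".
lines = [
    "The authors claim that the effect is small.",
    "This claim has been disputed by later studies.",
]
cloze = [blank_out(line, "claim") for line in lines]
for item in cloze:
    print(item)
```

Because the match is restricted to whole words, forms such as "claims" or "claimed" are left visible; a teacher might or might not want that, depending on the activity.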
In addition to the functions that these programs have in common, WordSmith is able to perform a number
of useful tasks that MonoConc Pro is not. For example, WordSmith can provide information about the
distribution of a feature in a single text or across texts. Distributions are shown with a graph that plots the
occurrences of the target item in the text or corpus (see Figure 2). The distribution of a particular lexical
or grammatical feature across a text or series of texts can provide interesting information about the text
structure and also about how the feature functions across various texts. A similar tool is available in
MonoConc Pro; however, I was unable to interpret the bar graph display that it uses.
Figure 2. WordSmith plot distribution by text for the occurrence of thank
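A distribution display of this kind rests on a simple computation: divide the text into equal segments and count the hits in each. The following Python sketch shows one way to do that, assuming a hypothetical function `dispersion` and an invented text; it does not reproduce WordSmith's plot.

```python
import re

def dispersion(text, target, bins=10):
    """Count occurrences of target in each of `bins` equal slices of the text."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
    counts = [0] * bins
    for i, tok in enumerate(tokens):
        if tok == target.lower():
            counts[i * bins // len(tokens)] += 1
    return counts

# Invented 20-word text with "thank" at the very start and very end.
text = "thank " + "word " * 18 + "thank"
counts = dispersion(text, "thank", bins=4)
strip = "".join("|" if c else "." for c in counts)
print(counts, strip)
```

In this toy case the strip shows the item clustering at the opening and close of the text, the sort of pattern one might expect for thanking expressions in a conversation or letter.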
WordSmith also allows the user to compare word lists. The Key Word function allows the user to compare
a given text to a target text or target register, which can be particularly useful for cross-register
comparisons. For example, a teacher or researcher could compare biology textbooks to geology textbooks
in order to see what lexical similarities or differences occur. The Key Word function provides a quick
glimpse of what the text is about, since the list is not based on absolute frequency but rather on the
words that are distinctively frequent in the particular text.
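The principle behind a key word comparison can be sketched as follows. The review does not say which statistic WordSmith applies, so this example assumes a log-likelihood score as a stand-in; the function names and the two miniature "textbooks" are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word seen a times in c tokens and b times in d tokens."""
    e1 = c * (a + b) / (c + d)   # expected count in the study text
    e2 = d * (a + b) / (c + d)   # expected count in the reference text
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the study text."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:top]]

# Invented miniature texts standing in for two textbooks.
biology = "the cell divides and the cell membrane protects the cell".split()
geology = "the rock erodes and the rock layer covers the rock".split()
key = keywords(biology, geology)
print(key)
```

Note that "the" is as frequent as "cell" in the biology sample but never appears in the result, because it is equally frequent in the reference text.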
The Cluster function is perhaps the most innovative WordSmith feature, and it is quite powerful.
With this function, the user can specify clusters of two to eight words from a
concordance list and then see which words tend to co-occur (see Figure 3). Co-occurring words are often
idioms or set phrases.
Figure 3. WordSmith screen with clusters
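The idea of clusters can be approximated by counting repeated word sequences that contain the search word. The Python sketch below assumes a hypothetical function `clusters`, a cluster size of three, and invented concordance lines; WordSmith's own procedure may differ.

```python
from collections import Counter

def clusters(lines, target, n=3, min_freq=2):
    """Count n-word sequences containing target, within each line separately."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if target in gram:
                counts[gram] += 1
    return [(" ".join(g), c) for g, c in counts.most_common() if c >= min_freq]

# Invented concordance lines for "thank".
lines = [
    "thank you very much for coming",
    "thank you very much indeed",
    "i want to thank you for that",
]
found = clusters(lines, "thank")
print(found)
```

Setting a minimum frequency filters out one-off sequences, so that only recurrent strings, the likely set phrases, remain.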
WordSmith also has a feature that allows the user to align two texts and create a new file that contains one
displayed over the other. This is extremely useful for comparing translations or two versions of the same
text. The texts are displayed in different colors for ease of reading. See Figure 4 for an example of this
feature used to check a translation against the original text.
Figure 4. Aligning two texts to check a translation (excerpt from WordSmith on-line manual)
The main advantage of MonoConc Pro over WordSmith is that it is much easier to use. For example,
when MonoConc Pro is launched, a clear easy-to-use screen appears with a bar across the top, providing
the options available. On the other hand, when WordSmith is launched there are many screens that appear,
and until the user becomes familiar with the program, just getting the program going can be a bit of a
challenge. For someone starting out with corpus analysis, and wanting to focus mostly on concordancing,
MonoConc Pro is more user-friendly. The screens are clearer, and since they resemble the screens of
many word processing programs, users may feel more comfortable.
In summary, both programs offer users powerful tools for searching texts and exploring how language is
used in natural settings, thus providing valuable resources for teachers and researchers. However, the two
programs have different strengths: for users who are less comfortable with computers, MonoConc Pro's
interface is much more user-friendly than that of WordSmith. For those who are comfortable
with computers and plan to carry out more powerful text analysis, WordSmith would be a better choice.
So, while both MonoConc Pro and WordSmith offer attractive options for exploring texts, the best choice
will depend on the specific goals and experience of the user.
ABOUT THE REVIEWER
Randi Reppen is an Assistant Professor in Northern Arizona University's MA-TESL/PhD-Applied
Linguistics Program, and the Director of the Program in Intensive English. She is co-author of Corpus
Linguistics: Investigating Language Structure and Use with Douglas Biber and Susan Conrad (1998). Her
research interests include corpus linguistics and the use of corpora in materials development.
E-mail: Randi.Reppen@NAU.EDU
REFERENCE
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and
use. Cambridge, UK: Cambridge University Press.
Language Learning & Technology
http://llt.msu.edu/vol5num3/lee/
September 2001, Vol. 5, Num. 3
pp. 37-72
GENRES, REGISTERS, TEXT TYPES, DOMAINS, AND STYLES:
CLARIFYING THE CONCEPTS AND NAVIGATING A PATH THROUGH
THE BNC JUNGLE
David YW Lee
Lancaster University, UK
ABSTRACT
In this paper, an attempt is first made to clarify and tease apart the somewhat confusing
terms genre, register, text type, domain, sublanguage, and style. The use of these terms
by various linguists and literary theorists working under different traditions or
orientations will be examined and a possible way of synthesising their insights will be
proposed and illustrated with reference to the disparate categories used to classify texts in
various existing computer corpora. With this terminological problem resolved, a personal
project which involved giving each of the 4,124 British National Corpus (BNC, version
1) files a descriptive "genre" label will then be described. The result of this work, a
spreadsheet/database (the "BNC Index") containing genre labels and other types of
information about the BNC texts, will then be presented and its usefulness shown. It is
envisaged that this resource will allow linguists, language teachers, and other users to
navigate through or scan the huge BNC jungle more easily, to quickly ascertain
what is there (and how much) and to make informed selections from the mass of texts
available. It should also greatly facilitate genre-based research (e.g., EAP, ESP, discourse
analysis, lexicogrammatical, and collocational studies) and focus everyday classroom
concordancing activities by making it easy for people to restrict their searches to highly
specified sub-sets of the BNC using PC-based concordancers such as WordSmith,
MonoConc, or the Web-based BNCWeb.
INTRODUCTION
Most corpus-based studies rely implicitly or explicitly on the notion of genre or the related concepts
register, text type, domain, style, sublanguage, message form, and so forth. There is much confusion
surrounding these terms and their usage, as anyone who has done any amount of language research
knows. The aims of this paper are therefore two-fold. I will first attempt to distinguish among the terms
because I feel it is important to point out the different nuances of meaning and theoretical orientations
lying behind their use. I then describe an attempt at classifying the 4,124 texts in the British National
Corpus (BNC) in terms of a broad sense of genre, in order to give researchers and language teachers a
better avenue of approach to the BNC for doing all kinds of linguistic and pedagogical research.
Categorising Texts: Genres, Registers, Domains, Styles, Text Types, & Other Confusions
Why is it important to know what these different terms mean, and why should corpus texts be classified
into genres? The short answer is that language teachers and researchers need to know exactly what kind
of language they are examining or describing. Furthermore, most of the time we want to deal with a
specific genre or a manageable set of genres, so that we can define the scope of any generalisations we
make. My feeling is that genre is the level of text categorisation which is theoretically and pedagogically
most useful and most practical to work with, although classification by domain is important as well (see
discussion below). There is thus a real need for large-scale general corpora such as the BNC to clearly
label and classify texts in a way that facilitates language description and research, beyond the
very broad classifications currently in place. It is impossible to make many useful generalisations about
"the English language" or "general English" since these are abstract constructions. Instead, it is far easier
and theoretically more sound to talk about the language of different genres of text, or the language(s) used
in different domains, or the different types of register available in a language, and so forth.
Computational linguists working in areas of natural language processing/language engineering have long
realised the need to target the scope of their projects to very specific areas, and hence they talk about
sublanguages such as air traffic control talk, journal articles on lipoprotein kinetics, navy telegraphic
messages, weather reports, and aviation maintenance manuals (see Grishman & Kittredge, 1986;
Kittredge & Lehrberger, 1982, for detailed discussions of "sublanguages").
The terminological issue I grapple with here is a very vexing one. Although not all linguists will
recognise or actively observe the distinctions I am about to make (in particular, the use of the term text
type, which can be used in a very vague way to mean almost anything), I believe there is actually more
consensus on these issues than users of these terms themselves realise, and I hope to show this below.
Internal Versus External Criteria: Text Type & Genre
One way of making a distinction between genre and text type is to say that the former is based on
external, non-linguistic, "traditional" criteria while the latter is based on the internal, linguistic
characteristics of texts themselves (Biber, 1988, pp. 70 & 170; EAGLES, 1996).1 A genre, in this view, is
defined as a category assigned on the basis of external criteria such as intended audience, purpose, and
activity type, that is, it refers to a conventional, culturally recognised grouping of texts based on
properties other than lexical or grammatical (co-)occurrence features, which are, instead, the internal
(linguistic) criteria forming the basis of text type categories. Biber (1988) has this to say about external
criteria:
Genre categories are determined on the basis of external criteria relating to the speaker's
purpose and topic; they are assigned on the basis of use rather than on the basis of form.
(p. 170)
However, the EAGLES (1996)2 authors would quibble somewhat with the inclusion of the word topic
above and argue that one should not think of topic as being something to be established a priori, but rather
as something determined on the basis of internal criteria (i.e., linguistic characteristics of the text):
Topic is the lexical aspect of internal analysis of a text. Externally the problem of
classification is that there are too many possible methods, and no agreement or stability
in societies or across them that can be built upon ... The boundaries between ... topics are
ultimately blurred, and we would argue that in the classification of topic for corpora, it is
best done on a higher level, with few categories of topic which would alter according to
the language data included. There are numerous ways of classifying texts according to
topic. Each corpus project has its own policies and criteria for classification … The fact
that there are so many different approaches to the classification of text through topic, and
that different classificatory topics are identified by different groups indicates that existing
classification[s] are not reliable. They do not come from the language, and they do not
come from a generally agreed analysis. However they are arrived at, they are subjective,
and … the resulting typology is only one view of language, among many with equal
claims to be the basis of a typology. (p. 17)
So perhaps it is best to disregard the word "topic" in the quote from Biber above, and take genres simply
as categories chosen on the basis of fairly easily definable external parameters. Genres also have the
property of being recognised as having a certain legitimacy as groupings of texts within a speech
community (or by sub-groups within a speech community, in the case of specialised genres). This is
essentially the view of genre taken by Swales (1990, pp. 24-27), who talks about genres being "owned"
(and, to varying extents, policed) by particular discourse communities.
Without going into the minutiae of the EAGLES' recommendations, all I will say is that detailed, explicit
recommendations do not yet exist in terms of identifying text types or, indeed, any so-called "internal
criteria." That is, there are, as yet, no widely accepted or established text-type-based categories consisting
of texts which cut across traditionally recognisable genres on the basis of internal linguistic features (see
discussion below). On the subject of potentially useful internal classificatory criteria, the EAGLES
authors mention the work of Phillips (1983) under the heading of topic (the "aboutness" or
"intercollocation of collocates" or "lexical macrostructures" of texts), and the work of Biber (1988, 1989)
and Nakamura (1986, 1987, 1992, 1993) under the heading of style (which the EAGLES authors
basically divide into "formal/informal," combining this with parameters such as "considered/impromptu"
and "one-way/interactive"). However, the authors offer no firm recommendations, merely the observation
that "these are only shafts of light in a vast darkness" (p. 25), and they do not mention what a possible text
type could be (in fact, no examples are even given of possible labels for text types). At present, all corpora
use only external criteria to classify texts. Indeed, as Atkins, Clear, & Ostler (1992, p. 5) note, there is a
good reason for this:
The initial selection of texts for inclusion in a corpus will inevitably be based on external
evidence primarily … A corpus selected entirely on internal criteria would yield no
information about the relation between language and its context of situation.
The EAGLES (1996) authors add that
[the] classification of texts based purely on internal criteria does not give prominence to
the sociological environment of the text, thus obscuring the relationship between the
linguistic and non-linguistic criteria. (p. 7)
Coming back to the distinction between genre and text type, therefore, the main thing to remember here is
what the two different approaches to classification mean for texts and their categorisation. In theory, two
texts may belong to the same text type (in Biber's sense) even though they may come from two different
genres because they have some similarities in linguistic form (e.g., biographies and novels are similar in
terms of some typically "past-tense, third-person narrative" linguistic features). This highly restricted use
of text type is an attempt to account for variation within and across genres (and hence, in a way, to go
"above and beyond" genre in linguistic investigations). Biber's (1989, p. 6) use of the term, for example,
is prompted by his belief that "genre distinctions do not adequately represent the underlying text types of
English …; linguistically distinct texts within a genre represent different text types; linguistically similar
texts from different genres represent a single text type."
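The contrast between external and internal classification can be made concrete with a minimal sketch in Python. Everything in it is an assumption made for illustration: the four texts, their feature rates (per 1,000 words), the two features chosen, and the crude threshold rule are all invented, and the rule is in no way Biber's multi-dimensional procedure. The sketch only shows that grouping by genre label and grouping by linguistic features can cut the same texts differently.

```python
from collections import defaultdict

# Hypothetical texts with invented feature rates (per 1,000 words).
# "genre" is an external label; the two rates are internal, linguistic features.
TEXTS = [
    {"id": "bio_01", "genre": "biography", "past_tense": 62.0, "third_person": 48.0},
    {"id": "nov_01", "genre": "novel", "past_tense": 58.0, "third_person": 51.0},
    {"id": "rec_01", "genre": "recipe", "past_tense": 3.0, "third_person": 2.0},
    {"id": "nov_02", "genre": "novel", "past_tense": 12.0, "third_person": 9.0},
]


def text_type(text, threshold=30.0):
    """Assign a toy 'text type' from internal features only; genre is ignored."""
    narrative = (
        text["past_tense"] >= threshold and text["third_person"] >= threshold
    )
    return "narrative-like" if narrative else "non-narrative"


def group_by(texts, key):
    """Group text ids under whatever label the key function returns."""
    groups = defaultdict(list)
    for text in texts:
        groups[key(text)].append(text["id"])
    return dict(groups)


by_genre = group_by(TEXTS, lambda t: t["genre"])  # external criterion
by_type = group_by(TEXTS, text_type)              # internal criterion

print(by_genre)
print(by_type)
```

With these invented values, the biography and one novel fall into the same toy text type, while the two novels, which share a genre label, are split across types.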
Paltridge (1996), in an article on "Genre, Text Type, and the Language Learning Classroom," makes
reference to Biber (1988; but, crucially, not to Biber 1989)3 and proposes a usage of the terms genre and
text type which he claims is in line with Biber's external/internal distinction, as delineated above. It is
clear from the article, however, that what Paltridge means by "internal criteria" differs considerably from
what Biber meant. Paltridge proposes the following distinction:
Table 1. Paltridge's Examples of Genres and "Text Types" (based on Hammond, Burns, Joyce, Brosnan,
& Gerot, 1992)

Genre                 Text Type
Recipe                Procedure
Personal letter       Anecdote
Advertisement         Description
Police report         Description
Student essay         Exposition
Formal letter         Exposition
Format letter         Problem–Solution
News item             Recount
Health brochure       Procedure
Student assignment    Recount
Biology textbook      Report
Film review           Review

As can be seen, what Paltridge calls "text types" are probably better termed "discourse/rhetorical structure
types," since the determinants of his "text types" are not surface-level lexicogrammatical or syntactic
features (Biber's "internal linguistic features"), but rhetorical patterns (which is what Hoey, 1986, p. 130,
for example, calls them). Paltridge's sources, Meyer (1975), Hoey (1983), Crombie (1985), and Hammond
et al. (1992), are all similarly concerned with text-level/discoursal/rhetorical structures or patterns in texts,
which most linguists would probably not consider as constituting "text types" in the more usual sense.
Returning to Biber's distinction between genre and text type, then, what we can say is that his "internal
versus external" distinction is attractive. However, as noted earlier, the main problem is that linguists
have still not firmly decided on or enumerated or described in concrete terms the kinds of text types (in
Biber's sense) we would profit from looking at. Biber's (1989) work on text typology (see also Biber &
Finegan, 1986) using his factor-analysis-based multi-dimensional (MD) approach is the most suggestive
work so far in this area, but his categories do not seem to have been taken up by other linguists. His eight
text types (e.g., "informational interaction," "learned exposition," "involved persuasion") are claimed to
be maximally distinct in terms of their linguistic characteristics. The classification here is at the level of
individual texts, not groups such as "genres," so texts which nominally "belong together" in a "genre" (in
terms of external criteria) may land up in different text types because of differing linguistic
characteristics. An important caveat to mention, however, is that there are many questions surrounding
the statistical validity, empirical stability, and linguistic usefulness of the linguistic "dimensions" from
which Biber derives these "text types," or clusters of texts sharing internal linguistic characteristics (see
Lee, 2000, for a critique) and hence these text typological categories should be taken as indicative rather
than final. Kennedy (1998) has said, for example, that
Some of the text types established by the factor analysis do not seem to be clearly
different from each other. For example, the types "learned" and "scientific" exposition …
may differ only in some cases because of a higher incidence of active verbs in the
"learned" text type. (p. 188)
One could also question the aptness or helpfulness of some of the text type labels (e.g., how useful is it to
know that 29% of "official documents" belong to the text type "scientific exposition"?).
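A figure such as the one just quoted is simply a cross-tabulation of externally assigned genre against internally derived text type, computed over individual texts. The following Python sketch shows the arithmetic under stated assumptions: the genre and text type labels are borrowed from the discussion above, but the assignments themselves are invented toy data and do not reproduce Biber's results.

```python
from collections import Counter

# Invented (genre, text type) assignments for individual texts.
ASSIGNMENTS = [
    ("official documents", "scientific exposition"),
    ("official documents", "learned exposition"),
    ("official documents", "learned exposition"),
    ("official documents", "learned exposition"),
    ("academic prose", "scientific exposition"),
    ("academic prose", "learned exposition"),
]


def type_shares(assignments, genre):
    """Return the percentage of a genre's texts that fall in each text type."""
    counts = Counter(tt for g, tt in assignments if g == genre)
    total = sum(counts.values())
    return {tt: round(100 * n / total, 1) for tt, n in counts.items()}


shares = type_shares(ASSIGNMENTS, "official documents")
print(shares)
```

Whether such percentages are informative depends entirely on how well motivated the text type labels are, which is the question raised above.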
It therefore still remains to be seen if stable and valid dimensions of (internal) variation, which can serve
as useful criteria for text typology, can be found. At the risk of rocking the boat, I would also like to say
that, personally, I am not convinced that there is a pressing need to determine "all the text types in the
English language" or to balance corpora on the basis of these types. Biber (1993) notes that it is more
important as a first step in compiling a corpus to focus on covering all the situational parameters of
language variation, because they can be determined prior to the collection of texts, whereas
there is no a priori way to identify linguistically defined types ... [however,] the results of
previous research studies, as well as on-going research during the construction of a
corpus, can be used to assure that the selection of texts is linguistically as well as
situationally representative [italics added]. (p. 245)
My question, however, is: what does it mean to say that a corpus is "linguistically representative" or
linguistically balanced? Also, why should this be something we should strive towards? The EAGLES
(1996) authors say that we should see progress in corpus compilation and text typology as a cyclical
process:
The internal linguistic criteria of the text [are] analysed subsequent to the initial selection
based on external criteria. The linguistic criteria are subsequently upheld as particular to
the genre … [Thus] classification begins with external classification and subsequently
focuses on linguistic criteria. If the linguistic criteria are then related back to the external
classification and the categories adjusted accordingly, a sort of cyclical process ensues
until a level of stability is established. (p. 7)
Or, as the authors say later, this process is one of "frequent cross-checking between internal and external
criteria so that each establishes a framework of relevance for the other" (p. 25). Beyond these rather
abstract musings, however, there is not enough substantive discussion of what text types or other kinds of
internally-based criteria could possibly look like or how exactly they would be useful in balancing
corpora.
In summary, with text type still being an elusive concept which cannot yet be established explicitly in
terms of linguistic features, perhaps the looser use of the term by people such as Faigley and Meyer
(1983) may be just as useful: they use text type in the sense of the traditional four-part rhetorical
categories of narrative, description, exposition and argumentation. Steen (1999, p. 113) similarly calls
these four classes "types of discourse."4 Stubbs (1996, p. 11), on the other hand, uses text type and genre
interchangeably, in common, perhaps, with most other linguists. At present, such usages of text type
(which do not observe the distinctions Biber and EAGLES try to make) are perhaps as consistent and
sensible as any, as long as people make it clear how they are using the terms. It does seem redundant,
however, to have two terms, each carrying its own historical baggage, both covering the same ground.
"Genre," "Register," and "Style"
Other terms often used in the literature on language variation are register and style. I will now walk into a
well-known quagmire and try to distinguish between the terms genre, register, and style. In his
Dictionary of Linguistics and Phonetics, Crystal (1991, p. 295) defines register as "a variety of language
defined according to its use in social situations, e.g. a register of scientific, religious, formal English."
(Presumably these are three different registers.) Interestingly, Crystal does not include genre in his
dictionary, and therefore does not try to define it or distinguish it from other similar/competing terms. In
Crystal & Davy (1969), however, the word style is used in the way most other people use register: to
refer to particular ways of using language in particular contexts. The authors felt that the term register had
become too loosely applied to almost any situational variety of language of any level of generality or
abstraction, and distinguished by too many different situational parameters of variation. (Using style in
the same loose fashion, however, hardly solves anything, and, as I argue below, goes against the usage of
style by most people in relation to individual texts or individual authors/speakers.)
The two terms genre5 and register are the most confusing, and are often used interchangeably, mainly
because they overlap to some degree. One difference between the two is that genre tends to be associated
more with the organisation of culture and social purposes around language (Bhatia, 1993; Swales, 1990),
and is tied more closely to considerations of ideology and power, whereas register is associated with the
organisation of situation or immediate context. Some of the most elaborated ideas about genre and
register can be found within the tradition of systemic functional grammar. The following diagram (Martin
& Matthiessen, 1991, reproduced in Martin, 1993, p. 132) shows the relation between language and
context, as viewed by most practitioners of systemic-functional grammar:
Figure 1. Language and context in the systemic functional perspective
In this tradition, register is defined as a particular configuration of field, tenor, and mode choices (in
Hallidayan grammatical terms), in other words, a language variety functionally associated with particular
contextual or situational parameters of variation and defined by its linguistic characteristics. The
following diagram illustrates this more clearly:
Figure 2. Metafunctions in relation to register and genre6
Genre, on the other hand, is more abstractly defined:
A genre is known by the meanings associated with it. In fact the term "genre" is a short
form for the more elaborate phrase "genre-specific semantic potential" … Genres can
vary in delicacy in the same way as contexts can. But for some given texts to belong to
one specific genre, their structure should be some possible realisation of a given GSP
[Generic Structure Potential] … It follows that texts belonging to the same genre can vary
in their structure; the one respect in which they cannot vary without consequence to their
genre-allocation is the obligatory elements and dispositions of the GSP. (Halliday &
Hasan, 1985, p. 108)
[T]wo layers of context are needed -- with a new level of genre [italics added] posited
above and beyond the field, mode and tenor register variables … Analysis at this level
has concentrated on making explicit just which combinations of field, tenor and mode
variables a culture enables, and how these are mapped out as staged, goal-oriented social
processes [italics added]. (Eggins & Martin, 1997, p. 243)
These are rather theory-specific conceptualisations of genre, and are therefore a little opaque to those not
familiar with systemic-functional grammar. The definition of genre in terms of "staged, goal-oriented
social processes" (in the quote above, and in Martin, Christie, & Rothery, 1987), is, in particular, slightly
confusing to those who are more concerned (or familiar) with genres as products (i.e., groupings of texts).
Ferguson (1994), on the other hand, offers a less theory-specific discussion. However, he is rather vague,
and talks about (and around) the differences between the two terms while never actually defining them
precisely: He seems to regard register as a "communicative situation that recurs regularly in a society" (p.
20) and genre as a "message type that recurs regularly in a community" (p. 21). Faced with such
comparable definitions, readers will be forgiven for becoming a little confused. Also, is register only a
"communicative situation," or is it a variety of language as well? In any case, Ferguson also seems to
equate sublanguage with register (p. 20) and offers many examples of registers (e.g., cookbook recipes,
stock market reports, regional weather forecasts) and genres (e.g., chat, debate, conversation, recipe,
obituary, scientific textbook writing) without actually saying why any of the registers cannot also be
thought of as genres or vice versa. Indeed, sharp-eyed readers will have noted that recipes are included
under both register and genre.
Coming back to the systemic-functional approach, it will be noted that even among subscribers to the
"genre-based" approach in language pedagogy (Cope & Kalantzis, 1993), opinions differ on the definition
and meaning of genre. For J. R. Martin, as we have seen, genre is above and beyond register, whereas for
Gunther Kress, genre is only one part of what constitutes his notion of register (a superordinate term).
The following diagram illustrates his use of the terms:
Figure 3. Elements of the composition of text (Kress, 1993, p. 35)
Kress (1993) appears to dislike the fact that genre is made to carry too much baggage or different strands
of information:
There is a problem in using such a term [genre] with a meaning that is relatively
uncontrollable. In literary theory, the term has been used with relative stability to
describe formal features of a text -- epitaph, novel, sonnet, epic -- although at times
content has been used to provide a name, [e.g.] epithalamion, nocturnal, alba. In screen
studies, as in cultural studies, labels have described both form and content, and at times
other factors, such as aspects of production. Usually the more prominent aspect of the
text has provided the name. Hence "film noir"; "western" or "spaghetti western" or
"psychological" or "Vietnam western"; "sci-fi"; "romance"; or "Hollywood musical"; and
similarly with more popular print media. (pp. 31-2)
In other words, Kress is complaining about the fact that
a great complex of factors is condensed and compacted into the term -- factors to do with
the relations of producer and audience, modes of production and consumption, aesthetics,
histories of form and so on. (p. 32)
He claims that many linguists, educators, and literacy researchers, especially those working within the
Australian-based "genre theory/school" approach, use the term in the same all-encompassing way. Also,
he is concerned that the work of influential people like Martin and Rothery has been focussed too much
on presenting ideal generic texts and on the successive "unfolding" of "sequential stages" in texts (which
are said to reflect the social tasks which the text producers perform; Paltridge, 1995, 1996, 1997):
The process of classification … seems at times to be heading in the direction of a new
formalism, where the 'correct' way to write [any particular text] is presented to students in
the form of generic models and exegeses of schematic structure. (Kress, 1993, p. 12)
Those familiar with Kress' work in critical discourse analysis (e.g., Kress & Hodge, 1979) should not be
surprised to learn, however, that in his approach to genre the focus is instead:
… on the structural features of the specific social occasion in which the text has been
produced [, seeing] these as giving rise to particular configurations of linguistic factors in
the text which are realisations of, or reflect, these social relations and structures [,…e.g.]
who has the power to initiate turns and to complete them, and how relations of power are
realised linguistically. In this approach "genre" is a term for only a part of textual
structuring, namely the part which has to do with the structuring effect on text of sets of
complex social relations between consumers and producers of texts. [all italics added] (p.
33)
As can be seen, therefore, there is a superficial terminological difference in the way genre is used by
Martin and Kress, but no real, substantive disagreement, because both situate it within the broader
context of situational and social structure. While genre encompasses register and goes "above and
beyond" it in Martin's (1993; Eggins & Martin, 1997) terms, it is only one component of the larger
overarching term register in Kress' approach. My own preferred usage of the terms comes closest to
Martin's, and will be described below. Before that, however, I will briefly consider two other attempts at
clearing up the terminological confusion.
Sampson (1997) calls for re-definitions of genre, register, and style and the relationships among them, but
his argument is not quite lucid or convincing enough. In particular, his proposal for register to be
recognised as fundamentally to do with an individual's idiolectal variation seems to go against the grain of
established usage, and is unlikely to catch on. Biber (Finegan & Biber, 1994, pp. 51-53; 1995, pp. 7-10)
does a similar survey, looking at the use of the terms register, genre, style, sublanguage, and text type in
the sociolinguistic literature, and despairingly comes to the conclusion that register and genre, in
particular, cannot be teased apart. He settles on register as "the general cover term associated with all
aspects of variation in use" (1995, p. 9), but in so doing reverses his choice of the term genre in his earlier
studies, as in Biber (1988) and Biber & Finegan (1989). (Further, as delineated in Finegan & Biber, 1994,
Biber also rather controversially sees register variation as a very fundamental basis or cause of social
dialect variation.)
While hoping not to muddy the waters any further, I shall now attempt to state my position on this
terminological issue. My own view is that style is essentially to do with an individual's use of language.
So when we say of a text, "It has a very informal style," we are characterising not the genre to which it
belongs, but rather the text producer's use of language in that particular instance (e.g., "It has a very
quirky style"). The EAGLES (1996) authors are not explicit about their stand on this point, but say they
use style to mean:
the way texts are internally differentiated other than by topic; mainly by the choice of the
presence or absence of some of a large range of structural and lexical features. Some
features are mutually exclusive (e.g. verbs in the active or passive mood), and some are
preferential, e.g. politeness markers and mitigators. (p. 22)
As noted earlier, the main distinction they recommend for the stylistic description of corpus texts is
formal/informal in combination with parameters such as the level of preparation (considered/impromptu),
"communicative grouping" (conversational group; speaker/writer and audience; remote audiences) and
"direction" (one-way/interactive). This chimes with my suggestion that we should use the term style to
characterise the internal properties of individual texts or the language use by individual authors, with
"formality" being perhaps the most important and fundamental one. Joos's (1961) five famous epithets
"frozen," "formal," "informal," "colloquial," and "intimate" come in handy here, but these are only
suggestive terms, and may be multiplied or sub-divided endlessly, since they are but five arbitrary points
on a sliding scale. On a more informal level, we may talk about speakers or writers having a "humorous,"
"ponderous," or "disjointed" style, or having a "repertoire of styles." Thus, describing one text as
"informal" in style is not to say the speaker/writer cannot also write in a "serious" style, even within the
same genre.
The two most problematic terms, register and genre, I view as essentially two different points of view
covering the same ground. In the same way that any stretch of language can simultaneously be looked at
from the point of view of form (or category), function, or meaning (by analogy with the three sides of a
cube), register and genre are in essence two different ways of looking at the same object.7 Register is
used when we view a text as language: as the instantiation of a conventionalised, functional configuration
of language tied to certain broad societal situations, that is, variety according to use. Here, the point of
view is somewhat static and uncritical: different situations "require" different configurations of language,
each being "appropriate" to its task, being maximally "functionally adapted" to the immediate situational
parameters of contextual use. Genre is used when we view the text as a member of a category: a culturally
recognised artifact, a grouping of texts according to some conventionally recognised criteria, a grouping
according to purposive goals, culturally defined. Here, the point of view is more dynamic and, as used by
certain authors, incorporates a critical linguistic (ideological) perspective: Genres are categories
established by consensus within a culture and hence subject to change as generic conventions are
contested/challenged and revised, perceptibly or imperceptibly, over time.
Thus, we talk about the existence of a legal register (focus: language), but of the instantiation of this in
the genres of "courtroom debates," "wills" and "testaments," "affidavits," and so forth (focus: category
membership). We talk about a formal register, where "official documents" and "academic prose" are
possible exemplar genres. In contrast, there is no literary register, but, rather, there are literary styles and
literary genres, because the very essence of imaginative writing is idiosyncrasy or creativity and
originality (focus on the individual style). My approach here thus closely mirrors that of Fairclough
(2000, p. 14) and Eggins & Martin (1997). The latter say that "the linguistic features selected in a text will
encode contextual dimensions, both of its immediate context of production (i.e., register) and of its
generic identity (i.e., genre), what task the text is achieving in the culture" (p. 237), although they do not
clearly set out the difference in terms of a difference in point of view, as I have done above. Instead, as
we have seen, they attempt in rather vague terms to define register as a variety "organised by
metafunction" (Field, Tenor, Mode) and genre as something "above and beyond metafunctions." In
Biber's (1994) survey of this area of terminological confusion, he mentions the use of terminology by
Couture (1986), but fails to note a crucial distinction apparently made by the author:
Couture's examples of genres and registers seem to be more clearly distinguished than in
other studies of this type. For example, registers include the language used by preachers
in sermons, the language used by sports reporters in giving a play-by-play description of
a football game, and the language used by scientists reporting experimental research
results. Genres include both literary and non-literary text varieties, for example, short
stories, novels, sonnets, informational reports, proposals, and technical manuals. [all italics
added] (Finegan & Biber, 1994, p. 52)
Biber does not point out that a key division of labour between the two terms is being made here which has
nothing to do with the particular examples of activity types, domains, topics, and so forth: whenever
register is used, Couture is talking about "the language used by…", whereas when genre is used, we are
dealing with "text varieties" (i.e., groupings of texts).
I contend that it is useful to see the two terms genre and register as really two different angles or points of
view, with register being used when we are talking about lexico-grammatical and discoursal-semantic
patterns associated with situations (i.e., linguistic patterns), and genre being used when we are talking
about memberships of culturally-recognisable categories. Genres are, of course, instantiations of registers
(each genre may invoke more than one register) and so will have the lexico-grammatical and
discoursal-semantic configurations of their constitutive registers, in addition to specific generic socio-cultural
expectations built in.
Genres can come and go, or change, being cultural constructs which vary with the times, with fashion,
and with ideological movements within society. Thus, some sub-genres of "official documents" in
English have been observed to have changed in recent times, becoming more conversational, personal,
and familiar, sometimes in a deliberate way, with manipulative purposes in mind (Fairclough, 1992). The
genres have thus changed in terms of the registers invoked (an aspect of intertextuality), among other
changes, but the genre labels stay the same, since they are descriptors of socially constituted, functional
categories of text.
Much of the confusion comes from the fact that language itself sometimes fails us, and we end up using
the same words to describe both language (register or style) and category (genre). For example,
"conversation" can be a register label ("he was talking in the conversational register"), a style label ("this
brochure employs a very conversational style"), or a genre label ("the [super-]genre of casual/face-to-face
conversations," a category of spoken texts). Similarly, weather reports are cited by Ferguson (1994) as
forming a register (from the point of view of the language being functionally adapted to the situational
purpose), but they are surely also a genre (a culturally recognised category of texts). Ferguson gives
"obituaries" as an example of a genre, but fails to recognise that there is not really a recognisable
"register of obituaries" only because the actual language of obituaries is not fixed or conventionalised,
allowing considerable variation ranging from humorous and light to serious and ponderous.
Couture (1986) also offers an additional angle on the distinction between register and genre:
While registers impose explicitness constraints at the level of vocabulary and syntax,
genres impose additional explicitness constraints at the discourse level … Both literary
critics and rhetoricians traditionally associate genre with a complete, unified textual
structure. Unlike register, genre can only be realized in completed texts or texts that can
be projected as complete, for a genre does more than specify kinds of codes extant in a
group of related texts; it specifies conditions for beginning, continuing, and ending a text.
(p. 82)
The important point being made here is that genres are about whole texts, whereas registers are about
more abstract, internal/linguistic patterns, and, as such, exist independently of any text-level structures.
In summary, I prefer to use the term genre to describe groups of texts collected and compiled for corpora
or corpus-based studies. Such groups are all more or less conventionally recognisable as text categories,
and are associated with typical configurations of power, ideology, and social purposes, which are
dynamic/negotiated aspects of situated language use. Using the term genre will focus attention on these
facts, rather than on the rather static parameters with which register tends to be associated. Register has
typically been used in a very uncritical fashion, to invoke ideas of "appropriateness" and "expected
norms," as if situational parameters of language use had an unquestionable, natural association with
certain linguistic features and as if social evaluations of contextual usage were given rather than
conventionalised and contested. Nevertheless, the term has its uses, especially when referring to that body
of work in sociolinguistics which is about "registral variation," where the term tells us we are dealing with
language varying according to socio-situational parameters. In contrast, the possible parallel term
"genre/generic variation" does not seem to be used, because while you can talk about "language variation
according to social situations of use," it makes no sense to talk about "categories of texts varying
according to the categories they belong to." Of course, I am not saying that genres do not have internal
variation (or sub-genres). I am saying that "genre variation" makes no sense as a parallel to "register
variation" because while you can talk about language (registers) varying across genres, it is tautologous to
talk about genres (text categories) varying across genres or situations. In other words, when we study
differences among genres, we are actually studying the way the language varies because of social and
situational characteristics and other genre constraints (registral variation), not the way texts vary because
of their categorisation.
Genres as Basic-Level Categories in a Prototype Approach
One problem with genre labels is that they can have so many different levels of generality. For example,
some genres such as "academic discourse" are actually very broad, and texts within such a high-level
genre category will show considerable internal variation: that is, individual texts within such a genre can
differ significantly in their use of language (as, for example, Biber, 1988, has shown). A second problem,
as Kress noted, is that different "genres" can be based on so many different criteria (domain, topic,
participants, setting, etc.).
There is a possible solution to this. Steen (1999) is an interesting attempt at applying prototype theory
(Rosch, 1973a, 1973b, 1978; Taylor, 1989) to the conceptualisation of genre (and hence to the
formalisation of a taxonomy of discourse; cf. also Paltridge, 1995, who made a similar argument but from
a different perspective). Basically, the prototype approach can be summarised by Table 2 (which
represents my understanding of Steen's ideas; my own suggestions are marked by "?"):
Table 2. A Prototype Approach to Genre

SUPERORDINATE                 BASIC-LEVEL                  SUBORDINATE [PROTOTYPE]
Mammal                        Dog/Cat                      Cocker spaniel / Siamese
Literature ["SUPERGENRE"?]    Novel, Poem, Drama [GENRE]   Western, Romance, Adventure [SUB-GENRE]
Advertising ["SUPERGENRE"?]   Advertisement [GENRE]        Print ad, Radio ad, TV ad, T-shirt ad [SUB-GENRE]
Basic-level categories are those which are in the middle of a hierarchy of terms. They are characterised as
having the maximal clustering of humanly-relevant properties (attributes), and are thus distinguishable
from superordinate and subordinate terms: "It is at the basic level of categorization that people
conceptualize things as perceptual and functional gestalts" (Taylor, 1989, p. 48). A basic-level category,
therefore, is one for which human beings can easily find prototypes or exemplars, as well as less
prototypical members. Subordinate-level categories, in turn, operate in terms of prototypes or fuzzy
boundaries: some are better members than others, but all are valid to some degree because they are
cognitively salient along a sliding scale. We can also extend this fuzzy-boundary approach to the other
levels (basic-level and superordinate) to account for all kinds of mixed genres and super-genres (e.g., to
what degree can Shakespeare's dramas be said to be different from poetry? When does good advertising
become a form of literature or vice versa?).
Steen (1999) applies the idea of basic-level categories and their prototypes to the conceptualisation of
genre as follows:
It is presumably the level of genre that embodies the basic level concepts, whereas
subgenres are the conceptual subordinates, and more abstract classes of discourse are the
superordinates. Thus the genre of an advertisement is to be contrasted with that of a
sermon, a recipe, a poem, and so on. These genres differ from each other on a whole
range of attributes … The subordinates of the genre of the advertisement are less distinct
from each other. The press advertisement, the radio commercial, the television
commercial, the Internet advertisement, and so on, are mainly distinguished by one
feature: their medium. The superordinate of the genre of the ad, advertising, is also
systematically distinct from the other superordinates by means of only one principal
attribute, the one of domain: It is "business" for advertising, but it exhibits the respective
values of "religious", "domestic" and "artistic" for the other examples. [all italics added]
(p. 112)
Basically, Steen is proposing that we can recognise genres by their cognitive basic-level status: True
genres, being basic-level, are maximally distinct from one another (in terms of certain "attributes" to be
discussed below), whereas members at the level of sub-genre (which operate on a prototype basis) or
"super-genre"8 have fewer distinctions among themselves.
The proposal is for genres to be treated as basic-level categories which are characterised by
(provisionally) a set of seven attributes: domain (e.g., art, science, religion, government), medium (e.g.,
spoken, written, electronic), content (topics, themes), form (e.g., generic superstructures, à la van Dijk
(1985), or other text-structural patterns), function (e.g., informative, persuasive, instructive), type (the
rhetorical categories of "narrative," "argumentation," "description," and "exposition") and language
(linguistic characteristics: register/style[?]). Steen offers only a preliminary sketch of this approach to
genre (and hence to a taxonomy of discourse), and, as it stands, it appears to be too biased towards written
genres. Other attributes can (and should) be added: for example, setting or activity type, to distinguish a
broadcast interview from a private interview; or audience level, to distinguish public lectures from
university lectures (and both attributes to distinguish the latter from school classroom lessons). Another
point is that dependencies among the attributes exist (many values for domain, medium, and content are
typically co-selected, for instance). Nevertheless, the approach looks like a promising one, and when fully
developed will help us sort out genres from sub-genres.
"GENRES" IN CORPORA
Applying this "fuzzy categories" way of looking at genre to corpus studies, we can see that the categories
to which texts have been assigned in existing corpora are sometimes genres, sometimes sub-genres,
sometimes "super-genres" and sometimes something else altogether. (This is undoubtedly why the catch-all
term "text category" is used in the official documentation for the LOB and ICE-GB corpora. Most of
these "text categories" are equivalent to what I am calling "genres" in the BNC Index.) For example,
consider the ICE-GB corpus categories in Table 3.
Table 3. Text Categories in ICE-GB (figures in parentheses indicate the number of 2,000-word texts in
each category)

Medium I      Medium II (?) or      Super-genre or                      Genres or Sub-genres
              Interaction Type (?)  Function
SPOKEN (300)  Dialogue (180)        Private (100)                       face-to-face conversations (90)
                                                                        phone calls (10)
                                    Public (80)                         classroom lessons (20)
                                                                        broadcast discussions (20)
                                                                        broadcast interviews (10)
                                                                        parliamentary debates (10)
                                                                        legal cross-examinations (10)
                                                                        business transactions (10)
              Monologue (100)       Unscripted (70)                     spontaneous commentaries (20)
                                                                        unscripted speeches (30)
                                                                        demonstrations (10)
                                                                        legal presentations (10)
                                    Scripted (30)                       broadcast talks (20)
                                                                        non-broadcast speeches (10)
              Mixed (20)                                                broadcast news (20)
WRITTEN (200) Non-Printed (50)      Non-professional writing (20)       student essays (10)
                                                                        student examination scripts (10)
                                    Correspondence (30)                 social letters (15)
                                                                        business letters (15)
              Printed (150)         Academic writing (40)               humanities (10)
                                                                        social sciences (10)
                                                                        natural sciences (10)
                                                                        technology (10)
                                    Non-academic writing (40)           humanities (10)
                                                                        social sciences (10)
                                                                        natural sciences (10)
                                                                        technology (10)
                                    Reportage (20)                      press news reports (20)
                                    Instructional writing (20) [shaded] administrative/regulatory (10)
                                                                        skills/hobbies (10)
                                    Persuasive writing (10) [shaded]    press editorials (10)
                                    Creative writing (20)               novels/stories (20)
The top row of the table is my attempt at describing what attribute(s) or levels the terms within each
column represent. The terms within the last column are what end-users of the corpus normally work with,
and can be seen to be either genres or sub-genres, viewed from a prototype perspective (e.g., "broadcast
interview" is probably best seen as a sub-genre of "interview," differing mainly in terms of the setting,
and business letters differ from social letters mainly in terms of domain). Most of the terms in the third
column can be said to describe "super-genre" or "super-super-genres," with the exception of "instructional
writing" and "persuasive writing" (shaded), which seem more like functional labels.9
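The figures in Table 3 can be checked for internal consistency with a few lines of code. The following is a minimal sketch in Python (not a tool used in this paper). The nested dictionary is simply one way of transcribing the table, and the names ICE_GB and total are illustrative assumptions rather than part of any ICE-GB distribution.

```python
# Text counts transcribed from Table 3 (ICE-GB). The nesting mirrors the
# table's columns; it is one reading of the table, not an official format.
FOUR_FIELDS = {"humanities": 10, "social sciences": 10,
               "natural sciences": 10, "technology": 10}

ICE_GB = {
    "SPOKEN": {
        "Dialogue": {
            "Private": {"face-to-face conversations": 90, "phone calls": 10},
            "Public": {
                "classroom lessons": 20, "broadcast discussions": 20,
                "broadcast interviews": 10, "parliamentary debates": 10,
                "legal cross-examinations": 10, "business transactions": 10,
            },
        },
        "Monologue": {
            "Unscripted": {
                "spontaneous commentaries": 20, "unscripted speeches": 30,
                "demonstrations": 10, "legal presentations": 10,
            },
            "Scripted": {"broadcast talks": 20, "non-broadcast speeches": 10},
        },
        "Mixed": {"broadcast news": 20},
    },
    "WRITTEN": {
        "Non-Printed": {
            "Non-professional writing": {
                "student essays": 10, "student examination scripts": 10,
            },
            "Correspondence": {"social letters": 15, "business letters": 15},
        },
        "Printed": {
            "Academic writing": dict(FOUR_FIELDS),
            "Non-academic writing": dict(FOUR_FIELDS),
            "Reportage": {"press news reports": 20},
            "Instructional writing": {
                "administrative/regulatory": 10, "skills/hobbies": 10,
            },
            "Persuasive writing": {"press editorials": 10},
            "Creative writing": {"novels/stories": 20},
        },
    },
}


def total(node):
    """Sum the text counts at or below a node of the hierarchy."""
    if isinstance(node, int):
        return node
    return sum(total(child) for child in node.values())


if __name__ == "__main__":
    for medium, subtree in ICE_GB.items():
        print(medium, total(subtree))
    print("ALL", total(ICE_GB))
```

On this reading, the leaf counts add up to the subtotals printed in the table (300 spoken and 200 written texts), giving 500 texts in all, or about one million words at 2,000 words per text.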
The British National Corpus (BNC), in contrast, has no text categorisation for written texts beyond that of
domain, and no categorisation for spoken texts except by "context" and demographic/socio-economic
classes. The following diagram shows the breakdown of the BNC:
Figure 4. Domains in the British National Corpus (BNC)
It can be seen that for the written texts, domains are broad "subject fields" (see Burnard, 1995). These are
closely paralleled for the spoken texts by even broader "context" categories covering the major spheres of
social life (leisure, business, education, and institutional/public contexts). Apart from considering all the
demographically sampled conversations as constituting one super-genre of "casual conversation" and all
the written imaginative texts as forming a super-genre "literature," genres cannot easily be found at all
under the current domain scheme. More about these BNC categories and their (non-) usefulness will be
said in later sections.
Moving on to the LOB corpus (Table 4), we see that it is mostly composed of a mixture of genre and
sub-genre labels:
Table 4. Genres in the LOB Corpus

LOB Corpus (Written)
  Press: reportage
  Press: editorial
  Press: reviews
  Religion
  Skills, trades & hobbies
  Popular Lore
  Belles lettres, biography, essays
  Misc (gov docs, foundation reports, industry reports, college reports, in-house organ)
  Learned/scientific writings
  General fiction
  Mystery & detective fiction
  Science fiction
  Adventure & western fiction
  Romance & love story
  Humour
Examined in terms of Steen's genre attributes, the shaded cells in Table 4 above are clearly sub-genres of
some general super-genre of "fiction" (both "novels" and "short stories" -- the basic-level genres in
Steen's taxonomy -- are included). "Religion," on the other hand, appears to be a domain label since it
brings together disparate books, periodicals and tracts whose principal common feature is that they are
concerned with religion (in this case Christianity).10
Why do we have all these different levels or types of categorisation? It is tempting to believe that this is
the case because the corpus compilers felt that these were the most useful, salient, or interesting
categories -- perhaps these are basic-level genres, or prototypical sub-genres (especially those which keep
appearing in different corpora). But is it a problem that the categories differ in terms of their defining
attributes and in terms of generality? My personal opinion is that it is not. Cranny-Francis (1993, p. 109)
touches on this point and asks:
If "genre" has this range of different meanings and classificatory procedures -- by formal
characteristics, by field -- we might ask what is its value? Why is it so useful to
educators, linguists and critics, as well as to publishers, filmmakers, booksellers, readers
and viewers?
She suggests that the reason is simply because genre "is never simply formal or semantic [based on field
or subject area] and it is not even simply textual." Using the terms as defined in this paper, we could
paraphrase this to read, "genre is never just about situated linguistic patterns (register), functional
co-occurrences of linguistic features (text types), or subject fields (domain), and it is not even simply about
text-structural/discoursal features (e.g., Martin's [1993] generic stages, Halliday & Hasan's [1985] GSPs,
van Dijk's [1985] macrostructures, etc.)." It is, in fact, all of these things. This makes it a messy and
complex concept, but it is also what gives it its usefulness and meaningfulness to the average person.
The corpus categories we have seen are all genres (whether sub- or super-genres or just plain basic-level genres).
The point of all this is that we need not be unduly worried about whether we are working with genres,
sub-genres, domains, and so forth, as long as we roughly know what categories we are working with and
find them useful. We have seen that the categories used in various corpora are not necessarily all "proper"
genres in a traditional/rhetorical sense or even in terms of Steen's framework, but they can all be seen as
"genres" at some level in a fuzzy-category, hierarchical approach. A genre is a basic-level category,
which has specified values for most of the seven attributes suggested above and which is maximally
distinct from other categories at the same level. "Sub-genres" and "super-genres" are simply other (fuzzy)
ways of categorising texts, and have their uses too. The advantages of the prototype approach are that (a)
gradience or fuzziness between and within genres is accorded proper theoretical status, and (b)
overlapping of categories is not a problem (thus texts can belong to more than one genre).
From one point of view, until we have a clear taxonomy of genres, it may be advisable to put most of our
corpus genres in quotation marks, because genre is also often used in a folk linguistic way to refer to any
more-or-less coherent category of text which a mature, native speaker of a language can easily recognise
(e.g., newspaper articles, radio broadcasts), and there are no strict rules as to what level of generality is
allowable when recognising genres in this sense. In a prototype approach, however, it does not seriously
matter. Some text categories may be based more on the domain of discourse (e.g., "business" is a domain
label in the BNC for any spoken text produced within a business context, whether it is a committee
meeting or a monologic presentation). Spoken texts, which tend to be even more loosely classified in
corpus compilations, may simply be categorised on whether they are spontaneous or planned, broadcast or
spoken face-to-face, as in the London-Lund Corpus, for instance, which means the categories are "genres"
only in a very loose sense. This goes to show that there are still serious issues to grapple with in the
conceptualisation of spoken genres (written ones are, in contrast, typically easier to deal with) but that a
prototype approach, with its many levels of generality and a set of defining attributes, may help to tighten
up our understanding.
These brief visits to the various corpora suggest that there should not be any serious objections
(theoretical or otherwise) to the use of the term genre to describe most of the corpus categories we have
seen. Such usage reflects a looser approach, but there is no requirement for genres to actually be
established literary or non-literary genres, only for them to be culturally recognisable as groupings of
texts at some level of abstraction. The various corpora also show us that the recognition of genres can be
at different levels of generality (e.g., "sermons" vs. "religious discourse"). In the LOB corpus, the
category labels appear to be a mix: some are sub-genre labels (e.g., "mystery fiction" and "detective
fiction"), while others are more properly seen as domain labels ("Skills, trades, & hobbies," "Religion").
My own preferred approach with regard to developing a categorisation scheme is to use genre categories
where possible, and domain categories where they are more practical (e.g., "Religion"11).
THE BNC JUNGLE: THE NEED FOR A PROPER NAVIGATIONAL MAP
Having clarified some of the terminology and concepts and looked at the categories used in a few existing
corpora, I want to move on to consider some of the problems with the British National Corpus as it now
stands, and then introduce a new resource called the BNC Index which (it is hoped) will make it easier for
researchers and language learners/teachers to navigate through the numerous texts to find what they need.
Some Existing Problems
Overly Broad Categories. The first problem that prompts the need for a navigational map has to do with
the broadness and inexplicitness of the BNC classification scheme. For example, academic and
non-academic texts under the domains "Applied Science," "Arts," "Pure/Natural Science," "Social Science,"
and so forth, are not explicitly differentiated. (It is interesting to note, in this connection, that under the
attribute of "genre" in the "text typology" of Atkins et al., 1992, p. 7, no mention is made of the useful
distinction between academic and non-academic prose, even though this is employed in one of the earliest
corpora, the LOB corpus, where the "learned" category has proved to be among the most popular with
linguists.)
Another example that points to the inadequacy of the BNC's categorisation of texts is the way
"imaginative" texts are handled. A wide variety of imaginative texts (novels, short stories, poems, and
drama scripts) is included in the BNC, which is a good thing because the LOB, for example, does not
contain poetry or drama. However, such inclusions are practically wasted if researchers are not actually
able to easily retrieve the sub-genres on which they want to work (e.g., poetry) because this information is
not recorded in the file headers or in any documentation associated with the BNC. There is at present no
way to know whether an "imaginative" text actually comes from a novel, a short story, a drama script or a
collection of poems (unless the title actually reflexively includes the words "a novel" or "poems by
XYZ"). For example, given text files with titles like "For Now" or "The kiosk on the brink," there is no
way of knowing that both of these are actually collections of poems. All the BNC bibliography and file
headers tell us is that these are "imaginative" texts, taken from "books."
Classification Errors and Misleading Titles. In the process of some previous research, I found that there
were many classificatory mistakes in the BNC (and also in the BNC Sampler): some texts were classified
under the wrong category, usually because of a misleading title. For the same reason, even though a
limited, computer-searchable bibliographical database of the BNC texts exists12 (compiled by Adam
Kilgarriff), not enough information is included there, and researchers cannot always rely on the titles of
the files as indications of their real contents: For example, many texts with "lecture" in their title are
actually classroom discussions or tutorial seminars involving a very small group of people, or were
popular lectures (addressed to a general audience rather than to students at an institution of higher
learning). A good reason for a navigational map, then, is so that we can go beyond the existing
information we have about the BNC files (and beyond the mistakes) and provide genre classifications,
so that researchers do not have just the titles of files to go on.
Sub-Genres Within a Single File. Another problem, which will only be touched on briefly because there
is no real solution, is that some BNC files are too big and ill-defined in that they contain different genres
or sub-genres. For example, newspaper files described in the title as containing "editorial material"
include letters-to-the-editor, institutional editorials (those written by the editor), and personal editorials
(commentaries/personal columns written by journalists or guest writers), and some courtroom files
contain both legal cross-examinations (which are dialogic) as well as legal presentations (summing-up
monologues by barristers or judges). This is a problem for lines of linguistic enquiry that rely on
relatively homogeneous genres. It is a problem, however, which cannot be solved easily because the
splitting of files is beyond the scope of most end-users of the BNC. The problem is just mentioned here as
a caution to researchers.
Domains Versus Genres: The BNC Sampler & Why We Need Genre Information
The BNC Users' Reference Guide states that only three criteria were used to "balance" the corpus:
domain, time, and medium. In choosing texts for inclusion into the BNC Sampler (the 2-million word
sub-set of the BNC), domain was probably the most important criterion used to ensure a wide-enough
coverage of a variety of texts. On the BNC Web page for the Sampler, the following comment on its
representativeness is made:
In selecting from the BNC, we tried to preserve the variety of text-types represented, so
the Sampler includes in its 184 texts many different genres [italics added] of writing and
modes of speech.
It should be noted that no real claim to representativeness is made, and that what they really meant was
that many different texts were chosen on the basis of domain and other criteria.13 The fact that the
Sampler contains many different genres is not in doubt, but the texts were not chosen on this basis, since
they had no genre classification, and hence the Sampler cannot (and, indeed, it does not) claim to be
representative in terms of "genre."
It is my belief that it is because "domain" is such a broad classification in the BNC that the Sampler
turned out to be rather unrepresentative of the BNC and of the English language. Anyone wishing to use
the Sampler should be under no illusion that it is a balanced corpus or that it represents the full range of
texts as in the full BNC. The Sampler may be broadly balanced in terms of the domains, but when broken
down by genre, a truer picture emerges of exactly how (un)representative it really is. Appendix A lists
missing or unrepresentative genres in the BNC Sampler which demonstrate this.
"Genre" is perhaps a more insightful classification criterion than "domain," at least as far as getting a
representatively balanced corpus is concerned. If the compilers of the BNC Sampler had known the genre
membership of each BNC text, they would probably have created a more balanced and representative
sub-corpus. As things stand, however, any conclusions about "spoken English" or "written English" made on
the basis of the BNC Sampler will have to be evaluated very cautiously indeed, bearing in mind the
genres missing from the data.
There is another example of how large, undifferentiated categories similar to domain can unhelpfully
lump disparate kinds of text together. Wikberg (1992) criticises the LOB text category E ("Skills, trades,
and hobbies") as being too baggy or eclectic. He demonstrates how, on the evidence of both external and
internal criteria, the texts in Category E can actually be better sub-classified into "procedural" versus
"non-procedural" discourse. He also notes that it is not just text categories that can be heterogeneous.
Sometimes texts themselves are "multitype" or mixed in terms of having different stages with different
rhetorical or discourse goals. He thus concludes with the following comment:
An important point that I have been trying to make is that in the future we need to pay
more attention to text theory when compiling corpora. For users of the Brown and the
LOB corpora, and possibly other machine-readable texts as well, it is also worth noting
the multitype character of certain text categories. (p. 260)
This is a piece of advice worth noting.
THE BNC (BIBLIOGRAPHICAL) INDEX
The BNC Index spreadsheet I am about to describe was created as one solution to the previously
mentioned problems and difficulties. It is similar to the plain-text files prepared by Adam Kilgarriff, which I
have benefited from and found rather useful.14 However, those files do not contain all the details which
are needed for compiling your own sub-corpus (author type, author age, author sex, audience type,
audience sex, section of text sampled, [topic] keywords, etc.). Sebastian Hoffmann's files were useful too,
in a complementary way, but these do not include (a) keywords and (b) the full bibliographical details of
files. A third existing resource, the "bncfinder.dat" file that comes with the standard distribution of the
BNC (version 1) has most of the header information, but in the form of highly abbreviated numeric codes,
and also does not include any bibliographical information about the files or keywords. The BNC Index
consolidates the kinds of information available in the above three resources, but, in addition, includes (a)
BNC-supplied keywords (as entered in the file headers by the compilers); (b) COPAC keywords15 for
published non-fiction texts16 (topic keywords entered by librarians); (c) full bibliographical details
(including title, date and publisher for written texts, and number of participants for spoken files); (d) an
extra level of text categorisation, "genre," where each text is assigned to one of the 70 genres or
sub-genres (24 spoken and 46 written) developed for the purposes of this Index; (e) a column supplying
"Notes & Alternative Genres," where texts which are interdisciplinary in subject matter or which can be
classified under more than one genre are given alternative classifications. Also entered here are extra
notes about the contents of files (e.g., where a single BNC file contains several sub-genres within it, such
as postcards, letters, faxes, etc., these are noted). These extra notes are the result of random, manual
checks: not all files have been subjected to such detailed analysis. For some written texts taken from
books, the title of the book series is also given under this column (e.g., file BNW, "Problems of
unemployment and inflation," is part of the Longman book series "Key issues in economics and
business").
It is hoped that this will be a comprehensive, user-friendly, "one-stop" database of information on the
BNC. All the information is presented using a minimum of abbreviations or numeric codes, for ease of
use. For example, m_pub (for "miscellaneous published") is used instead of a cryptic numeric code for the
medium of the text, and domains are likewise indicated by abbreviated strings (e.g., W_soc_science,
S_Demog_AB) rather than numbers. It should be noted that I carried out the genre categorisation of all the
texts by myself: This ensures consistency, but it also means that some decisions may be debatable. The
pragmatic point of view I am taking is that something is better than nothing, and that it is beneficial to
start with a reasonable genre categorisation scheme and then let end-users report problems/errors and
dictate future updates and improvements.
When compiling a sub-corpus for the purpose of research, classroom concordancing, genre-based
learning, and so forth, you need all the available information you can get. With the BNC Index, it is now
possible, for example, to separate children's prose fiction from adult prose fiction by combining
information from the "audience age" field and the newly introduced "genre" field (using domain alone
would have included poems as well).
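The kind of field combination just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three rows and their file IDs are invented, and the labels "W_imaginative", "W_fict_prose", "W_fict_poetry" and "child" are hypothetical stand-ins for whatever values the Index actually uses in its "Domain", "Genre" and "Audience Age" fields.

```python
# Hypothetical rows: the file IDs and the labels "W_imaginative",
# "W_fict_prose", "W_fict_poetry" and "child" are illustrative assumptions,
# not values confirmed by the description of the Index.
ROWS = [
    {"File ID": "X01", "Domain": "W_imaginative",
     "Genre": "W_fict_prose", "Audience Age": "child"},
    {"File ID": "X02", "Domain": "W_imaginative",
     "Genre": "W_fict_prose", "Audience Age": "adult"},
    {"File ID": "X03", "Domain": "W_imaginative",
     "Genre": "W_fict_poetry", "Audience Age": "child"},
]


def select(rows, criteria):
    """Return the File IDs of rows whose fields equal every criterion."""
    return [
        row["File ID"]
        for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


# Genre plus audience age isolates children's prose fiction.
childrens_prose = select(
    ROWS, {"Genre": "W_fict_prose", "Audience Age": "child"})

# Domain plus audience age is too coarse: the poetry file is included too.
by_domain_only = select(
    ROWS, {"Domain": "W_imaginative", "Audience Age": "child"})

if __name__ == "__main__":
    print(childrens_prose)
    print(by_domain_only)
```

With these invented rows, the genre-based query returns only the prose file for children, whereas the domain-based query also picks up the poetry file, which is the difference the "genre" field is meant to capture.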
All the information in the spreadsheet is up-to-date and as accurate as possible, and supersedes the
information given in the actual file headers and the "bncfinder.dat" file distributed with the BNC (version
1), both of which are known to contain many errors. Changes and corrections to erroneous classifications
were made both after extensive manual checks and on the basis of error reports made by others. The
following section lists and explains all the columns/fields of information given in the BNC Index. Some
of the genre categories are still being worked on, however, and may change in the final release of the
Index.
Notes on the BNC Index
For spoken files, there are only eight relevant fields of information, giving the following self-explanatory
details (abbreviations are explained in Table 6):17
File ID:                  FLX
Domain:                   S_cg_education
Genre:                    S_classroom
Keywords:                 natural & pure science; chemistry
Word Total:               5,142
Interaction Type:         Dialogue
Mode:                     S
Bibliographical Details:  11th year science lesson: lecture in chemistry of metal processing
                          (Edu/inf). Rec. on 23 Mar 1993 with 2 partics, 381 utts
Note that Mode only distinguishes broadly between spoken (S) and written (W). To further restrict
searches to only "demographic" files or only "context-governed" files, the Domain field should be used.
For written files, there can be up to 19 fields of information (depending on the file: fields which do not
apply to a particular file are left blank). As an example, the entry for AE7 is as follows:
File ID:                     AE7
Medium:                      book
Domain:                      W_nat_science
Genre:                       W_non_ac_nat_science
Notes & Alternative Genres:  Also W_non_ac_humanities_arts
COPAC Keywords:              Biology Philosophy
Keywords:                    molecular genetics
Bibliographical details:     The problems of biology. Maynard Smith, John. Oxford: OUP, 1989, pp.
                             9-109. 1686 s-units.
Total Words:                 36,115
Sampling:                    mid
Circulation Status:          M
Period Composed:             1985-1994
Audience Age:                adult
Audience Sex:                mixed
Audience Level:              high
Mode:                        W
Author Age:                  60+ yrs
Author Sex:                  Male
Author Type:                 Sole
The information fields are explained more fully in the BNC User's Reference Guide, but here is a brief
explanation of some of them:
The table above tells us that file AE7 is a sample extracted from the middle (Sample Type) of a book
(Medium), whose Circulation Status is Medium (this refers to the number of receivers of the text),18
whose author (Author Age/Sex/Type) is 60+ yrs old (age band 6 in terms of BNC codes), is Male and is
the Sole author of the text. The text has been manually classified as "non_academic prose, natural
sciences" (Genre), although it also deals with philosophical issues (COPAC Keywords) and thus may also
be considered under "W_non_ac_humanities_arts." The target audience for the text are adults, of both
sexes (mixed), and high-level (original BNC numerical code="level 3"). The BNC compilers have
classified it under "natural sciences" (Domain),19 and the text was composed in the period 1985-1994
(Period Composed).20 The Bibliographical Details field gives us the title of the text (The Problems of
Biology), its author, publisher, and so forth, and an indication of the number of sentences ("s-units"),
while the (BNC compilers') Keywords field supplies the detail that the book is about molecular genetics
(COPAC and BNC keywords tend to be about topic, and are sometimes useful for sub-genre
identification). The page numbers under Bibliographical Details were, in this case and many others, not
actually given in the original BNC bibliography, but were manually added to the Index after I had
searched in the file for the page break SGML elements. This is to allow proper, complete referencing (the
original bibliographical reference would have been "pp. ??"). However, some files did not have page
breaks encoded at all, and thus their bibliographical references remain incomplete.
A list of all possible values for the closed-set fields (the keyword fields are open-ended) is given in
Appendix B.
With all these fields of information put together in one database/spreadsheet, where they can be
combined with one another, it becomes easy to scan the BNC for whatever particular kinds of text you are
interested in.
Further Notes on the Genre Classifications
The genre categories used in the BNC Index were chosen after a survey of the genre categorisation
schemes of other existing corpora (e.g., LLC, LOB, ICE-GB) and will thus be familiar to users and
compatible with these other corpora, allowing comparative studies based on genres taken from different
corpora. These genre labels have been carefully selected to capture as wide a range as possible of the
numerous types of spoken and written texts in the English language, and the divisions are more fine-grained than the domain categories used in the BNC itself. Note that some genre labels are hierarchically
nested so that, for example, if you simply want to study "prototypical academic English" and are not
concerned with the sub-divisions into social sciences, humanities, and so forth, you can find all such files
by searching for "W_ac*" and specifying "high" for "audience level."21 Or if you are interested in the
language of the social sciences, whether spoken or written, you can similarly use wildcards to search for
"*_soc_science." In general, where further sub-genres can be generated on-the-fly through the use of
other classificatory fields, they are not given their own separate genre labels, to avoid clutter. For
instance, "academic texts" can be further sub-divided into "(introductory) textbooks" and "journal
articles," but since this can very easily be done by using the medium field (i.e., by choosing either "book"
or "periodical"), the sub-genres have not been given their own separate labels. Instead, end-users are
encouraged to use available fields to create their own sub-classificatory permutations. The "genre" labels
here are therefore meant to provide starting points, not a definitive taxonomy.
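The wildcard searches described above can be sketched in a few lines of Python. This is only a minimal illustration, not part of the Index itself: the record layout, the field names (`genre`, `medium`, `audience_level`), the file IDs, and the helper `select` are all assumptions made for the example, and the rows are invented rather than taken from the spreadsheet.

```python
from fnmatch import fnmatchcase

# Illustrative records only: genre labels come from the scheme, but the
# file IDs and the other field values are made up for this sketch.
SAMPLE_INDEX = [
    {"id": "X01", "genre": "W_ac_soc_science", "medium": "book", "audience_level": "high"},
    {"id": "X02", "genre": "W_ac_nat_science", "medium": "periodical", "audience_level": "high"},
    {"id": "X03", "genre": "W_non_ac_soc_science", "medium": "book", "audience_level": "medium"},
    {"id": "X04", "genre": "S_lect_soc_science", "medium": "---", "audience_level": "---"},
    {"id": "X05", "genre": "W_fict_prose", "medium": "book", "audience_level": "medium"},
]


def select(records, genre_pattern="*", **field_values):
    """Return records whose genre matches the wildcard pattern and whose
    other fields equal the requested values exactly."""
    return [
        record
        for record in records
        if fnmatchcase(record["genre"], genre_pattern)
        and all(record.get(field) == value for field, value in field_values.items())
    ]


# "Prototypical academic English": all W_ac* genres with a high audience level.
academic = select(SAMPLE_INDEX, "W_ac*", audience_level="high")

# Language of the social sciences, whether spoken or written.
social = select(SAMPLE_INDEX, "*_soc_science")

# A sub-genre generated on the fly: academic texts published in periodicals.
journal_articles = select(SAMPLE_INDEX, "W_ac*", medium="periodical")
```

Because the labels are hierarchical, a single pattern plus one or two exact field values is enough to express each of the searches mentioned; no separate label for "journal articles" is needed.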
Table 5 shows the breakdown of the genre categories used in the BNC Index spreadsheet more clearly
than in the earlier table, and also shows the super-genres that some researchers may want to study (made
possible by the use of hierarchical genre labels).
Table 5. Breakdown of BNC Genres in proposed classificatory scheme22

BNC SPOKEN                        Super Genre
S_brdcast_discussn                Broadcast
S_brdcast_documentary             Broadcast
S_brdcast_news                    Broadcast
S_classroom
S_consult
S_conv
S_courtroom
S_demonstratn
S_interview                       Interviews
S_interview_oral_history          Interviews
S_lect_commerce                   Lectures
S_lect_humanities_arts            Lectures
S_lect_nat_science                Lectures
S_lect_polit_law_edu              Lectures
S_lect_soc_science                Lectures
S_meeting
S_parliament
S_pub_debate
S_sermon
S_speech_scripted                 Speeches
S_speech_unscripted               Speeches
S_sportslive
S_tutorial
S_unclassified

BNC WRITTEN                       Super Genre
W_ac_humanities_arts              Academic prose
W_ac_medicine                     Academic prose
W_ac_nat_science                  Academic prose
W_ac_polit_law_edu                Academic prose
W_ac_soc_science                  Academic prose
W_ac_tech_engin                   Academic prose
W_admin
W_advert
W_biography
W_commerce
W_email
W_essay_sch                       Non-printed essays
W_essay_univ                      Non-printed essays
W_fict_drama                      Fiction23
W_fict_poetry                     Fiction23
W_fict_prose                      Fiction23
W_hansard
W_institut_doc
W_instructional
W_letters_personal                Letters
W_letters_prof                    Letters
W_misc
W_news_script
W_newsp_brdsht_nat_arts           Broadsheet national newspapers
W_newsp_brdsht_nat_commerce       Broadsheet national newspapers
W_newsp_brdsht_nat_editorial      Broadsheet national newspapers
W_newsp_brdsht_nat_misc           Broadsheet national newspapers
W_newsp_brdsht_nat_reportage      Broadsheet national newspapers
W_newsp_brdsht_nat_science        Broadsheet national newspapers
W_newsp_brdsht_nat_social         Broadsheet national newspapers
W_newsp_brdsht_nat_sports         Broadsheet national newspapers
W_newsp_other_arts                Regional & local newspapers
W_newsp_other_commerce            Regional & local newspapers
W_newsp_other_report              Regional & local newspapers
W_newsp_other_science             Regional & local newspapers
W_newsp_other_social              Regional & local newspapers
W_newsp_other_sports              Regional & local newspapers
W_newsp_tabloid                   Tabloid newspapers
W_non_ac_humanities_arts          Non-academic prose (non-fiction)
W_non_ac_medicine                 Non-academic prose (non-fiction)
W_non_ac_nat_science              Non-academic prose (non-fiction)
W_non_ac_polit_law_edu            Non-academic prose (non-fiction)
W_non_ac_soc_science              Non-academic prose (non-fiction)
W_non_ac_tech_engin               Non-academic prose (non-fiction)
W_pop_lore
W_religion
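Since the super-genres in Table 5 follow from the hierarchical form of the labels, they can be recovered from a label's prefix alone. The sketch below shows one way this might be done; the function name `super_genre` and the prefix-matching approach are assumptions made for illustration, while the prefix/super-genre pairs are read off Table 5.

```python
# Prefix-to-super-genre pairs read off Table 5. The matching strategy
# (first prefix that fits) is an illustrative assumption.
SUPER_GENRES = [
    ("S_brdcast_", "Broadcast"),
    ("S_interview", "Interviews"),
    ("S_lect_", "Lectures"),
    ("S_speech_", "Speeches"),
    ("W_ac_", "Academic prose"),
    ("W_essay_", "Non-printed essays"),
    ("W_fict_", "Fiction"),
    ("W_letters_", "Letters"),
    ("W_newsp_brdsht_nat_", "Broadsheet national newspapers"),
    ("W_newsp_other_", "Regional & local newspapers"),
    ("W_newsp_tabloid", "Tabloid newspapers"),
    ("W_non_ac_", "Non-academic prose (non-fiction)"),
]


def super_genre(label):
    """Return the super-genre for a genre label, or None if the label
    stands on its own (e.g., S_conv or W_admin)."""
    for prefix, name in SUPER_GENRES:
        if label.startswith(prefix):
            return name
    return None
```

For example, `super_genre("S_lect_nat_science")` gives "Lectures", while a label such as "S_conv" belongs to no super-genre and yields `None`.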
It will be noted that aspects of this genre classification scheme mirror the ICE-GB corpus (see Table 5 for
the ICE-GB categories), although I have made finer distinctions in some cases (e.g., the lecture and
broadsheet sub-genres) and grouped texts differently (e.g., I have "nested" all broadsheet newspaper
material together rather than into separate functional groups as in the ICE-GB (cf. "Reportage" and
"Persuasive writing" in Table 5)).
In some respects, the scheme also follows the Lancaster-Oslo/Bergen (LOB) corpus quite closely. This
was done deliberately, to facilitate diachronic/comparative research.24 For example, here is how the
various subject disciplines are categorised in the LOB corpus and in the BNC Index:
Table 6. LOB Corpus Categories Broken Down into Component Disciplines

LOB (& BNC Index) Category     Subjects/Disciplines
Humanities                     Philosophy, History, Literature, Art, Music
Social sciences                Psychology, Sociology, Linguistics, Social Work
Natural sciences               Physics, Chemistry, Biology
Medicine                       -
Politics, Law, Education       -
Technology & Engineering       Computing, Engineering
One difference from the LOB corpus is that economics texts in the BNC Index are not put under "politics,
law and education," but are instead put under the "W_commerce" genre. Also, archaeology and
architecture have been classified as humanities or arts subjects under the present scheme, while
geography is classed either as a social or natural science depending on the branch of geography. Geology
has been classed as a natural science. One mathematics textbook file for primary/elementary schools was
simply put under "miscellaneous," while university-level mathematical texts were put under either
"natural_sciences" or "technology & engineering" depending on whether they were pure or applied.25
It should also be noted that some texts are a mixture of disciplines (e.g., history and politics often go hand
in hand, but the two are separate categories under this scheme). In such cases, a more or less arbitrary
assignment was made, based on what was judged to be the dominant point of view in the text, and, in the
case of printed publications, after consultation of the keywords for the text in library catalogues (see
discussion which follows).
Some genres are deliberately broad because they can be easily sub-divided using other fields. For
example, "institutional documents" includes government publications (including "low-brow"
informational booklets and leaflets/brochures), company annual reports, and university calendars and
prospectuses. However, these texts can be fairly easily separated out using "Medium," "Audience level,"
or "Keywords."
The "non-academic" genres relate to written texts (mainly books) sometimes called "non-fiction" which
have subject matters belonging to one of the disciplines listed above. They are usually texts written for a
general audience, or "popularisations" of academic material, and are thus distinguished from texts in the
parallel academic genres (which are targeted at university-level audiences, insofar as this can be
determined). In deciding whether a text was academic or not, a variety of cues was used: (a) the "audience
level (of difficulty)" estimated by the BNC compilers (coded in the file headers); (b) whether COPAC lists
the book as being in the "short loan" collections of British universities (this works in one direction only:
absence is not indicative of a work not being academic); and (c) the publisher and publication series (academic
publishers form a small and recognisable set, and some books have academic series titles, which help to
place them in context).
The spoken "lecture" genres in the Index refer only to university lectures. Thus, many "A"-level or non-university lectures are classified as "S_speech_unscripted." Similarly, "S_tutorial" refers only to
university-level tutorials or classroom "seminars." Other non-tertiary-level or home tutorial sessions are
classified under "S_classroom."
Genre labels are deliberately non-overlapping for spoken and written texts. For example, parliamentary
speeches audio-transcribed by the BNC transcribers are labelled "S_parliament" for the spoken corpus,
whereas the parallel, official/published version is labelled "W_hansard" for the written corpus. Also, for
spoken texts, the "leftover" files (which do not really belong to any of the other spoken genres used in this
scheme, e.g., baptism ceremony, auctions, air-traffic control discourse, etc.) are labelled as
"S_unclassified," whereas leftover written files are labelled "W_misc."
As mentioned in the first part of this paper, deciding what a coherent genre or sub-genre is can be far
from easy in practice, as (sub-)genres can be endlessly multiplied or sub-divided quite easily. Moreover,
the classificatory decisions of corpus compilers may not necessarily be congruent with those of researchers.
For example, what is considered "applied science"? In the present scheme, "applied science" excludes
medicine (which is instead placed in a category of its own), engineering (which is put under
"technology"), and computer science (also under "technology"). For the purposes of the BNC Index, a
particular "level of delicacy" has been decided on for the genre scheme, based on categories already in
use in existing corpora and in the research literature. Users may further sub-divide or collapse/combine
genres as they see fit. The present scheme is only an aid; it helps to narrow down the scope of any sub-corpus-building task. In this connection, it should be noted that due to the way the material was recorded
and collated, many of the spoken files (especially "conversation") are less well-defined than the written
ones because they are made up of different task and goal types, as well as varying topics and participants
(e.g., a single "conversation" file can contain casual talk between both equals and unequals, and "lecture"
files often contain casual preambles and concluding remarks in addition to the actual lectures themselves).
Researchers wanting discoursally well-defined and homogeneous texts will have to sub-divide texts
themselves.
If the distribution of linguistic features among "genres" is important to a particular piece of research, then
obviously the research can be affected or compromised by the definition/constitution of the "genres" in
the first place. For this reason, users of the BNC Index are advised to read the notes/documentation given
here, and to be clear what the various domain and genre labels mean.26 To illustrate: the BNC compilers
have classified some texts into the "natural/pure sciences" domain (e.g., text CNA, which is taken from
the British Medical Journal), which I would consider as belonging to "applied science" or else simply
"medicine" as a separate category. On the other hand, the BNC compilers appear to have a rather loose
definition of "applied science." Anything which is not directly classifiable or recognisable as being purely
about theoretical physics, chemistry, biology or medicine is apparently considered "applied." For
example, consider
Text ID   Medium   Domain          Bibliographical Details
FYX       book     W_app_science   Black holes and baby universes. Hawking, Stephen W. London: Bantam (Corgi), 1993, pp. 1-139. 1927 s-units.
AMS       book     W_app_science   Global ecology. Tudge, Colin. London: Natural History Museum Pub, 1991, pp. 1-98. 1816 s-units.
AC9       book     W_app_science   Science and the past. London: British Museum Press, 1991, pp. ??. 1696 s-units.
The first book is a popularisation by Stephen Hawking and is an application of physics to the study of the
universe or outer space. In the BNC Index genre scheme, I would consider this to be part of the "non-academic natural sciences" genre (rather than "applied science"). It is a similar situation with the second
and third books (which concern ecology and archaeological/historical work, respectively). It is true that
these are also about applying scientific ideas in some way, but they do not quite fit in with the more
common understanding of "applied science." In the present scheme, text AMS would be under "academic:
natural science," and AC9 under "non-academic: humanities."
As another example of the classificatory system used here, consider the case of linguistics. Some
linguists, including myself, would consider our discipline to be a social science (although others would
place us in the humanities). In any case, consider the way the following BNC texts were (inconsistently)
classified by the compilers:
Text ID   Medium       Domain          Details
B2X       periodical   W_app_science   Journal of semantics. Oxford: OUP, 1990, pp. 321-452. 847 s-units.
CGF       book         W_arts          Feminism and linguistic theory. Cameron, Deborah. Basingstoke: Macmillan Pubs Ltd, 1992, pp. 36-128. 1581 s-units.
EES       m_unpub      W_app_science   Large vocabulary semantic analysis for text recognition. Rose, Tony Gerard. u.p., n.d., pp. ??. 2109 s-units.
FAC       book         W_soc_science   Lexical semantics. Cruse, D A. Cambridge: CUP, 1991, pp. 1-124. 2261 s-units.
FAD       book         W_soc_science   Linguistic variation and change. Milroy, J. Oxford: Blackwell, 1992, pp. 48-160. 1339 s-units.
It may be the case that the actual content/topic of these linguistics-related texts makes them seem less like
social science texts than arts or applied science texts (e.g., text EES is a dissertation on computer
handwriting recognition by a student from a department of computing). But if so, what does that say about
the general public's understanding of domain labels like "linguistics" and "social sciences"? These
are important questions when one is seeking to draw conclusions about the distribution of linguistic
features found in particular genres. For the present purposes, therefore, one particular stand has been
taken on how to classify texts, and readers should bear this in mind. (In the case of the above example, all
were classified as "academic: social science" except EES, which was put under "academic: technology
and engineering.")
What About Library Classificatory Codes?
At this point, some people may be wondering if the classification systems used by libraries might be of
use in helping us determine the proper genre labels. Atkins et al. (1992, p. 8) note in their discussion of
the corpus attribute topic that "It is necessary to draw up a list of major topics and subtopics in the
literature. Library science provides a number of approaches to topic classification." This is an area that is
beyond my expertise and the scope of this article, but I will make a few brief comments here.27
Several library classification/cataloguing systems are in use all over the world. They are all principally
about subject areas (or topic) rather than about genre, although the two are, of course, related in many
cases. A familiar scheme, the Dewey Decimal Classification system, is shown in Table 7.
Table 7. Dewey Decimal Classification System

Classmark   [Broad Area] & Subject Areas
Class 0     [GENERALITIES] Generalities; Catalogues; Newspapers; Computing
Class 1     [PHILOSOPHY & PSYCHOLOGY] Philosophy; Psychology; The Mind
Class 2     [RELIGION] Religion
Class 3     [SOCIAL SCIENCES] Social Sciences; Law; Government; Society; Commerce; Education
Class 4     [LANGUAGE] Linguistics; Scientific Study of Language
Class 5     [NATURAL SCIENCES & MATHEMATICS] Pure Sciences; Mathematics; Physics; Chemistry; Biology
Class 6     [TECHNOLOGY (APPLIED SCIENCES)] Applied Sciences; Engineering; Medicine; Manufacturing
Class 7     [THE ARTS] The Arts; (Music, Drama etc.) Recreations; Hobbies
Class 8     [LITERATURE & RHETORIC] Literature
Class 9     [GEOGRAPHY & HISTORY] Geography; History; Information about localities
In addition to the Classmark, however, library materials are also given keywords which generally consist
of Library of Congress subject headings (usually related to topic[s]). These are very useful when it comes
to finding out what a text is about (or, in the case of fiction texts, what a text is).28 In the case of literary
texts, actual genre labels are sometimes given as keywords, and a frighteningly large number of sub-genres have been identified by the British Library cataloguers. These may prove useful to those who
desire detailed sub-genre information on literary texts. A few examples will suffice here: Adventure
stories, Detective and mystery stories, Picaresque literature, Robinsonades, Romantic suspense novels,
Sea stories, Spy stories, Thrillers, Allegories, Didactic fiction, Fables, Parables, Alternative histories,
Dystopias, Bildungsromane, Arthurian romances, Autobiographical fiction, Historical fiction, Satire,
Christmas stories, Medical novels, Folklore, Domestic fiction, Ghost stories, Horror tales, Magic realism,
Occult fiction, Feminist fiction, and Tall tales.
In addition to these fascinatingly categorised sub-genres,29 the library also includes "form headings,"
which are meant to "define a type of fiction in terms of specific presentation, provenance, intended
audience, form of publication."30 Examples include Young adult fiction, Children's stories, Readers
(Elementary), Plot-your-own stories, Diary fiction, Epistolary fiction, Movie novels, Scented books,
Glow-in-the-dark books, Toy and movable books, Graphic novels, Radio and television novels, Sound
effects books, Musical books, and Upside-down books.
As can be seen, therefore, library catalogues are a potentially valuable source of information as far as the
genre classification of fiction texts and the identification of subject topics in non-fiction texts are
concerned. Such information was, in fact, used in the process of creating the BNC Index, during the
manual stage of checking and correcting the initial genre classifications I had made.
Using the BNC Index
The BNC Index will be distributed in the Microsoft Excel® spreadsheet format as well as in a tab-delimited format (it will also be incorporated into two custom-built, user-friendly programs: see below).31
On a practical note, the advantage of using the Excel format is that there is a quick way of displaying only
the texts which match your chosen criteria through the use of the relatively user-friendly "Autofilter"
function (under the "Data" menu in the program, choose "Filter" and then "Autofilter"). With the
Autofilter switched on, the top row of every field (column) will have a drop-list which can be used to
instantly filter down to the texts you want displayed (clicking on the drop-list button reveals all the
possible values for that field (e.g., genre), and you just select the one you want). Fields are combinable, so
you can, for example, first restrict the display to only "social science" texts under domain, then further
restrict this to only "periodicals" under medium, and end up with social science periodicals. It is also
possible to make more advanced searches, by activating the "Custom" filter dialogue box from the
relevant drop-list. This will allow you to filter the fields using wildcards. One caveat needs to be issued to
users, however: They should not rely entirely on the genre labels, but should also check the "Alternative
Notes" column and scan/browse the files, too. For example, texts labelled "S_brdcast_discussn" also
contain news reportage (in between the broadcast talk shows/programmes). This is unavoidable, since
some BNC files combine genres and sub-genres and can only be labelled in terms of the majority type.
Some of the BNC-supplied fields are also not entirely accurate. Many of the files which are coded as
"monologue" (under the Interaction Type column), for example, actually include some dialogue as well
(i.e., they are mostly monologue, but not exclusively).
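For readers working with the tab-delimited version rather than Excel, the same step-by-step filtering can be approximated in a short script. The following is a sketch under stated assumptions: the column headers (`text_id`, `medium`, `domain`) and the helper names are hypothetical, and the five sample rows simply reuse the text IDs, media, and BNC domain codes quoted for the linguistics-related texts earlier in this article.

```python
import csv
import io

# Stand-in for the tab-delimited Index file. The column headers are
# assumed names; the rows reuse values quoted earlier in this article.
TSV = (
    "text_id\tmedium\tdomain\n"
    "B2X\tperiodical\tW_app_science\n"
    "CGF\tbook\tW_arts\n"
    "EES\tm_unpub\tW_app_science\n"
    "FAC\tbook\tW_soc_science\n"
    "FAD\tbook\tW_soc_science\n"
)


def load_index(handle):
    """Read tab-delimited rows into a list of dictionaries keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def autofilter(rows, **criteria):
    """Keep only rows whose fields equal every requested value, in the
    spirit of stacking drop-list selections one after another."""
    return [
        row for row in rows
        if all(row[field] == value for field, value in criteria.items())
    ]


rows = load_index(io.StringIO(TSV))

# First restrict by domain, then narrow the result further by medium.
applied = autofilter(rows, domain="W_app_science")
applied_periodicals = autofilter(applied, medium="periodical")
```

With a real file, `io.StringIO(TSV)` would be replaced by an opened file handle. As with the spreadsheet, the labels alone should not be trusted blindly; the notes columns and the files themselves still need to be checked.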
A stand-alone Windows® program, called BNC Indexer®, has been developed by Antonio Moreno Ortiz
using the information contained in my spreadsheet.32 A web-based facility, BNC Web Indexer, is also
being developed at Lancaster, which does essentially the same thing.33 Both programs are similar in
layout and function. They are much easier to use than the Excel spreadsheet since they do not require any
knowledge of spreadsheet/database programs and have very simple, intuitive interfaces (perfect for
classroom situations). All the information fields (domain, genre, audience age, author sex, etc.) and their
values are displayed on screen and users simply select the values they want to use and then press a button
to execute the query. A results panel shows all the texts which match the filtering criteria, along with
bibliographical and other information. (With BNC Indexer, individual texts can also be deselected from
the output list if so desired, and can be browsed first by double-clicking on the relevant line.) Output file
lists containing the file IDs of the BNC files which matched the criteria can be generated and fed into
concordancers such as WordSmith or MonoConc,34 which can use a list of filenames to specify a sub-corpus to which future queries are to be restricted. Note that with both BNC Indexer and BNC Web
Indexer, individual files can always be deleted from the output list if so desired, so users do not have to
accept the classification decisions wholesale but can vet individual texts before allowing them into a sub-corpus.
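The idea of vetting a result list and then saving the surviving file IDs can be sketched as follows. This is not how BNC Indexer or BNC Web Indexer is implemented; the function name, the output file name, and the one-ID-per-line layout are assumptions made for illustration, and the exact list format a given concordancer expects should be checked in its own documentation.

```python
from pathlib import Path
import tempfile


def write_file_list(text_ids, path, exclude=()):
    """Write the matched text IDs to a plain-text file, one per line,
    leaving out any IDs the user has chosen to deselect."""
    excluded = set(exclude)
    kept = [text_id for text_id in text_ids if text_id not in excluded]
    Path(path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return kept


# Suppose a query matched three texts and the user rejects one after browsing it.
matched = ["FAC", "FAD", "B2X"]
out_path = Path(tempfile.gettempdir()) / "subcorpus_files.txt"  # hypothetical file name
kept = write_file_list(matched, out_path, exclude=["B2X"])
```

Keeping the deselection step separate from the query means the classification is treated as a first pass that the user can always override.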
It is beyond the scope of the present article to give more practical instructions or examples on how to use
the BNC Index spreadsheet or the Indexer programs. Users will, in any case, surely find their own
favourite ways of doing things, or may visit the relevant web sites for further information.
THE USES OF GENRE
In this paper, I have examined the different usages of the terms genre, text type, register, domain, style,
and so forth. Which of these concepts is most useful for researchers, or for teachers to use in the context
of classroom concordancing? I suggest that it is fruitful to start by looking at genres (categories of texts),
and end up by generalising (through induction) about the existence of registers (linguistic characteristics)
or even "text types" in Biber's sense (categories of texts empirically based on linguistic characteristics).
The work by Carne (1996), Cope & Kalantzis (1993), Flowerdew (1993), Hopkins & Dudley-Evans
(1988), Hyland (1996), Lee (in press), McCarthy (1998a, 1998b), Thompson (in press), and Tribble
(1998, 2000), to name but a few, shows how a genre-based approach to analysing texts can yield
interesting linguistic insights and may be pedagogically rewarding as well. Thompson's paper, for
example, shows how genre-based cross-linguistic analyses of travel brochures and job advertisements can
reveal subtle, linguistically-coded differences in culture and point of view. Such genre analyses of
relatively small, focussed and manageable sets of texts are now possible with the help of the BNC Index,
opening up a rich resource for all kinds of learning and research activities. By searching for keywords in
the various database fields, teachers and researchers can now quickly find even such rare sub-genres as
postcards, lecture notes, shopping lists and school essays ("rare" in the sense that they were not included
in previous-generation general corpora and are hard to get hold of in machine-readable format even
nowadays).
The personal BNC Index project described here is an attempt at classifying the corpus texts into genres or
super-genres, and putting this and other types of information about the texts into a single, information-rich, user-friendly resource. This Index may be used to navigate through the mass of texts available. Users
can then see at once how many texts there are that match certain criteria, and the total number of words
they constitute. In this way, sub-corpora can then be easily created for specialised research or
teaching/learning activities (e.g., it is now easy to retrieve BNC texts for ESP lessons to do with law,
medicine, physics, engineering, computing, etc.).
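Seeing "at once" how many texts match and how many words they contain amounts to a count and a sum over the matching rows. A minimal sketch, assuming hypothetical field names and a hypothetical helper `summarise`; only the AE7 row uses figures given in this article, and the other two rows are invented for the example.

```python
# Only the AE7 row reflects figures given in this article (genre and
# word count); the other rows are invented purely for illustration.
ROWS = [
    {"id": "AE7", "genre": "W_non_ac_nat_science", "words": 36115},
    {"id": "X01", "genre": "W_non_ac_nat_science", "words": 20000},
    {"id": "X02", "genre": "W_ac_medicine", "words": 15000},
]


def summarise(rows, genre_prefix):
    """Return (number of texts, total words) for genres starting with the prefix."""
    hits = [row for row in rows if row["genre"].startswith(genre_prefix)]
    return len(hits), sum(row["words"] for row in hits)


n_texts, n_words = summarise(ROWS, "W_non_ac_")
```

A summary like this gives a quick check on whether a planned sub-corpus is large enough before any concordancing is done.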
Ultimately, one would wish that a deeper understanding of genres (their forms, structures, patterns) would
be a "transformative" exercise for all investigators. As Cranny-Francis (1993) says,
Genre is a category which enables the individual to construct critical texts; by
manipulating genre conventions to produce texts which engender [critical analysis.] It
also enables, therefore, the construction of a new, different consciousness …
A concept of genre allows the critic or analyst to explore [the] complex relationships in
which a text is involved, relationships which ultimately relate back to what a text means.
This is because what a text says and how it says it cannot be separated; this is
fundamental to our notion of genre. Because of this, genre provides the link between text
and context; between the formal and semantic properties of texts; between the text and
the intertextual, disciplinary and technological practices in which it is embedded. (pp.
111-113)
I hope that the disparate users and potential users of the BNC, whether researchers, teachers or students,
will find the genre-enhanced BNC Index useful for all kinds of linguistic enquiry, and that some of the
above transformative goals will be realised for them.
APPENDIX A
SPOKEN BNC Sampler: Missing or Unrepresentative Genres
• Consultations: medical (none)
• Consultations: legal (none)
• Classroom discourse (only 3 texts)
• Public debates (only 3 texts)
• Job interviews (none)
• Parliamentary debates (none)
• News broadcasts (none)
• Legal presentations (there are 2 legal cross-examinations, but no presentations, i.e., monologues)
• University lectures (none)
• Telephone conversations (no pure telephone conversations in the BNC as a whole)
• Sermons (only 1 text)
• Live sports discussions (none)
• TV/radio discussions (only 4 texts)
• TV documentaries (only 2 texts)
WRITTEN BNC Sampler: Missing or Unrepresentative Genres
• Academic prose: humanities (none)
• Academic prose: medicine (none)
• Academic prose: politics, law and education (only 2 texts on law, none on politics or education)
• Academic prose: natural sciences (nothing on chemistry, only 1 on biology & 3 on physics)
• Academic prose: social sciences (nothing on the core subject areas of sociology or social work, nor on linguistics, which is arguably a social science, even though it is often treated as a humanities subject)
• Academic prose: technology & engineering (nothing on engineering)
• Administrative prose (only 1 text)
• Advertisements (none)
• Broadsheets: the only broadsheet material included consisted entirely of foreign news, and only from the Guardian.
• Broadsheets: sports news (none)
• Broadsheets: editorials and letters (none)
• Broadsheets: society/cultural news (none)
• Broadsheets: business & money news (none)
• Broadsheets: reviews (none)
• Biographies (none)
• E-mail discussions (none)
• Essays: university (only 1 text)
• Essays: school (none)
• Fiction: Drama (only 1 text)
• Fiction: Poetry (only 2 texts)
• Fiction: Prose (insufficient texts, and only 1 short story)
• Parliamentary proceedings/Hansard (none)
• Instructional texts (none)
• Personal letters (none)
• Professional letters (none)
• News scripts (only 1 radio sports news script)
• Non-academic: humanities (only 2 texts)
• Non-academic: medicine (none)
• Non-academic: pure sciences (none)
• Non-academic: social sciences (2 rather odd texts, and 1 which possibly could be non-academic)
• Non-academic pure science material (i.e. popularisations of science texts: there were none of these in the Sampler)
• News scripts (classified as 'written-to-be-spoken' in the main BNC. None included in the Sampler)
• Official documents (only 1 text)
• Tabloid newspapers (only Today and East Anglian Daily Times, the latter of which is not really a tabloid, but a regional newspaper)
APPENDIX B
Information Fields and Possible Values in the BNC Index (each abbreviation/code is followed by its gloss in parentheses)

Field: Possible Values

Medium [Written texts only]:
book, m_pub (miscellaneous, published), m_unpub (miscellaneous unpublished), periodical (magazines, journals, etc.), to_be_spoken (written-to-be-spoken)

Domain:
S_cg_business (context-governed, business), S_cg_education (c-g, educational), S_cg_leisure (c-g, leisure), S_cg_public (c-g, public/institutional), S_Dem_AB/C1/C2/DE/Unc (spoken demographic classes for the casual conversation files; 'Unc' = 'unclassified'), W_app_science (applied science), W_arts, W_belief_thought (belief & thought), W_commerce (commerce & finance), W_imaginative (imaginative/creative), W_leisure (leisure), W_nat_science (natural sciences), W_soc_science (social sciences), W_world_affairs (world affairs).

Genre (70 in total):
[Spoken texts, 24 genres]
S_brdcast_discussn (TV or radio discussions), S_brdcast_documentary (TV documentaries), S_brdcast_news (TV or radio news broadcasts), S_classroom (non-tertiary classroom discourse), S_consult (mainly medical & legal consultations), S_conv (face-to-face spontaneous conversations), S_courtroom (legal presentations or debates), S_demonstratn ('live' demonstrations), S_interview (job interviews & other types), S_interview_oral_history (oral history interviews/narratives, some broadcast), S_lect_commerce (lectures on economics, commerce & finance), S_lect_humanities_arts (lectures on humanities and arts subjects), S_lect_nat_science (lectures on the natural sciences), S_lect_polit_law_edu (lectures on politics, law or education), S_lect_soc_science (lectures on the social & behavioural sciences), S_meeting (business or committee meetings), S_parliament (BNC-transcribed parliamentary speeches), S_pub_debate (public debates, discussions, meetings), S_sermon (religious sermons), S_speech_scripted (planned speeches), S_speech_unscripted (more or less unprepared speeches), S_sportslive ('live' sports commentaries and discussions), S_tutorial (university-level tutorials), S_unclassified (miscellaneous spoken genres).
[Written texts, 46 genres]
W_ac_humanities_arts (academic prose: humanities), W_ac_medicine (academic prose: medicine), W_ac_nat_science (academic prose: natural sciences), W_ac_polit_law_edu (academic prose: politics, laws, education), W_ac_soc_science (academic prose: social & behavioural sciences), W_ac_tech_engin (academic prose: technology, computing, engineering), W_admin (administrative and regulatory texts, in-house use), W_advert (print advertisements), W_biography (biographies/autobiographies), W_commerce (commerce & finance, economics), W_email (e-mail sports discussion list), W_essay_school (school essays), W_essay_univ (university essays), W_fict_drama, W_fict_poetry, W_fict_prose (drama, poetry and novels), W_hansard (Hansard/parliamentary proceedings), W_institut_doc (official/governmental documents/leaflets, company annual reports, etc.; excludes Hansard), W_instructional (instructional texts/DIY), W_letters_personal, W_letters_prof (personal and professional/business letters), W_misc (miscellaneous texts), W_news_script (TV autocue data), W_newsp_brdsht_nat_arts (broadsheet national newspapers: arts/cultural material), W_newsp_brdsht_nat_commerce (broadsheet national newspapers: commerce & finance), W_newsp_brdsht_nat_editorial (broadsheet national newspapers: personal & institutional editorials, & letters-to-the-editor), W_newsp_brdsht_nat_misc (broadsheet national newspapers: miscellaneous material), W_newsp_brdsht_nat_report (broadsheet national newspapers: home & foreign news reportage), W_newsp_brdsht_nat_science (broadsheet national newspapers: science material), W_newsp_brdsht_nat_social (broadsheet national newspapers: material on lifestyle, leisure, belief & thought), W_newsp_brdsht_nat_sports (broadsheet national newspapers: sports material), W_newsp_other_arts (regional and local newspapers), W_newsp_other_commerce, W_newsp_other_report, W_newsp_other_science, W_newsp_other_social, W_newsp_other_sports, W_newsp_tabloid (tabloid newspapers), W_non_ac_humanities_arts (non-academic/non-fiction: humanities), W_non_ac_medicine (non-academic: medical/health matters), W_non_ac_nat_science (non-academic: natural sciences), W_non_ac_polit_law_edu (non-academic: politics, law, education), W_non_ac_soc_science (non-academic: social & behavioural sciences), W_non_ac_tech_engin (non-academic: technology, computing, engineering), W_pop_lore (popular magazines), W_religion (religious texts, excluding philosophy).

Mode:
W (written), S (spoken)

Author age:
0-14 yrs (band 1), 15-24 yrs (band 2), 25-34 yrs (band 3), 35-44 yrs (band 4), 45-59 yrs (band 5), 60+ yrs (band 6), --- (unclassified)

Author sex:
Male, Female, Mixed, Unknown, --- (not applicable/available)

Author type:
Corporate, Multiple, Sole, Unknown/unclassified

Audience age:
child, teen, adult, --- (unclassified)

Audience sex:
male, female, mixed, --- (unclassified)

Audience level:
low (level 1), medium (level 2), high (level 3), --- (unclassified)

Sampling:
whole text (whl), beginning sample (beg), middle sample (mid), end sample (end), composite (cmp), unknown/not applicable (--).

Circulation Status (formerly "reception status"):
Low, Medium, High (blank for unclassified texts)
NOTES
1. In contrast, Nuyts (1988) uses "text type" in a rather idiosyncratic way to mean "a variety of written
text" (as opposed to "conversation type" for spoken texts). Many other people similarly use "text
type" in a rather loose way to mean "register" or "genre."
2. EAGLES is the Expert Advisory Group on Language Engineering Standards, an initiative set up by
the European Union to create common standards for research and development in speech and natural
language processing. At present, most EAGLES documents take the form of preliminary guidelines
from which it is hoped that standards will later emerge.
3. In Biber's (1989) article on text typology, the nature of his "internal criteria" is more clearly shown.
His "text types" are groupings of texts based on statistical clustering procedures which make use of
co-occurrence patterns of surface-level linguistic features.
4. Wikberg (1992, p. 248) calls these rhetorical types "discourse categories" (German Texttyp), as
opposed to "text types" (German Textsorte) which is equivalent to what I am here calling genres.
5. The GeM project at Stirling University illustrates an interesting new usage of genre. As it says on
their Web site, "The GeM project analyses expert knowledge of page design and layout to see how
visual resources are used in the creation of documents, both printed and electronic. The genre of a
page -- whether it's an encyclopaedia entry, a set of instructions, or a Web page -- plays a central role
in determining what graphical devices are chosen and how they are employed …. The overall aim of
the project is to deliver a model of genre [italics added], layout, and their relationship to
communicative purpose for the purposes of automatic generation of possible layouts across a range of
document types, paper and electronic."
6. This diagram is from Martin (in press), but a similar one may be found in Eggins & Martin (1997, p.
243).
7. On a more speculative note, we could perhaps borrow from the tagmemic/particle physics perspective
and talk in terms of particles (registers), waves (styles) and fields (genres). (Mike Hoey, personal
communication.)
8. Martin (1993, p. 121) uses the term "macro-genre" to mean roughly the same thing.
9. Also, face-to-face conversations do not, arguably, form a proper genre as such (cf. Swales, 1990).
However, for many research purposes, they form a coherent, useful super-genre.
10. Perhaps "religion" could also be considered a very broad content or topic label (?). In any case, this
exceptional category apparently came about due to the unique nature of the texts: the corpus
compilers note that the texts could "embrace any of the stylistic characteristics of [several other LOB
categories]," yet they all belonged together in some sense. All "committed religious writing" was
therefore put together under "Religion" (cf. Johansson, Leech, & Goodluck, 1978, p. 16).
11. As the EAGLES (1996) authors say, where there is a division into "factual" (informative) vs.
"fictional" (imaginative), then "to avoid controversy, religious works are given a separate category of
their own" (p. 8).
12. Available on the Web at ftp://ftp.itri.bton.ac.uk/pub/bnc/bib-dbase. Titles of files in this resource are
truncated to the first 80 characters, which limits its usefulness for some purposes.
13. The quote also contains an example of the term text types being used in a non-technical/loose fashion
to mean "types/varieties of text."
14. Kilgarriff's list includes only the first 80 characters or so of the title of each file, which means that some
titles are truncated (and so cannot reliably be searched by title), and author names (for the written texts) are not
included.
15. COPAC is an on-line system for unified access to the (combined) catalogues of some of the largest
university research libraries in the UK and Ireland. Keywords were manually copied from the Web
catalogue entries and put into a separate column in the BNC Index to allow researchers to search by
proper library keywords in addition to the keywords provided by the BNC compilers. These keywords
will greatly facilitate the identification of sub-genres, (sub-)topics, etc., by people who wish to have
finer sub-classifications for specific research purposes.
16. For an explanation of why only non-fiction works are given keywords, see note 28.
17. Note that for the demographic files (conversations) the Keywords field is empty for almost all the
files.
18. The somewhat confusing term reception status is used in the BNC Users' Reference Guide instead of
circulation status. Since it refers to the size of the readership or the circulation level (not the social
status of the text), I have changed the label to reflect this. Circulation status should be used with
caution, because it is relative to genre: A newspaper with "low" reception status may still have a lot
more readers than a "medium-reception" book of poetry or office memo. The field (Target) Audience
level, on the other hand, is an estimate (by the compilers) of the level of difficulty of the text, or the
amount of background knowledge of its subject matter which is assumed.
19. Note that Genre classifications (assigned by me) do not always agree with the Domain classifications
of the BNC compilers (i.e., the official domain classifications as given in the standard distribution of
the corpus).
20. This follows the new 4-way classification scheme employed in the BNC World Edition: alltim0 (--[unclassified]); alltim1 (1960-1974); alltim2 (1975-1984); alltim3 (1985-1994).
21. Using "audience level=high" will roughly filter out introductory textbooks and texts written for both
an academic and a more general audience.
22. Some of the genre names in the actual spreadsheet are further abbreviated for practical reasons.
23. Note that, in addition, there are four BNC files (EUY, HD6, KA2, KAV) which contain a roughly
even mix of poetry and prose. These have been placed under the "W_misc" genre.
24. The LOB corpus already has, of course, a modern-day correlative: the FLOB (Freiburg LOB) corpus.
My categorisations will allow the BNC to also be used in comparative studies using these corpora.
25. People who disagree with these classifications may use the "Keywords" and "Title" fields to find the
relevant files and re-classify them as desired.
26. The domain labels in the BNC Index are largely unchanged (i.e., they reflect the decisions of the
BNC compilers). Some egregious errors were corrected, however, and reported to the BNC project
for fixing in the new release, BNC World Edition.
27. The British Library Web site (http://www.bl.uk) offers some detailed information and links.
28. A British Library "Fiction Indexing Policy" document states, "When indexing non-fiction it is right to
attempt to express what the work as a whole is about, since it is usual for non-fiction to focus on one
or more specific topics. By contrast, a work of fiction is rarely 'about' a topic at all. Instead, most
works of fiction contain within them subjects as themes or settings. What they are 'about' is conveyed
in the story as a whole. It is only themes, settings and characters which can be picked out easily by
means of subject headings" (see http://www.bl.uk/services/bsds/nbs/marc/655polc.html).
29. As the EAGLES (1996) authors further point out, there are "alarming possibilities of double
classification [i.e., mixed genres] -- spy thriller, historical romance, etc."
30. From the document at http://www.bl.uk/services/bsds/nbs/marc/655list2.html, which also gives a full
listing of the literary sub-genres identified by the British Library.
31. The BNC Index spreadsheet, when ready, will be distributed initially at
http://members.xoom.com/davidlee00/corpus_resources.htm. Suggestions for hosting on other sites
are welcome.
32. Available at http://personal5.iddeo.es/tone/BNCIndexer. It is priced at 50 Euros for either an
individual or institutional licence (up to 15 users).
33. BNC Web Indexer is the result of a collaboration between Paul Rayson (UCREL, Lancaster
University) and myself. The URL will be announced on the CORPORA and CLLT (Corpus
Linguistics and Language Teaching) mailing lists when available.
34. Or using the Web-based concordancer for the BNC developed at Zürich, BNCweb, at
http://escorp.unizh.ch (restricted usage). The new version of SARA developed for the BNC World
Edition is also expected to have more sophisticated sub-corpus querying facilities.
ABOUT THE AUTHOR
David YW Lee recently completed his doctoral studies at Lancaster University and is currently a visiting
researcher and part-time tutor there. His PhD research involved applying Douglas Biber's
multidimensional (MD) methodology to fresh spoken and written data from the British National Corpus
(BNC) and a consequent critique of that factor-analysis-based methodology. At present, he is working on
publishing his findings as a book, and is writing various articles for journals.
E-mail: david_lee00@hotmail.com
REFERENCES
Atkins, S., Clear, J., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic
Computing, 7(1), 1-16.
Bhatia, V. (1993). Analysing genre: Language use in professional settings. London: Longman.
Biber, D. (1988). Variation across speech and writing. Cambridge, UK: Cambridge University Press.
Biber, D. (1989). A typology of English texts. Linguistics, 27(1), 3-43.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243-257.
Biber, D. (1994). An analytical framework for register studies. In D. Biber & E. Finegan (Eds.),
Sociolinguistic perspectives on register (pp. 31-56). New York: Oxford University Press.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge, UK:
Cambridge University Press.
Biber, D., & Finegan, E. (1986). An initial typology of text-types. In J. Aarts & W. Meijs (Eds.), Corpus
linguistics II (pp. 19-46). Amsterdam: Rodopi.
Biber, D., & Finegan, E. (1989). Drift and the evolution of English style: A history of three genres.
Language, 65, 487-517.
Burnard, L. (Ed.). (1995, April 25). The British national corpus users reference guide (SGML version,
First release with version 1.0 of BNC). Oxford, UK: Oxford University Computing Services.
Carne, C. (1996). Corpora, genre analysis and dissertation writing: An evaluation of the potential of
corpus-based techniques in the study of academic writing. In S. Botley, J. Glass, T. McEnery, & A.
Wilson (Eds.), Proceedings of teaching and language corpora 1996, UCREL Technical Papers Vol. 9
(pp. 127-137). Lancaster, UK: Lancaster University.
Cope, B., & Kalantzis, M. (1993). Introduction: How a genre approach to literacy can transform the way
writing is taught. In B. Cope & M. Kalantzis (Eds.), The powers of literacy: A genre approach to teaching
writing (pp. 1-21). London: Falmer Press.
Cope, B., & Kalantzis, M. (Eds.). (1993). The powers of literacy: A genre approach to teaching writing.
London: Falmer Press.
Couture, B. (1986). Effective ideation in written text: A functional approach to clarity and exigence. In B.
Couture (Ed.), Functional approaches to writing: Research perspectives (pp. 69-91). Norwood, NJ:
Ablex.
Cranny-Francis, A. (1993). Genre and gender: Feminist subversion of genre fiction and its implications
for cultural literacy. In B. Cope & M. Kalantzis (Eds.), The powers of literacy: A genre approach to
teaching writing (pp. 116-136). London: Falmer Press.
Crombie, W. (1985). Discourse and language learning: A relational approach to syllabus design.
Oxford, UK: Oxford University Press.
Crystal, D., & Davy, D. (1969). Investigating English style. London: Longman.
Crystal, D. (1991). A dictionary of linguistics and phonetics. Oxford, UK: Basil Blackwell.
Expert Advisory Group on Language Engineering Standards. (1996, June). Preliminary recommendations
on text typology. EAGLES Document EAG-TCWG-TTYP/P. [Available at
http://www.ilc.pi.cnr.it/EAGLES96/texttyp/texttyp.html]
Eggins, S., & Martin, J. R. (1997). Genres and registers of discourse. In T. van Dijk (Ed.), Discourse as
structure and process (pp. 230-256). London: Sage.
Faigley, L., & Meyer, P. (1983). Rhetorical theory and readers' classifications of text types. Text, 3, 305-325.
Fairclough, N. (1992). Discourse and social change. Cambridge, UK: Polity Press.
Fairclough, N. (2000). New labour, new language? London: Routledge.
Ferguson, C. (1994). Dialect, register and genre: Working assumptions about conventionalization. In D.
Biber & E. Finegan (Eds.), Sociolinguistic perspectives on register (pp. 15-30). New York: Oxford
University Press.
Finegan, E., & Biber, D. (1994). Register and social dialect variation: An integrated approach. In D. Biber
& E. Finegan (Eds.), Sociolinguistic perspectives on register (pp. 315-347). New York: Oxford
University Press.
Flowerdew, J. (1993). An educational or process approach to the teaching of professional genres. ELT Journal,
47(4), 305-316.
Grishman, R., & Kittredge, R. (Eds.). (1986). Analyzing language in restricted domains: Sublanguage
description and processing. Hillsdale, NJ: Lawrence Erlbaum.
Halliday, M. A. K., & Hasan, R. (1985). Language, context, and text: Aspects of language in a social-semiotic perspective. Oxford, UK: Oxford University Press.
Hammond, J., Burns, A., Joyce, H., Brosnan, D., & Gerot, L. (1992). English for social purposes: A
handbook for teachers of adult literacy. Sydney: National Centre for English Language Teaching and
Research, Macquarie University.
Hoey, M. (1983). On the surface of discourse. London: Allen and Unwin.
Hoey, M. (1986). Clause relations and the writer's communicative task. In B. Couture (Ed.), Functional
approaches to writing: Research perspectives (pp. 120-141). Norwood, NJ: Ablex.
Hopkins, A., & Dudley-Evans, T. (1988). A genre-based investigation of the discussion sections in
articles and dissertations. English for Specific Purposes, 7, 113-121.
Hyland, K. (1996). Talking to the academy: Forms of hedging in scientific research articles. Written
Communication, 13(2), 251-282.
Johansson, S., Leech, G., & Goodluck, H. (1978). Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers. Oslo: Department of English,
University of Oslo.
Joos, M. (1961). The five clocks. New York: Harcourt Brace & World.
Kennedy, G. (1998). An introduction to corpus linguistics. London: Longman.
Kittredge, R., & Lehrberger, J. (Eds.). (1982). Sublanguage: Studies of language in restricted semantic
domains. Berlin: Walter de Gruyter.
Kress, G. (1993). Genre as social process. In B. Cope & M. Kalantzis (Eds.), The powers of literacy: A
genre approach to teaching writing (pp. 22-37). London: Falmer Press.
Kress, G., & Hodge, R. (1979). Language as ideology. London: Routledge & Kegan Paul.
Lee, David Y. W. (2000). Modelling variation in spoken and written language: The multi-dimensional
approach revisited. Unpublished doctoral dissertation, Lancaster University.
Lee, David Y. W. (in press). Defining core vocabulary and tracking its distribution across spoken and
written genres: Evidence of a gradience of variation from the British National Corpus. Journal of English
Linguistics.
Martin, J. R. (in press). Cohesion and texture. Manuscript submitted for publication.
Martin, J. R. (1993). A contextual theory of language. In B. Cope & M. Kalantzis (Eds.), The powers
of literacy: A genre approach to teaching writing (pp. 116-136). London: Falmer Press.
McCarthy, M. (1998a). Taming the spoken language: Genre theory and pedagogy. The Language
Teacher, 22(9). Retrieved June 20, 2000 from the World Wide Web:
http://langue.hyper.chubu.ac.jp/jalt/pub/tlt/98/sep/mccarthy.html.
McCarthy, M. (1998b). Spoken language and applied linguistics. Cambridge, UK: Cambridge University
Press.
Meyer, B. (1975). The organisation of prose and its effects on recall. New York: North Holland.
Nakamura, J. (1986). Classification of English texts by means of Hayashi's Quantification Method Type
III. Journal of Cultural and Social Science, 21, 71-86.
Nakamura, J. (1987). Notes on the use of Hayashi's Quantification Method Type III for classifying
English texts. Journal of Cultural and Social Science, 22, 127-145.
Nakamura, J. (1992). Hayashi's Quantification Method Type III: A tool for determining text typology in
large corpora. An annex to a general report on annotation tools of the NERC Report. Unpublished
manuscript.
Nakamura, J. (1993). Statistical methods and large corpora: A new tool for describing text types. In M.
Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology: In honour of John Sinclair (pp. 293-312). London: John Benjamins.
Nuyts, J. (1988). IPrA survey of research in progress. Wilrijk, Belgium: International Pragmatics
Association.
Paltridge, B. (1995). Working with genre: A pragmatic perspective. Journal of Pragmatics, 23, 393-406.
Paltridge, B. (1996). Genre, text type, and the language classroom. ELT Journal, 50(3), 237-243.
Paltridge, B. (1997). Genre, frames and writing in research settings. Amsterdam: John Benjamins.
Phillips, M. A. (1983). Lexical macrostructure in science text. Unpublished doctoral dissertation,
University of Birmingham, UK.
Rosch, E. (1973a). On the internal structure of perceptual and semantic categories. In T. E. Moore (Ed.),
Cognitive development and the acquisition of language (pp. 111-144). New York: Academic Press.
Rosch, E. (1973b). Natural categories. Cognitive Psychology, 4, 328-350.
Rosch, E. (1978). Principles of categorisation. In E. Rosch & B. Lloyd (Eds.), Cognition and
categorisation. Hillsdale, NJ: Lawrence Erlbaum.
Sampson, J. (1997). "Genre," "style" and "register". Sources of confusion? Revue Belge de Philologie et
d'Histoire, 75(3), 699-708.
Steen, G. (1999). Genres of discourse and the definition of literature. Discourse Processes, 28, 109-120.
Stubbs, M. (1996). Text and corpus analysis: Computer assisted studies of language and culture. Oxford,
UK: Blackwell.
Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge, UK:
Cambridge University Press.
Taylor, J. R. (1989). Linguistic categorisation: Prototypes in linguistic theory. Oxford, UK: Clarendon.
Thompson, G. (in press). Corpus, comparison, culture: Doing the same things differently in different
cultures. In M. Ghadessy, R. Roseberry, & A. Henry (Eds.), The use of small corpora in language
teaching. Manuscript submitted for publication.
Tribble, C. (1998). Writing difficult texts. Unpublished doctoral dissertation, Lancaster University.
Tribble, C. (2000). Genres, keywords, teaching: Towards a pedagogic account of the language of Project
Proposals. In L. Burnard & T. McEnery (Eds.), Rethinking language pedagogy from a corpus
perspective: Papers from the third international conference on teaching and language corpora (Lodz
Studies in Language; pp. 75-90). Hamburg: Peter Lang. Retrieved June 20, 2000 from the World Wide
Web: http://ourworld.compuserve.com/homepages/Christopher_Tribble/Genre.htm.
van Dijk, T. (Ed.). (1985). Handbook of discourse analysis. London: Academic Press.
Wikberg, K. (1992). Discourse category and text type classification: Procedural discourse in the Brown
and the LOB corpora. In G. Leitner (Ed.), New directions in English language corpora:
Methodology, results, software developments (pp. 247-261). Berlin: Mouton de Gruyter.
Language Learning & Technology
http://llt.msu.edu/vol5num3/aston/
September 2001, Vol. 5, Num. 3
pp. 73-76
TEXT CATEGORIES AND CORPUS USERS:
A RESPONSE TO DAVID LEE
Guy Aston
University of Bologna, Italy
In designing any corpus, it is necessary to decide what types of texts to include, and how many of
each type. (I use the term "text type" as a neutral one which does not imply any specific theoretical
stance.) The British National Corpus (Burnard, 1995) made an initial division into written texts and
spoken ones (i.e., transcripts of recordings), and within each of these macrocategories, employed
further categorisations and subcategorisations. For the spoken component, a first distinction was
between "demographic" (conversations: 153 texts) versus "context-governed" (speech recorded in
particular types of setting: 757 texts), and the "context-governed" component was further divided
according to the nature of the setting (educational/informative; business; public/institutional; leisure:
from 131 to 262 texts in each), paralleled by a monologue/dialogue distinction (40%/60%). For the
written component, two principal parallel categorisations were used: "domain" (i.e., subject matter,
divided into nine classes, viz., imaginative; arts; belief and thought; commerce; leisure; natural
science; applied science; social science; world affairs: from 146 to 527 texts in each) and "medium"
(five classes, viz., book; periodical; miscellaneous published; miscellaneous unpublished; to-be-spoken: from 35 to
1,414 texts in each). All figures refer to the BNC World Edition (2001).
Text categorisations, as Lee notes, are generally based on "external" criteria -- where/when the text
was produced, by/for whom, what it is about -- rather than "internal" ones based on its linguistic
characteristics. The categorisations used in corpus design tend to be broad rather than delicate, since
what corpus designers want to do is to enable users to generalize about and compare different
categories. To generalize with any confidence, each category must contain a substantial number of
different texts, so that no one text exerts an undue influence on that category (early corpora such as
Brown and LOB, which were relatively small, got around this problem by including very short samples
from a large number of texts); and each category must contain a wide variety of different texts, so
that no one subcategory exerts an undue influence on that category as a whole (Biber, 1993): the
greater the variance within a category, the more texts will be needed in order to document that
variance. Thus, it may be decided to include roughly equal numbers of texts from different parts of
the country, by authors of different sexes/ages or from different types of settings. Within the BNC
"context-governed" component, for instance, the "educational/informative" category was designed
to include lectures, talks, classroom interaction, and news commentaries, drawing these from
different types of institutions in different areas and with a wide range of speakers and topics.
Since corpora cannot be infinite, the delicacy of the categorisations to be employed is largely
determined by practical considerations. The BNC, which contains just over 4,000 texts, uses a
framework which guarantees at least 100 texts in most principal categories. You may or may not like
the categories chosen, but the corpus arguably allows you to generalize about these categories -- about
spoken and written texts, the nine different domains of written texts, the four different domains of
"context-governed" spoken texts, and so forth -- with reasonable certainty that findings will not be
unduly biased by any particular text or any particular subcategory of texts. These categories are
indicated in the headers to individual texts as attributes of the <catRef> element, so that queries
can be restricted to a particular category or combination of categories. A number of errors
of categorisation in the first release of the BNC have been corrected in the World Edition (2001).
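To make the idea of restricting a query by header category concrete, here is a minimal Python sketch. It is not the SARA mechanism: the header layout, the `target` attribute holding space-separated codes, the codes themselves, and the text ids are all invented for illustration, and the real BNC headers are considerably richer.

```python
import xml.etree.ElementTree as ET

# Invented, simplified headers (assumption): one <catRef> per text,
# with category codes listed in a space-separated "target" attribute.
SAMPLE_HEADERS = {
    "T01": '<header><catRef target="wridom1 wrimed1"/></header>',
    "T02": '<header><catRef target="wridom9 wrimed2"/></header>',
    "T03": '<header><catRef target="wridom1 wrimed2"/></header>',
}


def categories(header_xml):
    """Return the set of category codes listed on the <catRef> element."""
    root = ET.fromstring(header_xml)
    cat_ref = root.find(".//catRef")
    if cat_ref is None:
        return set()
    return set(cat_ref.get("target", "").split())


def select_texts(headers, required):
    """Return the ids of texts whose header carries every required code."""
    wanted = set(required)
    return sorted(
        text_id for text_id, xml in headers.items()
        if wanted <= categories(xml)
    )


# One category, then a combination of two categories.
print(select_texts(SAMPLE_HEADERS, ["wridom1"]))
print(select_texts(SAMPLE_HEADERS, ["wridom1", "wrimed2"]))
```

Requiring several codes at once corresponds to the "combination of categories" mentioned above; the subset test keeps only texts that carry all of them.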
Users may, however, want to employ different categorisations from those employed by the corpus
designers. David Lee provides one such categorisation, and the latest version of the SARA software
(SARA98; Dodd, 2000) allows users to create their own subcorpora from the full BNC using his, or
other, categories (Aston, in press). Users should, however, be aware that such categories may be
poorly represented in the corpus, both numerically and in terms of their variance. The more delicate
the categorisation employed, the more likely it is that this will be the case (Sinclair, 1991) -- but
even where a categorisation appears relatively broad, not all its members may be adequately
documented. Thus Lee divides the BNC's imaginative written texts into novels, poems, and drama.
However, there are only two texts in the BNC which fall into his drama category, so it would be
pretty unwise to generalize about drama on their evidence. Why aren't there more? Some drama was
included in the BNC in order to capture variance within the category of imaginative writing, but the
quantity of drama is the result of decisions concerning the relative weight of drama in this category,
just as the quantity of imaginative writing in the corpus is the result of decisions concerning the
weight of imaginative writing in contemporary text production and reception as a whole. To include
more drama would have either meant changing these design decisions or increasing the size of the
corpus by an analogous factor.
All this means that if you want to generalize about contemporary British drama (or indeed about
many of Lee's other text categories), you would do much better to compile your own
specialized corpus (though you may want to compare your findings with the BNC in order to see
whether the features you identify are specific to the text-type in question). But you can't really
complain about the BNC just because it doesn't contain more texts in a particular specialized category
you happen to be interested in, whether this be e-mails, lectures, or business letters. That isn't what
general mixed reference corpora are designed for, and you would clearly do better to start from a text
archive instead, or from the Web.
But isn't a categorisation like Lee's what many users would like, and shouldn't the BNC have used such
a categorisation to determine its composition? The main problem with Lee's approach, based on
what he considers "prototypical" genres, is that it does not consider either the weight of these genres
in the culture (in particular their frequency of reception and production), or the variance to be found
within them. Lee appears to think that the BNC really ought to have provided representative
samples for all 70 of his mutually-exclusive categories. But in order to include a minimum of, say, 50
texts in each category, either the corpus would have to have been very much larger, or else it would
have had to weight these categories more or less equally (70 x 50 = 3,500: the BNC contains just
over 4,000 texts). Lee's three genres of imaginative writing (novels, poetry, and drama) would hardly
seem to have the same frequency and variance within British culture, where much more fiction is read
and published than poetry or drama, and, I suspect, of many more different kinds. So why should the
corpus include the same amount of each?
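The arithmetic behind this argument can be spelled out in a few lines of Python. The figures are those quoted in this response; the corpus size is rounded to 4,000 as an assumption ("just over 4,000 texts"), and the constant names are mine. The sketch compares the two options: weighting all 70 categories equally, or keeping the existing proportions and enlarging the whole corpus until the two drama texts become 50.

```python
import math

BNC_TEXTS = 4000       # assumption: "just over 4,000 texts", rounded down
LEE_CATEGORIES = 70    # number of Lee's genre categories
MIN_TEXTS = 50         # illustrative minimum per category
DRAMA_TEXTS = 2        # drama texts reported for the BNC

# Option 1: weight every category equally.
equal_weight_total = LEE_CATEGORIES * MIN_TEXTS

# Option 2: keep the present proportions and scale the whole corpus
# until the drama category reaches the minimum.
scale = MIN_TEXTS / DRAMA_TEXTS
scaled_total = math.ceil(BNC_TEXTS * scale)

print(equal_weight_total)   # 3500
print(scale)                # 25.0
print(scaled_total)         # 100000
```

Under equal weighting almost the whole corpus would be used up by the minimum quotas; under proportional weighting the corpus would need to be some 25 times its present size.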
Or take prayers. For some reason, prayers aren't one of Lee's genres, though I would have thought
them as good a candidate for prototypical status as sermons, which are. There is only one text of
prayers in the BNC, falling into the to-be-spoken written medium category (and into the belief and
thought domain). The same to-be-spoken category, on the other hand, contains no fewer than 32
texts of television and radio news scripts. This disproportion seems fair enough when judged by
production and reception standards -- news broadcasts play a much bigger part in British text
production and reception than prayers do, alas. Yet, Lee's argument would suggest that they ought to
have similar weighting, insofar as they have similar prototypical status (or else that the corpus should
be much, much larger).
Lee's criticisms seem particularly unwarranted as far as the BNC Sampler (1999) is concerned (for
the record, this contains no prayers, only one drama text, and only one news script). The Sampler -- which, like sampler music CDs, was designed to give a "taste" of the full BNC rather than to mirror
its composition in detail -- consists of 184 texts for a total of roughly 2 million words, half speech
and half writing. Lee complains that many of his categories are totally absent from the Sampler. But
with this total number of texts, there is no way in which the Sampler could have adequately
documented 70 different categories while allowing reasonable generalizations at more macroscopic
levels, such as speech versus writing. Would Lee really have wanted the number of university lectures
on science in the Sampler to equal the number of casual conversations? Only, I think, if he were not
interested in spoken texts in general, but particularly interested in science lectures, of which there
would still not have been enough to say much about them.
A further problem with Lee's genre labels is that they may not match entire texts anyway. As he
himself notes, virtually any single text may be analysed as composed of a number of different
subtexts which can be assigned to different genres. For instance, there are 30 texts in the BNC
consisting exclusively of poems, which Lee categorises as W_fict_poetry. However, we find much more
poetry occurring in texts belonging to other categories (as quotations, or when the hero of a novel
breaks forth into song, etc.), 3,048 poems in 410 texts overall. Lee's categorisation is not going to
be of much help to the user who wants to study poetry using the BNC. Rather than just those texts
classed by Lee as poetry, s/he would be better advised to choose all those parts of the corpus texts
which are tagged as <poem> elements in the markup (an easy task using SARA; Aston & Burnard,
1998).
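The difference between selecting by text category and selecting by markup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy document below is invented, its element and attribute names (other than <poem> and the label W_fict_poetry, which come from the discussion above) are hypothetical, and it does not reproduce SARA or the actual BNC file format.

```python
import xml.etree.ElementTree as ET

# Invented toy markup. Only the <poem> element and the W_fict_poetry label
# are taken from the article; everything else is a hypothetical stand-in.
TOY_CORPUS = """
<corpus>
  <text id="T1" category="W_fict_poetry">
    <poem>First invented poem.</poem>
    <poem>Second invented poem.</poem>
  </text>
  <text id="T2" category="hypothetical_novel">
    <p>The hero broke into song.</p>
    <poem>Invented song inside a novel.</poem>
  </text>
  <text id="T3" category="hypothetical_other">
    <p>No verse here.</p>
  </text>
</corpus>
"""

root = ET.fromstring(TOY_CORPUS)

# Approach A: trust the text-level category label only.
by_category = [
    poem.text
    for text in root.iter("text")
    if text.get("category") == "W_fict_poetry"
    for poem in text.iter("poem")
]

# Approach B: collect every <poem> element, whatever text it sits in.
by_markup = [poem.text for poem in root.iter("poem")]
texts_with_poems = {
    text.get("id") for text in root.iter("text") if text.find(".//poem") is not None
}

print(len(by_category), "poems found via the category label")
print(len(by_markup), "poems found via <poem> markup, in", len(texts_with_poems), "texts")
```

In the toy data, the markup-based selection also picks up the song embedded in the novel, which the category-based selection misses.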
With this last caveat in mind, where Lee does have a point is from what Gavioli (2001; Gavioli &
Aston, 2001) calls an example rather than a sample perspective. Corpora like the BNC are designed
to provide sample data from which to infer generalisations about the language as a whole, or about
particular broad categories of texts, concerning frequencies of occurrence and co-occurrence
(collocation, colligation, and so on). However, it is also possible to use corpora -- at one's peril -- as
text archives (Atkins, Clear, & Ostler, 1992) from which to retrieve examples of a particular text-type. If I am a teacher of religious education, and what I need for my lesson tomorrow is some
prayers to use with my class -- why not look in the BNC? Since prayers are not a category used in the
BNC text categorisation, to find candidate texts I will have to hope that either the text or its header
(perhaps the text title, or its keywords) contains a form of the lemma prayer or a related word or
phrase (perhaps Amen). A more detailed categorisation of the corpus texts, particularly one which
uses prototypical "folk" genre labels, could be very useful as an aid to find examples of this kind.
This could also be a useful approach when we want to investigate a particular "user category" of
texts. Not, I repeat, in order to generalize about that category, since the corpus cannot be relied upon
to document it adequately, but in order to find examples from which to generate hypotheses. As
mentioned earlier, there are 32 texts in the BNC containing radio or television news scripts
(W_news_script in Lee's taxonomy). Given their limited number, and the fact that they come from a
limited range of sources (two broadcasting stations), it would clearly be unwise to generalize from
these to the genre of broadcast news scripts tout court. What they may provide, however, is a source
of hypotheses about this genre -- hypotheses which must clearly be tested against a different corpus,
one which has been constructed to comprise an adequately-sized sample of texts of this type, and
which satisfactorily covers the variance within this category.
From an "example" perspective, the more descriptive categorizations that are provided within a
corpus the better. For this reason, the incorporation of Lee's categories in the BNC World Edition
(2001) is a very welcome development. For each text, his categorization forms the content of a
<classCode> element in the header of the text (with the attribute scheme="DLee"), and using
SARA98 it is possible to restrict searches to one or more of his categories, and to define
corresponding subcorpora -- subcorpora which can of course be adjusted if the user does not agree
with Lee's attribution of particular texts to particular categories. I have, for instance, used a
subcorpus of lectures from the BNC with a group of trainee conference interpreters who will need to
work with academic monologue, selecting all those texts which Lee categorizes as lectures and then
discarding two or three which seemed too informal and interactive for my purposes. There are nearly
50 lectures overall, on a wide range of topics and by a fair variety of lecturers, and it has proved a
useful collection from which to retrieve examples of particular discourse phenomena for teaching
purposes and from which to generate hypotheses about the ways that lectures seem to work. Useful,
that is, as long as you don't try to interpret it as a "representative sample" allowing reliable
generalizations about lectures as a genre.
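As a rough illustration of building and then adjusting a subcorpus from these category codes, the following Python sketch filters invented text headers on their <classCode scheme="DLee"> value and then discards texts the user judges unsuitable. The header layout, the text identifiers, and the function name select_subcorpus are all hypothetical; only the element name, the scheme attribute, and the two category labels come from the discussion above, and this is not how SARA98 itself is operated.

```python
import xml.etree.ElementTree as ET

# Invented headers. Only <classCode scheme="DLee"> and the two category
# labels come from the article; ids and layout are hypothetical.
TOY_HEADERS = """
<headers>
  <header id="X01"><classCode scheme="DLee">W_news_script</classCode></header>
  <header id="X02"><classCode scheme="DLee">W_fict_poetry</classCode></header>
  <header id="X03"><classCode scheme="DLee">W_news_script</classCode></header>
  <header id="X04"><classCode scheme="other">W_news_script</classCode></header>
</headers>
"""

def select_subcorpus(xml_text, wanted, discard=()):
    """Return ids of texts whose DLee class code is in `wanted`, minus `discard`."""
    root = ET.fromstring(xml_text)
    chosen = []
    for header in root.iter("header"):
        for code in header.iter("classCode"):
            # Ignore class codes that belong to other classification schemes.
            if code.get("scheme") == "DLee" and code.text in wanted:
                if header.get("id") not in discard:
                    chosen.append(header.get("id"))
    return chosen

# Step 1: take every text that carries the wanted category.
news = select_subcorpus(TOY_HEADERS, {"W_news_script"})
# Step 2: drop texts the user does not want, as with the discarded lectures.
adjusted = select_subcorpus(TOY_HEADERS, {"W_news_script"}, discard={"X03"})

print(news)
print(adjusted)
```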
ABOUT THE AUTHOR
Guy Aston is Professor of English Linguistics in the School of Modern Languages for Interpreters and
Translators at the University of Bologna, Italy. His main research interests concern the uses of
corpora in language learning and in translation.
E-mail: guy@sslmit.unibo.it
REFERENCES
Aston, G. (in press). The learner as corpus designer. In B. Kettemann (Ed.), Teaching and language
corpora 4 (provisional title). Amsterdam: Rodopi.
Aston, G., & Burnard, L. (1998). The BNC handbook: Exploring the British National Corpus with
SARA. Edinburgh, UK: Edinburgh University Press.
Atkins, S., Clear, J., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing,
7(1), 1-16.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243-257.
The BNC Sampler. (1999). Oxford, UK: Oxford University Computing Services.
The BNC World Edition. (2001). Oxford, UK: Oxford University Computing Services.
Burnard, L. (1995). Users reference guide for the British National Corpus. Oxford, UK: Oxford
University Computing Services.
Dodd, A. (2000). SARA98. Oxford, UK: Oxford University Computing Services.
Gavioli, L. (2001). The learner as researcher: Introducing corpus concordancing in the classroom. In
G. Aston (Ed.), Learning with corpora (pp. 108-137). Houston, TX: Athelstan.
Gavioli, L., & Aston, G. (2001). Enriching reality: Language corpora in language pedagogy. ELT
Journal, 55(3), 238-246.
Sinclair, J. McH. (1991). Corpus, concordance, collocation. Oxford, UK: Oxford University Press.
Language Learning & Technology
http://llt.msu.edu/vol5num3/kennedymiceli/
September 2001, Vol. 5, Num. 3
pp. 77-90
AN EVALUATION OF INTERMEDIATE STUDENTS' APPROACHES
TO CORPUS INVESTIGATION
Claire Kennedy and Tiziana Miceli
Griffith University, Brisbane
ABSTRACT
This paper reports on our experience in using a corpus of our own compilation,
Contemporary Written Italian Corpus (CWIC), in teaching intermediate students at
Griffith University in Australia. After an overview of the corpus design and the training
approach adopted, we focus on our initial evaluation of the effectiveness of the students'
investigations.
Much has been written on what can be done with corpora in language learning: what
kinds of discoveries can be made with different types of corpora. There is relatively little
on how learners actually go about investigations. Since we intend for our students to
progress from classroom use to independent work as a result of using a Web-based
version of CWIC, we have been seeking to understand how successful they are at
extracting information from this corpus in the absence of a teacher. Our initial study
highlighted the complexity of the process and the specialized skills required. We found
that lack of rigor in observation and reasoning contributed greatly to the problems that
arose, as did ignorance of common pitfalls and techniques for avoiding them. We,
therefore, conclude the paper with an outline of proposed changes to our apprenticeship
program, aimed at better equipping the students as "corpus researchers."
INTRODUCTION
Much of the literature on the use of corpora in language teaching relates to courses for advanced and
highly motivated students of English for Specific/Academic Purposes (e.g., Johns, 1988, 1991a, 1991b;
Levy, 1992; Mparutsa, Love, & Morrison, 1991; Stevens, 1991; Tribble, 1991) or translation (e.g., Aston,
Gavioli, & Zanettin, 1998; Bernardini, 1998; Gavioli, 1996). So, when contemplating the introduction of
work with corpora into the undergraduate Italian program at Griffith University in Australia, we were
aware of the need to tailor the experience to quite a different target group, for whom Italian is usually a
foreign rather than a second language and whose intentions for its use are less ambitious. In other
contexts, our students might be regarded as reaching intermediate or higher-intermediate levels.
Our aim was to provide these students with a corpus to use primarily as a reference resource while
writing. In view of their proficiency level and the types of writing tasks in which they engage, we sought
a corpus that would supply models of personal writing on everyday topics. At the time, the only corpus of
contemporary written Italian available to us was a collection of newspaper material,1 so we first resolved
to create our own corpus, which we have named CWIC, or Contemporary Written Italian Corpus.
Second, we decided to initiate the students into corpus use in a gradual and guided manner and,
third, to attempt to evaluate the effectiveness of their work with CWIC as soon as possible. This paper
discusses the implementation of those three decisions, with the focus on an initial evaluation exercise and
its implications for our approach to training students.
Copyright 2001, ISSN 1094-3501
The CWIC Project: A Corpus for our Teaching and Learning Context2
Most of our students begin their university experience with no prior knowledge of Italian and can attend a
maximum of 400 contact hours during their three years in our program. Additionally, they are not usually
able to spend time in Italy during their studies nor are there many local opportunities for immersion. We
estimate that, on average, they graduate with "basic vocational proficiency" in reading and listening,
on the scale used in Australia (Wylie & Ingram, 1999), while their ratings in speaking and writing are
lower, somewhere between "basic social proficiency" and "basic vocational proficiency."
In their second year, the students begin intensive writing practice with letters and diaries, some creative
writing and informative pieces based on their own experience. In their third year, the work is more
academic in the sense that they bring their analytical skills to bear on the topics. The writing tasks are
defined as commentaries, reviews or short essays, and treat aspects of the novels and films studied or
television news items and newspaper articles.
In designing CWIC for this context, we were informed by various reports on the merits of small corpora
for language learners, especially Tribble's advice that "the most useful corpus for learners … is the one
which offers a collection of expert performances in genres which have relevance to the needs and interests
of the learners" (1997, p. 3) and Aston's recommendation of corpora restricted to familiar text types and
topics (1997, p. 62). We envisaged CWIC as complementing the newspaper corpus by providing models
of texts by non-professional writers, including personal correspondence, although we chose to include
some journalistic writing as well. We refined our general selection criterion of contemporary written
usage to the following: short, written texts of specific text types (see Table 1), produced since 1990, by
adult native speakers of Italian using non-specialist language.
Table 1. Text types included in CWIC
By non-professional writers:
    private letters
    business and official letters
    private email messages
    business and official email messages
    email messages to mailing lists
    letters to experts in magazine columns
By professional writers:
    experts' responses
    articles in regular magazine columns
    film reviews
Within the constraints of physical access to texts and the feasibility of obtaining permission to use them,
our selection has been motivated by the desire to include a range of topics that our students might find
interesting or relevant, in texts likely to be comprehensible. The email lists and magazine columns are a
valuable source of material on a wide range of themes.3 Our interest in content stems from the expectation
that students will come to appreciate the corpus not only as raw material for concordances and frequency
lists but also as a database of whole texts, which can be interesting to browse through collectively or read
individually.
At the time of writing this article, we have approximately 570,000 words, in 2,200 texts by 930 different
authors.4 While we make no claims regarding representativeness of language in general, we can say that
CWIC provides models of expert performances in several of the text types that our students encounter and
are required to produce, during as well as after their studies. It also offers a wealth of appropriate
language that can be used in other writing tasks such as creative writing and essays.
The Students' Apprenticeship
Since Johns (1988, p. 24) raised the need for learners to develop strategies of observation for extracting
information from the data, many teachers experimenting with concordancing in the classroom have
favored a gradual and guided approach (Johns, 1991b, p. 31; Stevens, 1991, p. 39; Tribble & Jones, 1997,
p. 58). Guiding learners through a series of preliminary concordance-based activities has been presented
as a way of both familiarizing them with various types of investigations that can be conducted and
stimulating this development of appropriate learning strategies through practice (Turnbull & Burston,
1998, p. 18).
We too opted for an "apprenticeship" approach to training the students, intended to promote learning by
example and by experience. We began with the second-year cohort, in a subject that includes a weekly
two-hour writing workshop. For most of the training we used a sub-corpus of 50,000 words containing
texts of each type, so the students could become familiar with the corpus without facing vast arrays of
examples. The activities were initially carried out step by step, with the teacher giving directions through
a series of leading questions, sometimes calling attention to particular examples. The students worked in
pairs or small groups and reported back to the rest of the class.
Interrogation of the corpus was not presented as an end in itself but rather as an integral part of the writing
and grammar work being undertaken. There is considerable attention to morphology and syntax in the
subject, since it is at intermediate level. We started concordancing activities in that context, by examining
verb constructions with direct and indirect objects as well as the behavior and meaning of certain
conjunctions and pronouns.
After the first few sessions, we began to encourage the students to use the corpus while revising their own
written work. Periodically, we presented the class with anonymous sample sentences from the previous
week's writing and worked with them on ways of using the corpus to make corrections. In this way, they
practiced formulating questions, such as "Should we use infine or finalmente here?", and devising
appropriate searches. When marking their work, we pointed out where they might be able to make
corrections themselves, with reference to the corpus. This meant dedicating some class time to individual
problem-solving work, with the teacher circulating to assist as needed.
Finally, we presented applications of the corpus in composing and in pre-writing work, for what we call
"treasure-hunting": finding models of ways to express things. Several such activities were conducted with
a sub-corpus of personal letters. The students first browsed freely through several letters, observing
typical opening and closing sequences. Then, they looked for ways of expressing certain functions, such
as apologizing for not writing sooner, thanking someone for a previous letter, or giving information on
chosen topics such as work, family, or exams. They did this both by skimming sequentially and by
searching on words they thought might be present. For example, ricevere produced the expression Non
sai che piacere mi ha fatto ricevere tue notizie (You don't know how pleased I was to receive your news)
and vita turned up La mia vita sentimentale è veramente uno schifo (My love life is truly lousy). The
students also examined frequency lists for combinations of three or four words, which brought to light a
host of useful sequences, such as Non vedo l'ora (I can't wait), Ci sono novità? (What's new?) and al più
presto (as soon as possible). These proved to be interesting and entertaining to the students, not only as
alternatives to overused expressions, but also as triggers for further searches.
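A frequency list for three-word combinations of the kind described above can be sketched in Python as follows. This is a minimal illustration, not the reporting function of the software used in class: the three "letters" are invented (they merely reuse the expressions quoted above), and the tokenization rule, which keeps forms such as l'ora together, is an assumption.

```python
from collections import Counter
import re

# Invented mini "letters"; the phrases echo those quoted in the text.
TOY_LETTERS = [
    "Non vedo l'ora di rivederti. Scrivimi al più presto!",
    "Ci sono novità? Non vedo l'ora di sentirti.",
    "Rispondimi al più presto. Ci sono novità?",
]

def tokenize(text):
    # Lowercase; keep runs of letters (accented ones included) and treat
    # an apostrophe between letters as part of the word (assumption).
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())

def ngrams(tokens, n):
    # All sequences of n consecutive tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(texts, n):
    counts = Counter()
    for text in texts:
        # Split on sentence punctuation so sequences do not cross sentences.
        for sentence in re.split(r"[.!?]+", text):
            counts.update(ngrams(tokenize(sentence), n))
    return counts

freq = ngram_frequencies(TOY_LETTERS, 3)
for phrase, count in freq.most_common(4):
    print(count, phrase)
```

Sequences that recur across letters rise to the top of the list, which is what makes such lists a convenient starting point for further searches.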
Neither in problem-solving nor in treasure-hunting work did we seek to engage the students in free
exploration without a predetermined aim. There was always a defined goal: to find out how to phrase
something specific in a given text. However, some experimented with "serendipity learning" (Johns,
1988, p. 21) during treasure-hunting activities and we encouraged them to continue to do so in their own
time.
Overall, we viewed this introductory semester as a time for preparing students for future work in
independent mode with larger corpora outside the classroom. Until now, corpus interrogation has been
performed using the text database software DBT3 Database Testuale (Picchi, 1997), which is installed in
our laboratories. We believe that DBT has a friendly and intuitive user interface and that it offers an
appropriate range of functions, including concordancing on single words or expressions (which can be
quite complex), labeling of examples to identify sources, and sorting. Moreover, the length of context
displayed for each example is configurable, and clicking on an example expands the context to full screen. There is also a browser for viewing whole texts and the battery of reporting functions includes
frequency lists. Soon, however, students will have access to CWIC from home, since we are currently
working to transfer it to a Web platform, with its own searching software, offering functionality and tools
similar to DBT. In 2001, we will be involving our students in a pilot exercise using CWIC on the Web.
A total of 7 class contact hours out of a total of 26 hours in the writing workshop strand of the subject
were dedicated to CWIC during the semester. The students also worked with the corpus on assignment
tasks for a few hours outside of class time. Only 3 of the 17 students who completed course evaluation
questionnaires said that the amount of time spent was disproportionate to the usefulness of the exercise. In
the future, we intend to examine more closely the relationship between the time invested in
training in concordancing work and the benefits attributable to the mastery of this type of reference tool.
The questionnaires, combined with class discussions and individual interviews, were intended to draw out
students' perceptions of certain aspects of the corpus induction experience. Because these findings are the
subject of a separate study, we will only mention some of the main points here. On the positive side, most
students reported that work with the corpus helped them to better understand Italian grammatical structure
and boosted their confidence in correcting their own writing. Their various definitions of what made the
corpus a useful resource can be grouped into three categories: it provides examples of real language; it
allows exploration of the various uses of a given word in different contexts; and it illustrates the specific
functions of certain words and expressions in particular types of text. On the negative side, some stressed
the discouragement felt on not being able to understand all the examples or to identify relevant ones, and
most admitted that they had on occasion found searches too time consuming and frustrating. Our first
evaluation exercise was concerned with this aspect: what creates a successful investigation and what
causes unproductive searches and frustration.
EVALUATION: AIMS AND PROCEDURE
In view of the proficiency level of our students and our intention that they use CWIC and other corpora
outside the classroom, we were keen to understand how effectively they were able to use it on their own,
specifically the mechanics of their investigations and the difficulties they encountered. We found little to
inform us in developing an approach to such a study. Flowerdew (1996, p. 112) drew attention to "a
paucity of critical perspectives in concordancing literature," but his call for more in-depth evaluative work
does not appear to have borne fruit. Much has been written on what can be done with corpora in language
learning -- what kinds of investigations can be conducted with different types of corpora and what kinds
of discoveries are made, usually in a classroom context -- but relatively little in the literature on how
students actually do this, and especially on how they fare on their own.
Two of the studies we located, however, do reflect an interest in evaluating students' independent work.
Turnbull and Burston (1998) analyzed the aims and outcomes of investigations conducted by advanced
students after only minimal training with a concordancer, but mainly with the goal of demonstrating the
importance of adequate training. Closer to our purposes was Bernardini's (1998) examination of the
processes and outcomes of students' exploration of the British National Corpus, as a result of which she
outlined suggestions for making this kind of work more systematic. Among the tendencies she noted were
ignoring variants, not looking for alternative approaches when faced with an obstacle, and making only a
summary analysis. While our students were not involved in free exploration of a large corpus, we
anticipated that these kinds of problems were likely to characterize their work, too.
We chose to focus on our students' handling of problem-solving activities while revising a text as the first
stage of our evaluation since much of their work done in class had been like this. Essentially by asking
them to "show their work," we collected data on how they went about using concordances to answer
specific questions while correcting their own or others' work.
The 10 students referred to in the discussion that follows (S1 to S10) came from the top and middle ranks
of the cohort in terms of their achievement in our subjects. For the purposes of this paper, we numbered
them according to their results in the written Italian subject in which the corpus apprenticeship was
conducted as well as in its companion subject in spoken Italian. S1 was the top performer. Of the 10
students, 5 were enrolled in a languages and linguistics degree program, and the other 5 were studying
history, law, or psychology.
Some of the cases cited are drawn from activities individually carried out by the students during the
semester. Evidence comes from their own accounts of how they used CWIC, sometimes in tasks set by us
but oftentimes in those they set for themselves while in the process of editing their own compositions on
given topics.
The majority of the cases come from pair-work sessions held immediately after the end of semester,
which were video-recorded and followed immediately by an interview aimed at extracting a retrospective
account of the students' work. Eight students participated, and the sole criterion for pairing them was their
availability at particular times. They were given two texts to revise. In the first, we set specific tasks by
underlining certain words to indicate where there might be a problem. In the second, we invited them to
decide what issues to deal with for themselves.
We expected that in the investigations they initiated themselves the students would work on relatively
familiar language points, approached with some degree of confidence. The set tasks, on the other hand,
were intended to force them to address types of problems they might not otherwise tackle.
In both the individual and pair-work situations, all the texts we provided for the activities had been
selected from work submitted by students in that subject. Dictionaries and grammar books as well as the
corpus were available at all times. The students were encouraged to use all three resources as they deemed
appropriate.
RESULTS
Overview
We found that the students made many successful investigations, demonstrating a general appreciation of
the types of questions that can be posed, a certain ability to work by analogy, and a preparedness to
review their strategies when a search was leading nowhere. However, our concern was to identify what
went wrong or could be done more efficiently, in order to gain insight into how to improve the
apprenticeship. Our observations suggested that, while knowledge and experience of the language
undoubtedly played a part in how productive the students' work with the corpus was, lack of rigor in
observation and reasoning contributed greatly to their difficulties, as did apparent ignorance of common
pitfalls and techniques for avoiding them. We concluded that our training had not adequately equipped
them as "corpus researchers."
Our Analysis of Learner Investigations
In order to understand what happens in a corpus investigation, we approached it as a four-step process: (a)
formulating the question; (b) devising a search strategy; (c) observing the examples found and selecting
relevant ones; and (d) drawing conclusions. This schema is illustrated below with reference to one of the
set tasks from the pair-work sessions. The sentence concerned was Sto cercando l'orario per il corso
LAL3093 (I'm looking for the timetable for the subject LAL3093). The problem is the choice of
preposition: per is often used where for is used in English, but not in this context, where the pattern is
orario di. An appropriate and efficient way of dealing with the task is described in Table 2.
Table 2. Steps in a Corpus Investigation
1. Formulate the question: "Which preposition can be used after orario when speaking of a timetable for something?"
2. Devise a search strategy: Search on orario, with a view to checking what follows it.
3. Observe the examples and select relevant ones: Look for examples in which the idea timetable for something is expressed.
4. Draw conclusions: Check which word(s) are used with orario in those examples. Identify the combination orario di and insert it into the target sentence, making any necessary adaptations.
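Steps 2 and 4 of this schema can be approximated with a small keyword-in-context search, sketched below in Python. It is an illustration only: the four sentences are invented rather than drawn from CWIC, the function names are hypothetical, and the code does not reproduce the concordancer used in class. Step 3, deciding which examples actually express the idea of a timetable for something, remains a matter of human judgment that the code does not attempt.

```python
from collections import Counter
import re

# Invented example sentences; a real search would run over CWIC itself.
TOY_SENTENCES = [
    "Sai l'orario di apertura della biblioteca?",
    "Non trovo l'orario di ricevimento del professore.",
    "Ho un orario molto pesante questa settimana.",
    "Mi mandi l'orario di partenza del treno?",
]

def concordance(sentences, keyword, width=25):
    """Return (left context, keyword, right context) for each hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            left = sentence[:match.start()][-width:]
            right = sentence[match.end():][:width]
            lines.append((left, match.group(), right))
    return lines

def next_word_counts(lines):
    """Tally the first word to the right of the keyword in each hit."""
    counts = Counter()
    for _, _, right in lines:
        words = re.findall(r"\w+", right.lower())
        if words:
            counts[words[0]] += 1
    return counts

# Step 2: search on the keyword and display it in context.
hits = concordance(TOY_SENTENCES, "orario")
for left, key, right in hits:
    print(f"{left:>25} [{key}] {right}")

# Step 4: check which words follow the keyword.
print(next_word_counts(hits).most_common())
```

In the toy data, the third sentence is a hit for orario but does not express a timetable for something, so a learner would set it aside at Step 3 before drawing conclusions from the tally.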
Occasionally, the students' investigations did not conform exactly to this pattern, as they had no clear
question in mind at the outset of their search. This happened if they were working on a set task and had no
idea what the issue might be, so they performed a preliminary search on the underlined word or
neighboring ones, just to see what came up. If nothing attracted their attention, they abandoned the task,
but if they did notice something they formulated a question and then proceeded through the remaining
steps as outlined above.
The discussion that follows examines students' work on each step in some detail. We frequently use
specific cases to illustrate the types of problems that led to an unsuccessful outcome. Despite this focus on
what goes wrong, our intention is to convey how complex a corpus investigation is, rather than to present
the students' performance as unsatisfactory. We trust the analysis serves to highlight the specialized skills
the learners employed and the variety of factors they are required to bear in mind.
We have not included cases in which an unsuccessful outcome was caused by lack of linguistic
knowledge, although we recognize that proficiency is important, especially in Step 1 and Step 3. In Step
1, for instance, appreciation of whether it makes sense to ask a given question depends to some extent on
familiarity with the target language. In Step 3, of course, not understanding the examples can undermine
even an impeccably conducted investigation. However, our interest here is in identifying problems that
did not appear to result from inadequate proficiency and that could perhaps be overcome by appropriate
training. We, therefore, sum up the discussion of each step in the form of a list of tips for learners. We do
not present these as rules ready to be imparted to future groups of trainees, but envisage drawing them up
together with the students, through collective reflection on investigations carried out in class.
Step 1: Formulating the Question
Before examining what goes on in this step, it is important to note what types of questions were being
dealt with in the investigations. They were not free exploration questions such as "What can I find out
about x?" nor treasure-hunting questions like "In what ways can I express this function?" Instead, the
questions were aimed at checking or correcting a given sentence. Those we encountered in the students'
work were of just three types: (a) "What is/are the correct word(s) in this context to render this
meaning?"; (b) "What construction do I need around this word (or these words) in this context?"; and (c)
"What order should these words be in, in this context?" Each type can be expressed in open form, as
above, or in closed form. For example, two closed forms of the first type of question are: "Can x be used
to render this meaning in this context?" (yes/no form) and "Is x or y the correct word for this meaning in
this context?" (multiple choice). Clearly, we use the words "yes/no" here as shorthand for "there is / is not
evidence for this."
We found that some of the questions the students formulated suggested that they might have
misconceptions about the types of questions it is logical to ask, the kinds of information that can be
obtained from a corpus, and the ways clusters of words behave. We grouped the problems we identified
into five categories.
First, there was sometimes insufficient attention to how specific or general a question should be. When
dealing with Sto cercando l'orario per il corso, S3 and S6 asked "Can you say per il corso?" This is too
general a question; while the answer is "yes," this does not help decide whether these words can be used
in the given sentence. It seemed that the students needed to be more conscious of the fact that the actual
combinations of words used in a language are only a subset of the potential combinations (Gavioli, 1996,
p. 124). Reflecting on the implications of general or specific questions in their native language could help
students appreciate this problem (provided they do not assume answers can be transposed). For example,
"Can you say for the course?" is not a useful question in English either. We say prerequisites for the
course but aims of the course.
An unnecessarily general question may well eventually lead to a successful outcome, but the investigation
is likely to be inefficient, due to detours to deal with evidence in contexts not relevant to the case at hand.
We observed this in students' handling of one of the individual set tasks, that of choosing between Il
lunedì scorso siamo andati all'università and Lunedì scorso siamo andati all'università for Last Monday
we went to university. The issue is whether the definite article is used with lunedì scorso (last Monday)
and the answer is "no." Some asked the question "How does scorso (last) behave?" rather than "How does
lunedì scorso behave?" This meant dealing with scorso in several contexts, some with an article and some
without.
Second, the students often did not seem to consciously choose whether to frame their questions in open or
closed form. Primarily, they did not take into consideration that a closed question could lead them to a
dead end and the need for a follow-up question. This happened to S1 and S7: after dealing with their
question "Do you say orario per?" they found that they needed a second investigation aimed at answering
"So what do you use after orario?"
The third type of problem was apparent when a question arose only after looking at some examples. In
this situation, students sometimes failed to formulate the question explicitly. One of the set tasks in the
pair work was to check the sentence Auguri per il weekend, with which the writer had intended to say
something like Have a good weekend. Here, auguri is out of place: it usually corresponds more to best
wishes and is used for birthdays and other special occasions. S10 had an idea along those lines,
suggesting, "Maybe they don't say wishes for the weekend, maybe they mean wishes like
congratulations," but she and S8 did not turn that into the question, "So how do you say Have a good
weekend?" which might have led them to search on other words. They just continued to muse upon the
examples of auguri and eventually gave up.
Fourth, there was the fatal lure of prepositions. The students' attention was often attracted to a preposition
itself rather than to the words around it, on which it depended. In some cases, they treated a preposition as
having a meaning in isolation, or as being in one-to-one correspondence with an English preposition, such
as when S4 said to S9 "Doesn't da usually mean from?" Very common indeed was the habit of treating a
preposition as linked only to the words following it. For example, when correcting her own sentence Il
cane è troppo stanco … continuare il gioco (The dog is too tired to continue the game), S5 asked "What
preposition do I want before continuare?" rather than "How do I construct too <adjective> to do
something?"
Fifth, there was a tendency to neglect lexical considerations in favor of grammatical ones, to focus on
how to combine words rather than whether they could be used at all in the given context. For example,
when presented with Non mi sorprenderebbe imparare che ho fatto molti errori (It wouldn't surprise me
to learn I've made many mistakes), nearly all the students considered only the construction of the
sentence. They did not question whether imparare could be used for learn in this sense.
On the basis of our observations, an initial set of tips for Step 1 might be as shown in Table 3.
Table 3. Examples of Tips: Step 1
• Try to state your question precisely.
• Ensure it is specific enough for the situation you are dealing with.
• If it is in yes/no or multiple choice form, consider whether an open question would be more
appropriate. For example, rather than asking "Does y come after x?" you might want to ask "What
comes after x?"
• Keep in mind both lexical and grammatical issues.
• In your dealings with prepositions:
    • When considering what word(s) a preposition might be linked to, look both to the right and to the
    left, and to a distance of a few words.
    • If you are trying to choose a preposition for a particular context, remember the possibility that no
    preposition is required there.
Step 2: Devising a Search Strategy for a Given Question
We identified the components in the definition of a strategy as (a) choosing the word(s) to search on and
(b) deciding whether and how to use other options such as sorting examples or consulting a dictionary or
grammar book. Choosing the word(s) to search on is not necessarily just a matter of deciding which are
the key words in the question. It may entail picking words that can be substituted for these, such as
different forms of a lemma or words that belong to the same set (like days of the week, colors, possessive
pronouns).
Students did not always pay sufficient attention to exactly defining the construction they were dealing
with and therefore distinguishing its fixed and variable parts. Often this coincided with a certain difficulty
in framing the question. One example of many was the treatment of Non mi sorprenderebbe imparare (It
wouldn't surprise me to learn) by S1 and S7. They wondered whether a preposition is required between
the conjugated verb sorprenderebbe and the infinitive imparare. Their strategy was to search on
imparare. It did not seem to occur to them that it was the behavior of sorprendere that mattered, that the
construction is a variant of Non mi sorprenderebbe <infinitive> or, more generally, Non <object
pronoun> <conjugated form of sorprendere> <infinitive>.
Nor did students seem very concerned that a strategy be efficient. That is, they did not direct effort at
obtaining a workable number of examples -- not too many -- with as many as possible of them likely to be
relevant to the problem at hand. This means including as much as possible in a search combination
without, of course, prejudicing a successful result by making it too restrictive. During the pair work, S4
and S9 set themselves the task of deciding between niente da fare and niente di fare for nothing to do.
They searched on fare (to do) and sorted the examples so as to check on the left for di fare and da fare.
Since fare is present in a myriad of idiomatic expressions, it provided a host of irrelevant examples to sort
and scroll through.
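The difference between the broad and the narrower strategy can be sketched in a few lines of Python. This is an illustration only: the sentences are invented, and `kwic` is a hypothetical helper, not a function of CWIC or of any concordancer mentioned here.

```python
# Tiny invented sample; a real corpus would hold many thousands of sentences.
SENTENCES = [
    "Non ho niente da fare stasera",
    "Devo fare la spesa domani",
    "Non c'è niente da fare",
    "Vorrei fare una domanda",
    "Che cosa vuoi fare da grande",
]

def kwic(sentences, keyword):
    """Return (left context, keyword, right context) for each hit of keyword."""
    hits = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token.lower() == keyword:
                hits.append((" ".join(tokens[:i]), token, " ".join(tokens[i + 1:])))
    return hits

# Broad strategy: search on "fare" alone, then sort by the word just to its left.
broad = kwic(SENTENCES, "fare")
broad_sorted = sorted(
    broad, key=lambda hit: hit[0].split()[-1].lower() if hit[0] else ""
)

# Narrower strategy: keep only hits with "niente" somewhere in the left context.
narrow = [hit for hit in broad if "niente" in hit[0].lower().split()]

for left, key, right in narrow:
    print(f"{left:>20} [{key}] {right}")
```

Even in this toy sample the broad search returns five lines to sort and scan, while the narrower one returns only the two that bear on the question. In a real corpus, with fare occurring in many idiomatic expressions, the gap would presumably be far larger.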
Additionally, there were several cases of students overlooking the option of trying other forms of the key
word (or substituting another word for it altogether) in the event of not finding any examples. When
dealing with Italian nouns, verbs, and adjectives, if a first search fails to produce sufficient evidence, it
should be automatic to check different inflected forms. In one instance, S1 posed the question of whether
the adjective estrema (extreme, singular feminine) should precede or follow the noun and made her
decision on the basis of only one example. Had she searched on the masculine and plural forms as well
(or used the stem together with a wildcard character: estrem*), she would have had several examples to
consider, and she would have been able to detect the mobility of this adjective, with the choice of position
reflecting degree of emphasis. Another aspect of this issue is that if students did think to search on
another form of a verb, they tended to only try the infinitive, apparently transferring dictionary practice to
corpus use.
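As a rough sketch of what the wildcard search gains, the following Python fragment compares an exact search with a stem-plus-wildcard search. The token list is invented, and the shell-style matching from Python's standard `fnmatch` module is used here only as a stand-in; the wildcard syntax of a particular concordancer may differ.

```python
import fnmatch

# Invented word list standing in for the tokens of a corpus.
TOKENS = ["una", "misura", "estrema", "gli", "estremi", "rimedi", "estremo",
          "tentativo", "condizioni", "estreme", "estremamente", "esterno"]

def match_wildcard(tokens, pattern):
    """Return the tokens matching a shell-style wildcard pattern such as 'estrem*'."""
    return [t for t in tokens if fnmatch.fnmatchcase(t.lower(), pattern)]

exact = match_wildcard(TOKENS, "estrema")   # one inflected form only
wild = match_wildcard(TOKENS, "estrem*")    # stem plus wildcard

print(exact)
print(wild)
```

The wildcard search retrieves all four inflected forms of the adjective, but it also retrieves the adverb estremamente, so the examples still need to be checked for relevance.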
Clearly, there are many factors to take into account when devising a strategy, and it is not surprising that
the students did not always think of all possible ways of fine-tuning their approaches. There were several
occasions in which they neglected to use certain options to their best advantage. For example, when
searching for a combination of words, students sometimes forgot to specify if they were interested in the
words only when they were adjacent. This is quite simply achieved by setting a maximum-distance-apart
parameter to 1. We noticed the converse problem too, of setting this parameter to 1 automatically, without
considering whether the search words were likely to be separated in the examples by intervening words,
phrases or even clauses. Sorting features were also used somewhat indiscriminately. The words linked to
a keyword may well not be adjacent to it, and looking at sorted output sometimes distracted the students'
attention from useful examples.
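The effect of the maximum-distance-apart parameter can be illustrated with a minimal sketch. The function name `cooccurrences`, its parameters, and the example sentence are all assumptions made for illustration, and the function looks only to the right of the first word; it does not reproduce the behavior of any particular concordancing program.

```python
def cooccurrences(tokens, first, second, max_distance):
    """Find positions where `second` follows `first` within max_distance tokens.

    max_distance=1 means the two words must be adjacent.
    """
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # Tokens that lie 1..max_distance positions to the right of `first`.
        window = tokens[i + 1 : i + 1 + max_distance]
        if second in window:
            hits.append((i, i + 1 + window.index(second)))
    return hits

# Invented example in which the two search words are separated by other words.
tokens = "sto cercando l' orario delle lezioni del corso".split()

adjacent_only = cooccurrences(tokens, "orario", "corso", max_distance=1)
wider_window = cooccurrences(tokens, "orario", "corso", max_distance=4)

print(adjacent_only)
print(wider_window)
```

With the parameter set to 1 the example is missed altogether, whereas a window of four words finds it. Conversely, a wide window applied to a frequent word pair could return many chance co-occurrences.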
Finally, there were times when the students were apparently so engrossed in the corpus that they forgot to
use the dictionary. This was noticeable at moments when they realized they were dealing with a word that
did not have the desired meaning in a certain context. They got as far as checking the wrong word in the
corpus and establishing that there was no evidence for using it in the target sentence, but then simply
relied on their own memory or imagination in determining what to use instead, rather than reaching for
the English-Italian dictionary.
In light of these observations, we drew up a basic set of tips for Step 2, shown in Table 4.
Table 4. Examples of Tips: Step 2
• Think about how efficient your strategy will be. Is it likely to generate many irrelevant examples
alongside the useful ones? If so, maybe you should restrict your search further.
• Check if you are dealing with a variant of a general pattern, with a fixed part and a variable part, as
you may want to search only on the fixed part.
• If you are not satisfied with the examples found, think about using wildcards or substituting
something else for one of the search words: another form of the same lemma or a word that may be
equivalent in the context that interests you.
• Remember the English-Italian dictionary if you are looking for potentially appropriate words.
Step 3: Observing the Data and Selecting Examples
Surprisingly often, students lost sight of the importance of selecting examples with a view to matching
form and meaning closely to the requirements of the target sentence. For example, while editing a
sentence using the adverb ancora to mean "again," S8 investigated the behavior of ancora, saying that she
was interested in its position with respect to the verb it modified. Her eventual construction was fine, but
in none of the four examples she cited as her models did ancora have the meaning again nor did it always
modify a verb.
Most of the time, the students did check the meaning and structure of examples, but they were not always
bent on finding a close match, even when excellent evidence was readily available. It became clear to us
that there were specific traps for those who were anything less than rigorous in the selection of examples.
One was the distraction offered by a majority of examples being of one kind. In this situation, useful
examples belonging to a minority category were easily ignored. In the case of lunedì scorso, most
students who searched on scorso were attracted by the examples of il mese scorso (last month) and l'anno
scorso (last year), which include the definite article, and used these as their model. This was despite the
fact that 2 of the 15 examples found illustrated exactly what was needed, in the form of last Monday and
last Thursday, without the article. Another trap for students who did not attend closely to meaning was
the way some combinations turn up due to the chance juxtaposition of phrases, not because they form a
lexical phrase themselves.
A frequent problem was that of students not noticing something if it was not what they were looking for.
This was the case when S4 and S9 were trying to establish whether they should use cercare a for to look
for. They simply did not see the several examples on the screen of cercare used transitively to mean to
look for, because they were intent on choosing a preposition. The problem of not noticing all the
information given could be observed also at the moment of applying an example as a model in the target
sentence. In an individual task, S9 wanted to see what verb construction to use in After bringing the stick
back, and her first attempt included dopo restituendo (after <gerund>). She then looked up dopo and
found an example, which included dopo aver chiuso (after having closed), or dopo aver <past
participle>. However, she appeared to notice only the pattern dopo aver, and so she just inserted aver
into her first guess, producing the hybrid dopo aver restituendo. Some tips for Step 3 are shown in Table 5.
Table 5. Examples of Tips: Step 3
• Remember to check the meaning of examples you want to use as evidence, and seek out those that
most closely match the requirements of your target sentence.
• Try not to be influenced by assumptions about what you will see in the examples. Look to the left and
right of keywords to see which words are linked to them. The words you are expecting to find may
not be present, and vice versa.
• Try not to be attracted only to the types of usage of a word that occur most frequently. The type you
are interested in may be a less common case.
Step 4: Drawing Conclusions
The observations we made on this step primarily concern problems in reasoning, particularly the
implications students drew from the number of examples found by a search. When only one or very few
examples were found, the students tended to lack confidence in the result, evidently assuming that many
illustrative examples are necessary to establish a case. This reflected a lack of appreciation for the fact
that what matters is the quality rather than the quantity of examples. Depending on the type of question
addressed, one example that is suitably analogous to the target can be sufficient for a "watertight case."
On the other hand, of course, if only a few examples are found, they may be the result of chance
juxtapositions, the reality being that no relevant examples are present. This suggests a more general issue
about numbers of examples: If many turn up when few are expected, or vice versa, the significance of this
should be considered. The students sometimes expressed perplexity (for example, S10 said at one point,
"Wouldn't you think there'd be a lot more examples?") but failed to act on this dilemma.
Various invalid conclusions were drawn at times when no examples were found. These included, "The
phenomenon does not exist" in place of "There is no evidence for it in this corpus"; "The answer is not x
so it must be y" when it was only a matter of supposition that x and y should be the only options; and "The
search didn't work" or "We didn't find out anything," because the search had not produced the expected
results. Some tips based on these observations are shown in Table 6.
Table 6. Examples of Tips: Step 4
• Even if you have only one example as evidence, it may be enough on which to base your case.
Remember that what matters is how good your evidence is, not how much of it there is.
• If you have found only a few examples when you were expecting many, or vice versa, you may need
to think about what this means. Why were you expecting to find many or only a few? What has
affected the result?
• If you have found no examples, think carefully about what conclusion you can draw. Make sure you
relate your conclusion to the question that you initially posed.
WHERE TO FROM HERE?
In the investigations we analyzed, difficulty in understanding examples was very rarely the sole or even
the primary cause of invalid results. In fact, the above discussion is based entirely on cases in which it
was unlikely that the examples dealt with were hard for the students to understand. Furthermore, in the
pair work it was often the student with lower proficiency (as far as that can be measured by results in our
subjects) who appeared more competent in using the corpus to tackle a problem. There were many
instances in the pair work where S7, S9 and S10 led the way for S1, S4, and S8 respectively, by showing
insight in formulating a question, using clear reasoning in devising a strategy, or paying attention to
examples.
By this we do not mean to suggest that language proficiency is irrelevant nor to deny how daunting arrays
of examples can be. We simply intend to underline that, in each of the four steps, we identified specific
problems that seemed to be due to inadequate corpus-investigation skills. These were accompanied by an
evident lack of awareness on the students' part of how easy it is for an investigation to be derailed.
So the apprenticeship now appears far more complex than we had thought. The evaluation has highlighted
the need to focus on treating the students as trainee researchers. As in any other field of research, it is
necessary for novices to acquire certain attitudes and habits of reasoning. They need to become
acquainted with underlying principles and to master specific techniques, which are not necessarily
intuitive. We are, therefore, reviewing our approach in two main areas.
First, we are looking for ways to encourage students to distinguish between observation and interpretation
of data so as to try to free the observation phase of assumptions. To prepare the students to exploit the
"direct access to the data" that a corpus provides (Johns, 1991b, p. 30), we must convey to them the
importance of rigorous observation preceding any interpretation of what is observed. This means work
aimed at raising consciousness of the idea that language is made up of "lexical phrases" rather than single
words and that putting a sentence together is a matter of arranging patterns of words, attending to "the
ways they can be pieced together, along with the ways they vary and the situations in which they occur"
(Nattinger, 1980, p. 341). Careful observation in relation to these aspects can be expected to help
overcome assumptions.
In addition to including explicit observation exercises in the training program, we are also making a much
more general change. In order to "market" the benefits of observation to the students, we have decided to
entirely reverse the order of our approach so as to start with treasure-hunting and borrowing chunks of
appealing language while composing texts. Subsequently, we will move on to the use of concordances to
solve specific problems regarding word use, while revising texts. In this way, we mean to highlight, from
the outset, the value of a corpus as a database of whole texts and of models of complete utterances and set
phrases. We hope that by beginning with treasure-hunting we can encourage students to appreciate
exploration of the corpus without prior assumptions about the data that will be found and cultivate in
them a more open mind towards the ways strings of words belong together.
The second key aspect of our new approach, as foreshadowed in the preceding section, will be that of
engaging the students in reflection on the processes of their problem-solving investigations. Once they
have some experience in using the corpus to answer questions, we will introduce exercises -- perhaps
presented in the form of "spot what goes wrong" -- aimed at collectively deriving a checklist of tips along
the lines of those drafted above.
CONCLUSION
We recognize that during corpus investigations by language learners, there is considerable room for error
due to lack of knowledge of the target language. However, we propose that the development of
appropriate research habits -- incorporating observation and logical reasoning as well as techniques in
corpus searching -- could reduce other causes of error to a minimum. Although we do not go so far as to
suggest that learners need formal training in logic in preparation for corpus work, our evaluation of the
ways students go about problem-solving with CWIC has convinced us of the importance of an awareness
of logical principles applicable to this kind of operation.
The plan outlined above to revise our approach to training reflects this conviction. We expect that an
apprenticeship oriented toward the development of "corpus research" skills will not only help students
make the most of corpora but will also benefit other areas of their language learning, enhancing
their capabilities with other reference tools in particular. Our next step will be to examine the
effectiveness or lack thereof of the new approach, especially in work with CWIC on the Web.
NOTES
1. The Corpus of Italian Newspapers, available from the Oxford Text Archive at http://ota.ahds.ac.uk,
contains 1,200,000 words from four dailies.
2. A more detailed description of the corpus and compilation process is in a paper submitted for the
proceedings of the conference Teaching and Language Corpora 2000.
3. Some of the themes of magazine columns selected so far are health, education, personal problems,
young people's issues, pet care, home computing, current events, social issues, science, and spiritual
and theological questions. We have explored email lists belonging to groups of women, gays and
lesbians, animal liberationists, translators and interpreters, vegetarians, mountain climbers, Italians
overseas and fans of Totò, and on issues to do with politics, entertainment, current events, and
personal problems.
4. The composition of the corpus is roughly 50% email, 5% letters, 40% magazine material (including
letters from the public), and 5% film reviews. Non-professional writers account for over 75% of the
content. The number of texts by a single author ranges from 1-10 for most of these to 30-40 for
magazine column hosts.
ACKNOWLEDGMENTS
We thank Dr. Mike Levy and the anonymous reviewers for valuable feedback on an earlier version of this
paper.
CWIC was developed with the assistance of grants from Griffith University and the Australian
Government's Committee for University Teaching and Staff Development.
ABOUT THE AUTHORS
Claire Kennedy and Tiziana Miceli are lecturers in Italian at Griffith University in Brisbane.
Email: C.Kennedy@mailbox.gu.edu.au, T.Miceli@mailbox.gu.edu.au
REFERENCES
Aston, G. (1997). Enriching the learning environment: Corpora in ELT. In A. Wichmann, S. Fligelstone,
T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 51-64). London: Longman.
Aston, G., Gavioli, L., & Zanettin, F. (Eds.). (1998). Proceedings of corpus use and learning to translate
conference, University of Bologna, Bertinoro. Retrieved November 8, 2000, from the World Wide Web:
http://www.sslmit.unibo.it/cultpaps.
Bernardini, S. (1998). Systematising serendipity: Proposals for large-corpora concordancing with
language learners. Proceedings of TALC98 (pp. 12-16). Oxford, UK: Seacourt Press.
Flowerdew, J. (1996). Concordancing in language learning. In M. Pennington (Ed.), The power of CALL
(pp. 97-113). Houston, TX: Athelstan.
Gavioli, L. (1996). Corpus di testi e concordanze: Un nuovo strumento nella didattica delle lingue
straniere [Text corpora and concordances: A new tool for foreign language teaching]. Rassegna Italiana
di Linguistica Applicata, 2, 121-146.
Johns, T. (1988). Whence and whither classroom concordancing. In T. Bongaerts, P. De Haan, S. Lobbe,
& H. Wekker (Eds.), Computer applications in language learning (pp. 9-27). Dordrecht, The
Netherlands: Foris.
Johns, T. (1991a). Should you be persuaded: Two samples of data-driven learning materials. English
Language Research Journal, 4, 1-16.
Johns, T. (1991b). From printout to handout: Grammar and vocabulary teaching in the context of data-driven learning. English Language Research Journal, 4, 27-45.
Levy, M. (1992). Integrating computer-assisted language learning into a writing course. CAELL Journal,
3(1), 17-27.
Mparutsa, C., Love, A., & Morrison, A. (1991). Bringing concord to the ESP classroom. English
Language Research Journal, 4, 115-133.
Nattinger, J. (1980). A lexical phrase grammar for ESL. TESOL Quarterly, 14(3), 337-344.
Picchi, E. (1997). DBT3 Database Testuale. Consiglio Nazionale delle Ricerche, Italy. Distributed by
Lexis Progetti Editoriali s.r.l. See http://www.lexis.it.
Stevens, V. (1991). Classroom concordancing: Vocabulary materials derived from relevant, authentic
text. English for Special Purposes Journal, 10, 35-46.
Tribble, C. (1991). Concordancing and an EAP writing program. CAELL Journal, 1(2), 10-15.
Tribble, C. (1997). Improvising corpora for ELT: Quick-and-dirty ways of developing corpora for
language teaching. Paper presented at the first international conference "Practical Applications in
Language Corpora," University of Lodz, Poland. Retrieved November 8, 2000, from the World Wide
Web: http://ourworld.compuserve.com/homepages/Christopher_Tribble/Palc.htm#Top.
Tribble, C., & Jones, G. (1997). Concordances in the classroom: Using corpora in language education.
Houston, TX: Athelstan.
Turnbull, J., & Burston, J. (1998). Towards independent concordance work for students: Lessons from a
case study. ON-CALL, 12(2), 10-21.
Wylie, E., & Ingram, D. (1999). International second language proficiency ratings: Master general
proficiency version (English examples). Brisbane, Australia: Centre for Applied Linguistics and
Languages, Griffith University.
Language Learning & Technology
http://llt.msu.edu/vol5num3/thompson/
September 2001, Vol. 5, Num. 3
pp. 91-105
LOOKING AT CITATIONS:
USING CORPORA IN ENGLISH FOR ACADEMIC PURPOSES
Paul Thompson
Reading University
Chris Tribble
King's College London University & Reading University
ABSTRACT
Appropriate reference to other texts is an essential feature of most academic writing, and
we should expect courses in academic writing to sensitize students to the choices that are
available to them when they decide to refer to other texts. A brief review of popular EAP
writing textbooks finds, however, that attention is given mainly to surface features of
citation, focusing on quotation, summary, and paraphrase.
Analysis of a purpose-built corpus of academic text can reveal much about what writers
actually do, and can also generate rich speculation on why writers do what they do.
Extending Swales' (1990) division of citation forms into integral or non-integral, we
present a classification scheme and the results of applying this scheme to the coding of
academic texts in a corpus. The texts are doctoral theses, written in two departments:
Agricultural Botany and Agricultural Economics. The results lead into a comparison of
the citation practices of writers in different disciplines and the different rhetorical
practices of these disciplines. Comparison with Hyland (1999), which looks at citation
types in research articles, also indicates differences between genres.
We then look at examples of EAP student writing and apply the same analysis to these
texts. The results show that the novice writers use a limited range of citation types, and
we suggest that teaching should focus on extending the range of choices available to
students. Lastly, we introduce a number of class activities in which students conduct their
own analyses of citation practices in small corpora, to develop genre awareness, and we
evaluate these activities.
INTRODUCTION
The growing interest in the application of corpus tools in language education, and the spread of "data-driven learning" (Tim Johns' coinage), is evidenced by the papers in this edition of Language Learning
& Technology, and in recent publications (e.g., Burnard & McEnery, 2000). We will not, therefore,
review the development of classroom concordancing in this short article or argue for its relevance to
English language teaching, but will look immediately at an area of possible application for specific
corpora.1
In this article, we report on work that has used corpora to research a particular aspect of academic writing
(citation practices across the disciplines), consider how current ELT materials address the language
features that were the focus of this research, and show how corpus tools can be used to supplement
published materials, giving learners in EAP writing classes opportunities to extend their understanding of
this central aspect of academic discourse.
Copyright © 2001, ISSN 1094-3501
CORPUS-BASED RESEARCH INTO CITATION PRACTICES
Making references to the literature is an essential part of most academic writing, and it is also a source of
considerable difficulty for most novice writers (Borg, 2000; Campbell, 1990). Some of the reasons that
academic writers are expected to make references are to integrate the ideas of others into their arguments,
to indicate what is known about the subject of study already, or to point out the weaknesses in others'
arguments, aligning themselves with a particular camp/school/grouping. Novice writers may face
problems because they are not at the appropriate stage of cognitive or intellectual development (Britton,
Burgess, Martin, McLeod, & Rosen, 1975; Pennycook, 1996), or because of cultural factors (Connor,
1996; Fox, 1994). Failure to acknowledge the source of ideas can lead to charges of plagiarism, whereas
inexpert phrasing of reporting statements can lead to confused or misleading indication of both the
writer's, and the cited author's, stance (Groom, 2000).
Swales (1981, 1986, 1990) has pioneered the study of citation analysis from an applied linguistic
perspective. He created clear formal distinctions between non-integral and integral citation forms: The
former are citations that are outside the sentence, usually placed within brackets, and which play no
explicit grammatical role in the sentence, while the latter are those that play an explicit grammatical role
within a sentence. The citation at the beginning of this paragraph is an integral citation. He also used the
terms "short" and "extensive," to describe citations that are at a single sentence level and those that
encompass more than one sentence. These distinctions provide useful starting points but they do not
provide insights that will help student writers understand which citation type to use in which context.
Alongside Swales' work, there has been substantial research into the correlation of verb tense and voice in
reporting verbs with function (most notably Shaw, 1992, but also worthy of mention are Hanania &
Akhtar, 1985, and Malcolm, 1987).
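The formal distinction between integral and non-integral citations lends itself to a rough automatic first pass. The sketch below is an assumption-laden illustration, not the coding procedure used in any of the studies discussed: it assumes author-date referencing, treats a name followed by a bracketed year as integral and a name plus year inside the same brackets as non-integral, and will miss many real-world variants. The names `INTEGRAL`, `NON_INTEGRAL`, and `classify` are hypothetical.

```python
import re

# Author name outside the brackets, year(s) inside: "Swales (1981, 1986, 1990)".
INTEGRAL = re.compile(r"\b[A-Z][A-Za-z'\-]+\s+\(\d{4}[a-z]?(?:,\s*\d{4}[a-z]?)*\)")

# Author name and year inside the same brackets: "(Borg, 2000; Campbell, 1990)".
NON_INTEGRAL = re.compile(
    r"\([^()]*?[A-Z][A-Za-z'\-]+[^()]*?,\s*\d{4}[a-z]?[^()]*\)"
)

def classify(sentence):
    """Return the citation forms detected in a sentence (possibly both or none)."""
    labels = []
    if INTEGRAL.search(sentence):
        labels.append("integral")
    if NON_INTEGRAL.search(sentence):
        labels.append("non-integral")
    return labels

examples = [
    "Swales (1981, 1986, 1990) has pioneered the study of citation analysis.",
    "It is a source of difficulty for novice writers (Borg, 2000; Campbell, 1990).",
    "Hyland looked at citations in a corpus of research articles.",
]
for sentence in examples:
    print(classify(sentence), sentence)
```

A pass of this kind could at most pre-sort concordance lines for manual checking. Deciding what function a citation serves in context still requires a human reader.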
Analysis of academic text corpora has the potential to inform our knowledge about the different forms
and functions of citations in academic writing. Pickard (1995) used a small corpus of applied linguistics
articles to investigate the citation practices of "expert" writers. On the premise that novice writers tend to
overuse particular items in their references, such as "say," she investigated citation practices in the corpus
to find out what expert writers do. Using concordancing software, she was able to produce statistical
information to identify preferences among her writers for integral or for non-integral citation forms, and
to identify the different grammatical forms of integral citations (subject, agent, genitive noun phrase,
etc.). This was a useful preliminary study. The limitations were that the corpus was small, and there was
little discussion of the reasons why writers choose one form rather than any other; the categories are
based on syntactic rather than functional distinctions. More importantly, however, it is not clear whether
her discoveries about the practices of a small number of applied linguistics writers can be generalized to
"expert" writers across all the disciplines. It seems likely that writers in different disciplines follow
different rhetorical conventions and have different preferences.
Two recent studies of citation practices in academic texts that test this assumption are Hyland (1999) and
Thompson (2000). These two studies were based on the analysis of more substantially sized corpora, each
investigating a different genre of academic writing. Hyland looked at citations in a corpus of 80 research
articles, composed of 10 journal articles from each of eight disciplines (see Table 1 below for details), while
Thompson (2000) examined differences in citation practices in a corpus of doctoral theses. The latter
corpus contains 16 theses written in two departments at the University of Reading, 8 theses from the
Department of Agricultural Botany, and 8 from the Department of Agricultural and Food Economics.
Table 1. Number of Citations in Hyland (1999) and Thompson (2000) Corpora
Discipline                  Av. per paper     Per 1,000 words
Mechanical Engineering      27.5              7.3
Physics                     24.8              7.4
Electronic Engineering      42.8              8.4
Marketing                   94.9              10.1
Philosophy                  85.2              10.8
Applied Linguistics         75.3              10.8
Sociology                   104.0             12.5
Biology                     82.7              15.5
Discipline                  Av. per thesis    Per 1,000 words
Agricultural Botany         248.8             9.04
Agricultural Economics      333.5             5.25
Both studies investigated variation in practice in disciplinary discourses and made use of frequency and
concordance data to investigate dispersion, frequency, and patterning across large quantities of text. Table
1 shows the figures for instances of citation in the two corpora, with the middle column showing the
average number of citations per text, and the right column showing the average number of citations per
1,000 words of running text. The lower density of citations amongst the science and technology articles
(7.3-8.4) contrasted with higher incidence among the social science articles (10.1-12.5) while Biology
stood out as exceptional with 15.5. Hyland postulated a difference in practice here between "hard" and
"soft" disciplines, using terminology drawn from Becher (1989), and speculated that Biology stood out
from the other sciences because it is a relatively new discipline. The distinction between "hard" and "soft"
disciplines may, however, prove to be reductive; Becher himself prefers a multi-dimensional model with
added axes of "applied" and "pure," "rural" and "urban." The fact that the Biology texts are so markedly
different from the Physics and Engineering texts is evidence that the simple distinction between "hard"
and "soft" is inadequate.
As can be seen in Table 1, the density of citations in the doctoral theses is much lower. If we presume that
the Agricultural Botany theses should be roughly comparable to the Biology articles, the density in the
theses is only about three fifths of that in the articles (9.04 against 15.5 citations per 1,000 words).
And while there is no easy comparison between the Agricultural
Economics and any of the disciplines in Hyland's study, the figure of 5.25 is substantially lower than any
of the figures for the research articles. It can be seen, therefore, that the two genres are marked by
different degrees of use of citations. One explanation for this is that the types of texts produced in these
two genres are of different lengths: Articles usually average between 2,000 and 5,000 words, while in
Thompson's study, the average length of an Agricultural Botany thesis was 31,000 words, and the average
length of an Agricultural Economics thesis was 63,000 words. As articles are shorter texts, there is presumably a need
for a more condensed style of writing.
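The density comparison above rests on simple ratios, and a short sketch may make the arithmetic easier to check. The function name and the 50-citation example are illustrative assumptions; the three density figures are taken from Table 1.

```python
def citations_per_thousand(citations: float, words: float) -> float:
    """Citation density: number of citations per 1,000 words of running text."""
    return citations / words * 1000


# Hypothetical text: 50 citations in 5,000 words gives a density of 10.0.
example_density = citations_per_thousand(50, 5000)

# Densities reported in Table 1 (citations per 1,000 words).
biology_articles = 15.5
ag_botany_theses = 9.04
ag_economics_theses = 5.25

# Each thesis density expressed as a share of the Biology article density.
botany_share = ag_botany_theses / biology_articles
economics_share = ag_economics_theses / biology_articles

print(f"Agricultural Botany theses: {botany_share:.2f} of the Biology density")
print(f"Agricultural Economics theses: {economics_share:.2f} of the Biology density")
```

On these figures the Agricultural Botany density is a little under three fifths of the Biology density, and the Agricultural Economics density roughly one third of it.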
Table 2 shows the relative percentages of the two types of citation, integral and non-integral, in Hyland
(1999) and Thompson (2000). These figures show firstly that there is considerable variation in citation
practice between the different disciplines, with Philosophy being the only discipline that prefers the
integral form over the non-integral, greater emphasis being placed on the arguments of different
individuals. Secondly, it is interesting to note that in the case of the Agricultural Economics thesis
writers, the integral type was also preferred. Although no direct comparison can be made between
Agricultural Economics and the disciplines in Hyland's study, one would not expect Agricultural
Economics to be closest to Philosophy. A more plausible explanation is that citation practices in the two
genres are different: Thesis writers in Agricultural Economics make greater use of integral citations for
reasons that become clear from closer reading of the texts. One obvious point is that the length of texts in
the two genres is markedly different: the articles in Hyland's corpus range from 3 to 31 pages in length,
whereas the Agricultural Economics theses in Thompson's corpus are around 200 pages long. In long
texts, such as the Agricultural Economics theses, or in book length treatments of research, there is a
higher likelihood that references to leading researchers in the field will be elaborated and give greater
prominence to the author(s).2
Table 2. Ratios of Non-Integral to Integral Citations by Discipline in Hyland (1999) and Thompson (2000)

Discipline                 Non-integral (%)   Integral (%)
Biology                    90.2               9.8
Electronic Engineering     84.3               15.7
Physics                    83.1               16.9
Mechanical Engineering     71.3               28.7
Marketing                  70.3               29.7
Applied Linguistics        65.6               34.4
Sociology                  64.6               35.4
Philosophy                 35.4               64.6
Doctoral theses
Agricultural Botany        66.5               33.5
Agricultural Economics     38.1               61.9

Table 3 below shows the percentage of citations in the two corpora that incorporate direct quotation from
the source text. It is clear from these figures that quotation is a relatively common feature in the social
science and humanities texts but that it is scarcely used in the science texts. Where quotation is used in the
science texts (viz. the 0.8% figure in the Agricultural Botany column), the citation is a definition, while
many of the Agricultural Economics quotations are evaluative comments.
Table 3. Sample Percentages of Citations in Two Corpora That Include Direct Quotation

Articles (Hyland, 1999)
  Biology                   0
  Electronic Engineering    0
  Sociology                 13
  Applied Linguistics       10
Doctoral theses (Thompson, 2000)
  Agricultural Botany       0.8
  Agricultural Economics    8

The statistics reported here suggest that there are clear divergences in the citation practices of writers in
different disciplines, and also between genres of academic writing.
The level of analysis at this stage, however, restricts the kinds of questions that one can ask, and it is
necessary to develop a more sensitive set of categorisations. In the next section, we describe a set of
categories drawn from Thompson (2000), and we then proceed to pose further questions.
NON-INTEGRAL CITATION
Source
Non-integral citations perform a range of functions. The first function is to attribute a proposition to
another author. The proposition might be a statement of what is known to be true, such as in the factive
report of findings in other research, or the attribution of an idea to another, as in this example:
Citation is central ... because it can provide justification for arguments (Gilbert, 1976)
The citation provides evidence for a proposition which can remain unchallenged if the writer is in
agreement with it, or can be countered by the ensuing argument. Let us call this type of citation source
because it indicates where the idea comes from.
Identification
The second type of non-integral citation identifies an agent within the sentence it refers to. An example of
this is
A simulation model has therefore been developed to incorporate all the important features
in the population dynamics (Potts, 1980)3
where the information within the parentheses identifies the author of the study referred to. Instead of
including the name of the author within the sentence ("Potts [1980] has developed..." or "A simulation
model has been developed by Potts [1980]..."), the writer has chosen to focus attention on the information
(Weissberg & Buker, 1990, differentiate between author- and information-prominent citations).
Reference
This type of citation is usually signalled by the inclusion of the directive "see" as in
DFID has changed its policy recently with regard to ELT (see DFID, 1998).
This type of citation is often similar to a source citation in that it can provide support for the proposition
made, but it also functions as a shorthand device: Rather than provide the information in the present text,
the writer refers the reader to another text. This type is particularly common in reference to procedures or
to detailed proofs of arguments which are considered too lengthy to be repeated.
Origin
An example of this type is
The software package used was Wordsmith Tools (Scott, 1996).
Where Source citations attribute a proposition to a source, Origin citations indicate the originator of a
concept or a product - in this case the creator of the Wordsmith Tools programme.4
INTEGRAL CITATIONS
A clear distinction can be made between integral citations which control a lexical verb (Verb controlling)
and those that do not (Naming). A third type is the reference to a person that is not a full citation -- this
has been called a Non-citation form.
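Before turning to the integral sub-types, it may help to note that the broad split between non-integral and integral forms can be roughed out mechanically in author-date texts. The sketch below is only a heuristic under stated assumptions: it treats a bracketed year with a capitalised name inside the brackets as non-integral, a bracketed year on its own as integral, and a bracket opening with "see" as a Reference citation. The function name and the regular expression are illustrative, and the finer categories (Source, Identification, Origin, Verb controlling, Naming) still require reading the sentence.

```python
import re

# A bracketed span containing a four-digit year, e.g. "(Gilbert, 1976)" or "(1985)".
PAREN_WITH_YEAR = re.compile(r"\(([^()]*\b(?:1[89]|20)\d{2}\b[^()]*)\)")


def rough_citation_type(sentence: str) -> str:
    """Guess the broad citation form of the first dated citation in a sentence."""
    match = PAREN_WITH_YEAR.search(sentence)
    if match is None:
        # Covers plain prose and name-only mentions without a year.
        return "no dated citation found"
    inside = match.group(1).strip()
    if inside.lower().startswith("see "):
        return "non-integral (Reference)"
    # A capital letter inside the brackets suggests the author's name is there too.
    if re.search(r"[A-Z]", inside):
        return "non-integral"
    return "integral"


examples = [
    "Citation is central ... because it can provide justification for arguments (Gilbert, 1976)",
    "DFID has changed its policy recently with regard to ELT (see DFID, 1998).",
    "Davis and Olson (1985) define a management information system more precisely as...",
]
for sentence in examples:
    print(rough_citation_type(sentence), "<-", sentence)
```

A first pass of this kind could sort concordance lines into two piles for manual categorisation; it is not a substitute for that categorisation.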
Verb Controlling
The citation acts as the agent that controls a verb, in active or passive voice, as in
Davis and Olson (1985) define a management information system more precisely as...
Naming
In Naming citations, the citation is a noun phrase or a part of a noun phrase. The distinction here is
primarily grammatical but the form also implies a reification, such as when the noun phrase signifies a
text, rather than a human agent:
Typical price elasticities of demand for poultry products in Canada, Germany and the UK
are shown in Harling and Thompson (1983)
Another example of reification is when the naming citation identifies a particular equation, method,
formulation or similar construct with individual researchers, as in
In this paper, the management information system (MIS) definition of Davis and Olson
(1985) has been used.
An alternative type of naming citation is that which refers generally to the work or findings of particular
researchers:
Work by Samuel and East (1990) demonstrated that variety and seed rate had
considerable effects on yield and quality aspects
In this case, the naming citation is similar to a verb-controlling citation in that it reports work done by
particular researchers.
Non-citation
There is a reference to another writer but the name is given without a year reference. It is most commonly
used when the reference has been supplied earlier in the text and the writer does not want to repeat it. For
example
The "classical" form of the disease, described by Marek, causes significant mortality
losses.
However, instances where a person was invoked through reference to the thinking associated with them in
general, rather than with reference to a specific work or set of works (for example, "Marxist" or
"Darwinian") are not included.
FURTHER EXPLORATION
Employing these categories, it is possible to explore a number of questions about the theses examined in
Thompson (2000):
Q1. Are there differences in the types of non-integral, or integral, citations used by writers in
different disciplines?
Figure 1. Proportion of citation types used in the two disciplines
As shown in Figure 1, writers in Agricultural Botany use the non-integral Source and Identification types much
more frequently, while the Agricultural Economists make far greater use of integral Naming citations
(reasons for which become apparent in Q4 below) and also make more mentions of names without giving
full citation information.
Q2. Are there differences in the practices of writers within the same discipline?
Figure 2. The average number of different citation types per 1,000 words of text found in the eight
Agricultural Botany theses
As can be seen in Figure 2, the density of citations in the individual Agricultural Botany theses varies
from just under 5 per 1,000 words (TAB5) to around 13 (TAB2 and TAB6). TAB7 uses Verb-controlling
citation types far more than any of the other writers, and far fewer non-integral citation types.
Examination of this thesis reveals that the writer makes frequent reference to individual studies and
compares their findings to his own experiments (X found this, and Y reported this. My findings were ...).
TAB6, by contrast, uses predominantly non-integral citation forms, and prefers to make information
prominent through use of the Identification citation rather than the integral Verb controlling type. TAB6
is a report of a laboratory-based investigation of innovative techniques for isolation of vacuoles, and
therefore the emphasis is on the techniques, and the subject of study, that is, the vacuoles. Different
writers within one discipline, then, take different approaches to research, and their rhetorical choices are,
to a degree, determined by the nature of the research that they conduct.
Q3. Are different types of citation used in different rhetorical sections?
In the Agricultural Botany theses, it was possible to divide the texts into four types of rhetorical section,
following the conventions that are common in most scientific reports: Introduction, Methods, Results,
Discussion. As can be seen in Table 4, there is considerable variation in the different sections of the
theses, with relatively low use of citations in the Methods and Results sections of the thesis, and a
markedly different set of citation types in the case of the Methods sections. To understand these
variations, it is helpful to think of the hourglass model proposed by Hill, Soppelsa, and West (1982): the
Introduction and Discussion sections of an article take a broad view, relating what is known in the field at
large, while the Methods and Results sections are narrow, focussing on the research itself. While the
Introduction and Discussion sections contain many references to other studies to establish the current state
of knowledge and where the current report fits in, the Methods section contains mainly references to the
methods and techniques of others.
Table 4. Citation Types in Different Rhetorical Sections of AB Theses

Section        Density (per 1,000 words)   Most common types of citation
Introduction   15.6                        Source, Identification, Verb controlling
Methods        2.3                         Refer, Origin, Naming
Results        2.4                         Source (52%)
Discussion     10.1                        Source, Identification, Verb controlling

This data shows that there is, then, variation in the density and type of citations used in different rhetorical
sections of a thesis, and similar variation has been found across rhetorical sections in Physics, Chemistry
and Biology masters' theses (Hanania & Akhtar, 1985).
Table 5. The Number of Occurrences of Naming Citations in the Two Disciplinary Groupings

RI Naming            AB     AE
Total occurrences    116    484

Q4. Are there differences in patterns of language around particular citation types?
Close inspection of the different kinds of Naming citation in the theses revealed interesting differences in
the discourses of the two disciplines. Firstly, in terms of simple frequency, it can be seen from Table 5
that this citation type is much more commonly used (more than four times as often) in the Agricultural
Economics texts. In order to find out why this might be the case, concordance lines of the Naming citation
type were examined. It was observed that certain patterns were regularly used, such as the three shown in
Table 6.
Table 6. The Number of Occurrences of a Pattern in the Thesis Corpus of Preposition + Naming Citation

Pattern          Agricultural Botany   Agricultural Economics
...in X (1991)   3 (12)*               58
...of X (1991)   37 (154)              70
...by X (1991)   25 (104)              29

* In the middle column, the figure in brackets shows an adjusted figure which would make the amount
equivalent to the figure in the right column (n*484/116).
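The adjustment in the note to Table 6 can be reproduced in a few lines. This is a minimal sketch: the variable and function names are illustrative, while the totals (116 and 484) come from Table 5 and the raw counts from Table 6.

```python
AB_NAMING_TOTAL = 116  # Naming citations in the Agricultural Botany theses (Table 5)
AE_NAMING_TOTAL = 484  # Naming citations in the Agricultural Economics theses (Table 5)

# Raw counts of each preposition + Naming citation pattern (Table 6).
ab_raw_counts = {"in X (1991)": 3, "of X (1991)": 37, "by X (1991)": 25}
ae_counts = {"in X (1991)": 58, "of X (1991)": 70, "by X (1991)": 29}


def adjust(n: int) -> float:
    """Scale an Agricultural Botany count to the Agricultural Economics total (n*484/116)."""
    return n * AE_NAMING_TOTAL / AB_NAMING_TOTAL


for pattern, n in ab_raw_counts.items():
    print(f"{pattern}: AB raw {n}, AB adjusted {adjust(n):.1f}, AE {ae_counts[pattern]}")
```

The scaled values come out at about 12.5, 154.4, and 104.3; dropping the fractional part gives 12, 154, and 104, the bracketed figures in Table 6.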
The pattern "in X (1991)" is clearly much more commonly used in the Agricultural Economics theses.
The use of the preposition "in" indicates that the citation is a reference to a book, and this is supported by
the examples given in Table 7. In the Agricultural Botany theses on the other hand, "of" and "by" are
more commonly used and these tend to refer to the research actions, findings, methods, and techniques of
other researchers. Where Agricultural Economics thesis writers use "of," it is noticeable that this also
includes discourse nouns, such as views and suggestions, which are not found in the Agricultural Botany
texts. The Agricultural Economics writers, therefore, appear to be concerned with the texts and concepts
of others, while the Agricultural Botany writers make reference to the research activities and techniques
of other scientists.
Table 7. Frequent Patterns Involving "in," "by," and "of" in the Theses
REASONS FOR VARIATION
We have seen from the quantitative data that there are substantial differences in citation practices between
disciplines and between genres. The types of research work undertaken, the epistemological bases upon
which this research is founded, the conventions of the discipline, and the purposes for which texts are
created all influence the forms of citation made. Looking at citation from a micro-perspective, however,
one might naturally ask, "What is it that leads a writer to choose one citation form over another?" Why,
for example, did we choose, earlier in this paper, to write "Two recent studies of citation practices in
academic texts that test this assumption are Hyland (1999) and Thompson (2000)," rather than "Hyland
(1999) and Thompson (2000) are two recent studies of citation practices in academic texts that test this
assumption"?
Our reason in this case was that we wanted to place the noun phrase beginning "two recent studies" in
theme position within the sentence. Shaw (1992) has observed that this is commonly the factor that
determines voice (active/passive) in reporting verbs in sentence construction. The choice between a
non-integral Identification type ("A simulation model has therefore been developed to incorporate all the
important features in the population dynamics [Potts, 1980]") and a Verb-controlling type ("A
simulation model has been developed by Potts [1980] ...") is often governed by decisions as to how much
prominence to give to the people involved (cf. Weissberg & Buker, 1990). To a certain extent,
disciplinary convention plays a part here; it is conventional in scientific writing to de-emphasize the role
of the researchers, particularly in controlled experiments, where the claim is that the human factor is not
consequential (Dr. Philip John, School of Plant Sciences, University of Reading, personal
communication).
WHAT DO EAP TEXTBOOKS SAY ABOUT CITATIONS?
In the previous sections we have outlined a number of research findings regarding both the kinds of
citations that are used in "expert performances" (Bazerman, 1994, p. 131) and the reasons for their use.
Our next task is to review what kinds of advice or models are provided in published materials for EAP
students, and to assess the extent to which these might need complementing. Three widely used EAP
course books were selected for this purpose: Jordan (1992), Trzeciak and Mackay (1994), and Swales and
Feak (1994). In summary, the course books provided surprisingly little advice or guidance to learners.
Jordan (1992) offers little explicit advice and depends mainly on quotations to provide models for
learners to work from. However, as Jordan only exemplified three kinds of citation it cannot be
considered a sufficient treatment of the subject:
1. non-integral "...(Seers, 1979, pp. 27-28) a further dimension is added - 'development now
implies, inter alia, that...'"
2. integral - naming "...For Seers, 'Development is inevitably a normative concept' ... (Seers,
1972, p 22)"
3. integral - verb controlling "... Hicks and Streeten (1979, p 568) identify and review four
different approaches..."
Similarly, Trzeciak and Mackay (1994) comment on only three kinds of citation:
1. integral - verb controlling ...Reporting using paraphrase
2. non-integral - identification ...Reference to source
3. integral - other ...Direct quotation...
But again, they offer little in the way of clear guidance to the apprentice writer and do not draw their
attention to disciplinary differences.
Swales and Feak (1994) give a relatively fuller range of advice and examples, and discuss the contrast
between non-integral/integral and footnote styles. However, they make no comment on the implications
of using contrasting forms, and fall back on references to APA and MLA style guides. They do, however,
usefully comment on the role of citation in abstracts.
In conclusion, it is possible to say that little explicit advice is given in major teaching materials on how to
manage citations in specific disciplines. Instead, there is an emphasis on summary, paraphrase, and
quotation, and on a small set of the mechanical features associated with citation. How then can students
learn more about citation practices in their own subject area?
USING MICRO-CORPORA TO COMPLEMENT EAP WRITING PROGRAMMES
Arguments have been made for the development of micro-corpora as resources for use in EAP
programmes (Hyland, 2000; Tribble, 2001), and a corpus-informed approach appears to have much to
recommend it so long as relevant data are available. The need for such support is reinforced when
student use of citations is investigated. In the preparation of this paper we reviewed a small collection of
student assignments written at Reading University,5 and identified the following problems:
• Lack of variety of citation types within single texts (e.g., the repeated use of "According to...")
• Lack of linguistic variety + inappropriate selection of verb (e.g., inappropriate use of "claims")
• Absence of certain categories (e.g., Non-integral reference)
• Over-use of non-citational references to authors / authorities
These findings (supported by extensive experience of teaching EAP students) indicate that two kinds of
resource will be of benefit to learners: firstly, a collection of their own writing, or the writing of their
peers -- a "learner corpus" (Granger, 1998), and, secondly, a collection of examples of writing from the
target discourse community (e.g., research articles/dissertations, etc., from the students' own field of
study), or texts as closely analogous to this kind of writing as possible (e.g., student examination scripts --
which are notoriously difficult to get hold of). While the collection of such data banks used to be difficult
and time consuming, with the use of word-processors by students and the growing availability of
electronic texts from the WWW or low cost scanners, and accurate optical character recognition (OCR)
programs,6 these restrictions no longer really apply.
With appropriate text resources to hand, it is relatively easy for teachers and students to begin a
systematic investigation of citation practice in genres that are relevant to their own needs or interests. This
need not require the use of a concordancing program; setting the search function in a word-processor such
as Microsoft Word® to look for "(19" or "(20" with the "Find whole words only" un-checked will provide
rapid access to the dated citations in a text, as will a search for a list of names based on the bibliography
in an article. Obviously, more powerful searching and analysis of the results will be possible with a
dedicated concordancing program.7
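For readers who prefer a script to a word-processor, the same two searches can be sketched as plain substring lookups. The function names and the sample sentences are illustrative assumptions; only the search strings "(19" and "(20" and the idea of a name list drawn from the bibliography come from the procedure described above.

```python
def find_dated_citations(text, markers=("(19", "(20"), context=40):
    """Return (position, snippet) pairs for every occurrence of a marker string."""
    hits = []
    for marker in markers:
        start = text.find(marker)
        while start != -1:
            left = max(0, start - context)
            right = min(len(text), start + context)
            hits.append((start, text[left:right]))
            start = text.find(marker, start + 1)
    return sorted(hits)


def find_names(text, names):
    """Count how often each bibliography name occurs in the text."""
    return {name: text.count(name) for name in names}


sample = (
    "Citation is central because it can provide justification for arguments (Gilbert, 1976). "
    "Davis and Olson (1985) define a management information system more precisely."
)

for position, snippet in find_dated_citations(sample):
    print(position, snippet)
print(find_names(sample, ["Gilbert", "Olson", "Potts"]))
```

In this sample the bracket search finds only "(1985", because in "(Gilbert, 1976)" the year does not follow the bracket directly. The name search picks up both citations, which is why the two searches are best used together.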
An appropriate procedure will be
Stage 1  learners are introduced to a range of citation forms appropriate to their level of study
Stage 2  learners investigate actual practice in relevant texts, reporting back on the form and purpose of
         citations they identify
Stage 3  learners investigate the practices of their peers in writing assignments
Stage 4  learners review their own writing and revise in the light of these investigations.
As an example of how such a procedure can be used in an EAP programme we have drawn on the British
National Corpus8 -- making use of Dave Lee's BNC Index (see Lee article in this issue) to make a
micro-corpus of 22 extracts from one academic journal -- Language and Literature.9 The assumption in this case
has been that the texts will be of interest to post-graduate humanities EAP students who are (a) interested
in extending the ways in which they word citations, and who (b) wish to ensure that they are writing in a
way that is appropriate for their field of study. It is possible for a teacher to use the four stage procedure
outlined above to develop a set of learning materials which will achieve this end.
In Stage 1, students will be familiarised with the citation categories we have discussed in this article. In
Stage 2, they will work in different groups to complete a task such as the one given below. The worksheet
was prepared using Wordsmith Tools (Scott, 1996) to find the citations in each article, so that students
can be asked to compare citational practice across comparable texts in a narrow focus disciplinary
context. In this instance we used a simple "catch-all" search string 19??)/??), that is, search for any five
character string beginning with 19 -- remember this is a pre-21st century corpus -- and ending with a
closing bracket, and any three character string ending with a closing bracket to catch other forms. Using
this method, we located 112 citations in the 22 texts.
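The logic of this search string can be approximated with a regular expression. The sketch below assumes that "?" stands for exactly one character and "/" separates alternatives, as the description above implies. It works on raw characters rather than on words, so it should not be read as a reproduction of how Wordsmith Tools itself processes the query. The function name is illustrative.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a search string in which '?' is one character and '/' separates alternatives."""
    alternatives = []
    for alternative in pattern.split("/"):
        alternatives.append(
            "".join("." if ch == "?" else re.escape(ch) for ch in alternative)
        )
    return re.compile("|".join(alternatives))


CITATION_SEARCH = wildcard_to_regex("19??)/??)")

lines = [
    "This approach follows van Dijk (1977) in regarding not only sentences",
    "Eikmeyer (1989: 27) concludes his paper by introducing a subjectivity condition",
]
for line in lines:
    print(CITATION_SEARCH.findall(line), "<-", line)
```

The first line is caught by the five-character alternative and the second only by the three-character one. Because the three-character alternative matches any two characters before a closing bracket, every hit still needs to be checked by eye.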
Table 8. Citation Worksheet
(Each example is followed by its source text and citation category.)

1. iety of possible surface realisations of that type of isomorphic relation we
know as textual metaphor. Christine Brooke-Rose (1958), in her A grammar
of metaphor, provides a classification of forms of metaphor. However, her
categorisation is unprinci
   Text: j7f    Cit. Cat: Integral Verb Controlling

2. information on the author from an asserted privileged position (much of
the methodology and work of F.R. Leavis (e.g. 1936, 1967) and his followers
is characterised by this approach). </p> <p> The two (realistic) perspectives
remaining are positi
   Text: j7f    Cit. Cat: Non-Integral Refer

3. hur C. Clarke's 2001: a space odyssey, a novelisation of the film
screenplay written by Stanley Kubrick and Arthur C. Clarke (1968).
</p> <p> This passage cannot be characterised as prototypical SF. It does
not deal with aliens, space, technology,
   Text: j7f    Cit. Cat: Integral Naming

4. ism and claims to objectivity that have been increasingly questioned in the
past twenty or so years (by writers from Derrida (1975) to Lakoff (1987)).
</p> <p> The middle way seeks to formalise, or at least make explicit,
normative patterns in the
   Text: j7f    Cit. Cat: Integral Naming

5. of the overall communicative process involving an utterer and a receiver,
very much in the implied spirit of Grice's (1971, 1975) projection of that
communicative situation. However, the ordered pair of functions (f1 and f2)
that are associated with
   Text: j7f    Cit. Cat: Integral Naming

6. intuitions. The value of the model is not only in recasting the traditional
notion of the Co-operative Principle (from Grice 1975), but also in
describing the resolution of meaning as a principled negotiation between text
and reader. The resolving st
   Text: j7f    Cit. Cat: Non-Integral Refer

7. though a holistic perspective is taken on literary stylistics in addressing
science fiction. This approach follows van Dijk (1977) in regarding not only
sentences but also textuality as the proper study of linguistics. In this,
continental European
   Text: j7f    Cit. Cat: Integral Naming

8. ext-world to their cognitive universe (based on their previous familiarity
with the patterns typical of the genre). Eikmeyer (1989), in a paper from a
conference on coherence, points out that reader interpretation depends on the
depth of understandin
   Text: j7f    Cit. Cat: Integral Verb Controlling

9. cott, who filmed his reading of the novel as Blade runner in the early
1980s. Though David Newnham, in the Guardian (24 July 1990), calls the
film post- modernism: the movie, Scott's version is little more than a violent
adventure story. Indiana Jone
   Text: j7f    Cit. Cat: Integral Verb Controlling

10. for example, involves judgements based on textual factors such as the
narrative point of view (Fowler 1986: 127-46; Simpson 1990), the
presentation of verbalisation (Leech and Short 1981: 318-51), the degree of
non-actualised propositions (Leech 198
   Text: j7f    Cit. Cat: Non-Integral Source

11. as often in literary discourse, author and reader are considerably
separated by space and time. </p> <p> Eikmeyer (1989: 27) concludes his
paper by introducing a subjectivity condition by which the participants judge
the values of each parameter.
   Text: j7f    Cit. Cat: Integral Verb Controlling

12. the point of view of the reader's judgement of parameters is taken.
Deriving from (and slightly correcting) Eikmeyer (1989: 27), the
prototypically co-operative parameters for the reader are: where J is the
judgement or subjectivity condition. This
   Text: j7f    Cit. Cat: Integral Naming

The task for each group in Stage 2 is, therefore, to categorise the citations identified in each article, and
then to pool results and present a summary of the range, purpose and forms of the citations that occur in
this micro-corpus (the categorisations have been provided in this worked-up example).
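Once a group has categorised its citations, pooling the results is a simple tallying exercise. As an illustration, the sketch below counts the twelve categorisations given in the Table 8 worksheet; the variable names are illustrative, and the counts describe only those twelve worked examples, not the full set of 112 citations.

```python
from collections import Counter

# Categories assigned to worksheet examples 1-12 in Table 8, in order.
worksheet_categories = [
    "Integral Verb Controlling",  # 1
    "Non-Integral Refer",         # 2
    "Integral Naming",            # 3
    "Integral Naming",            # 4
    "Integral Naming",            # 5
    "Non-Integral Refer",         # 6
    "Integral Naming",            # 7
    "Integral Verb Controlling",  # 8
    "Integral Verb Controlling",  # 9
    "Non-Integral Source",        # 10
    "Integral Verb Controlling",  # 11
    "Integral Naming",            # 12
]

tally = Counter(worksheet_categories)
integral_total = sum(n for category, n in tally.items() if category.startswith("Integral"))

for category, n in tally.most_common():
    print(f"{category}: {n}")
print(f"Integral citations: {integral_total} of {len(worksheet_categories)}")
```

For this extract the tally shows nine integral citations out of twelve, with Naming the most frequent category.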
In Stage 3 students will review their own citational practices (either using Wordsmith Tools to extract
examples for analysis, or being provided with materials prepared by their teacher). Stage 4 will be
ongoing and will involve a cycle of check-list supported peer review and self evaluation, supported by tutor
comment on writing assignments or departmental work.
CONCLUSION
In this paper we have described a range of citational practices in academic writing along with their
linguistic realisations. We have also reviewed the extent to which published teaching materials provide
learners with opportunities to develop their understanding of, and capacity to form, appropriate citations
in their own writing, and found that, at the moment, these offer relatively little constructive support to
apprentice writers. The need for such support has been underscored by a survey of a small number of
EAP texts written by students on a pre-sessional course at a UK university. If teachers of English for
Academic Purposes are to be able to help learners develop a better control of this essential academic
writing knowledge/skill, we would recommend the accumulation of relevant collections of field specific
texts as a resource for teachers and students of academic writing. By analysing these texts with word
processing software or dedicated corpus tools (or by working with the results of teacher led analysis),
students will be able to develop a fuller understanding of the cultural and linguistic role of citation in their
fields of study and be much better placed to write well formed and appropriate academic texts.
NOTES
1. Readers who wish to explore this area further may wish to start off with Aston (1996) or Tribble &
Jones (1997).
2. We are grateful to one of our anonymous reviewers for the report that their undergraduate class had
analysed the uses of citations in Hyland's (2000) book and found an above-average use of
author-as-subject integral citations!
3. Potts is also the author of the article that this example is drawn from. In other words, this is a
self-citation.
4. The categories presented here are a reduced set. The categories of Example (non-integral) and the
three types of Verb-controlling (integral) citations, Research/Discourse/Other, in Thompson (2000)
have been removed to make the explanation clearer.
5. Pre-sessional assignments written by postgraduate students on the following themes: EFL in Korea /
Testing in ELT in Pakistan / Project implementation / Food industry / Agroforestry / International
management
6. E.g., Caere corporation's Omnipage Pro® or ABBYY's Fine Reader®
7. E.g., WordSmith Tools (Scott, 1996) or MonoConc Pro (Barlow, 1999)
8. A text resource of 100 million words of late C20 British English that is now available internationally
(contact http://info.ox.ac.uk/bnc/ for more information).
9. The BNC file identifiers for the texts selected are J7F / J7G / J7H / J7J / J7K / J7L / J7M / J7R / J7S /
J7T / J7U / J7V / J7W / J7X / J7Y / J80 / J81 / J82 / J83 / J84 / J85 / J86 / J87 / J88 / J89.
ABOUT THE AUTHORS
Paul Thompson is a Research Fellow at the School of Linguistics and Applied Language Studies,
Reading. He is currently studying towards a PhD, examining the language and organization of PhD theses
in different disciplines.
e-mail: p.a.thompson@reading.ac.uk
Chris Tribble is the author of Writing in the OUP teacher education series and has a long-term interest in
the use of computers in text analysis and language description, written communication, and evaluation in
education. He lives in Sri Lanka and lectures on the MA at King's College, London University.
e-mail: ctribble@sri.lanka.net
REFERENCES
Aston, G. (1996). Corpora in language pedagogy: matching theory and practice. In G. Cook & B.
Seidlhofer (Eds.), Principle and practice in applied linguistics: Studies in honour of HG Widdowson (pp.
257-270). Oxford, UK: Oxford University Press.
Barlow, M. (1999). MonoConc Pro [computer software]. Houston, TX: Athelstan.
Bazerman, C. (1994). Constructing experience. Carbondale: Southern Illinois University Press.
Becher, A. (1989). Academic tribes and territories. Milton Keynes, UK: Open University Press.
Borg, E. (2000). Citation practices in academic writing. In P. Thompson (Ed.), Patterns and perspectives:
Insights for EAP writing practice (pp. 14-25). Reading, UK: CALS, The University of Reading.
Britton, J., Burgess, T., Martin, N., McLeod, A., & Rosen, H. (1975). The development of writing
abilities, 11-18. London: Macmillan.
Burnard, L., & McEnery, T. (Eds.). (2000). Rethinking language pedagogy from a corpus perspective:
Papers from the third international conference on teaching and language corpora (Lodz Studies in
Language). Hamburg, Germany: Peter Lang.
Campbell, C. (1990). Writing with others' words: Using background reading text in academic
compositions. In B. Kroll (Ed.), Second language writing: Research insights for the classroom.
Cambridge, UK: Cambridge University Press.
Connor, U. (1996). Contrastive rhetoric: Cross-cultural aspects of second language writing. Cambridge,
UK: Cambridge University Press.
Fox, H. (1994). Listening to the world: Cultural issues in academic writing. Urbana, IL: National Council
of Teachers of English.
Granger, S. (Ed.). (1998). Learner language on computer. Harlow, UK: Longman.
Groom, N. (2000). Attribution and averral revisited: Three perspectives on manifest intertextuality in
academic writing. In P. Thompson (Ed.), Patterns and perspectives: Insights for EAP writing practice
(pp. 15-26). Reading, UK: CALS, The University of Reading.
Hanania, E., & Akhtar, K. (1985). Verb form and rhetorical function in science writing: A study of MS
theses in Biology, Chemistry and Physics. ESP Journal, 4(1), 45-58.
Hill, S., Soppelsa, B., & West, G. (1982). Teaching ESL students to read and write experimental research
papers. TESOL Quarterly, 16, 333-347.
Hyland, K. (1999). Academic attribution: Citation and the construction of disciplinary knowledge.
Applied Linguistics, 20(3), 341-367.
Hyland, K. (2000). Disciplinary discourses: Social interactions in academic writing. Harlow, UK:
Longman.
Johns, A. (1997). Text, role and context. Cambridge, UK: Cambridge University Press.
Jordan, R. R. (1992). Academic writing course. London: Nelson.
Malcolm, L. (1987). What rules govern tense usage in scientific articles? English for Specific Purposes, 6,
31-44.
Pennycook, A. (1996). Borrowing others' words: Text, ownership, memory, and plagiarism. TESOL
Quarterly, 30(2), 201-230.
Pickard, V. (1995). Citing previous writers: what can we say instead of "say"? Hongkong Papers in
Linguistics and Language Teaching, 18, 89-102.
Scott, M. (1996). WordSmith Tools. Oxford, UK: Oxford University Press.
Shaw, P. (1992). Reasons for the correlation of voice, tense, and sentence function in reporting verbs.
Applied Linguistics, 13(3), 302-319.
Swales, J. M. (1981). Aspects of article introductions. Birmingham, UK: Aston University Languages
Study Unit.
Swales, J. M. (1986). Citation analysis and discourse analysis. Applied Linguistics, 7(1), 39-56.
Swales, J. M. (1990). Genre analysis: English in academic and research settings. Cambridge, UK:
Cambridge University Press.
Swales, J., & Feak, C. (1994). Academic writing for graduate students. Ann Arbor, MI: University of
Michigan Press.
Thompson, P. (2000). Citation practices in PhD theses. In L. Burnard & T. McEnery (Eds.), Rethinking
language pedagogy from a corpus perspective. Frankfurt: Peter Lang.
Tribble, C., & Jones, G. (1997). Concordances in the classroom: A resource book for teachers. Houston,
TX: Athelstan.
Tribble, C. (2001). Corpora and corpus analysis: New windows on academic writing. In J. Flowerdew
(Ed.), Academic discourse. Harlow, UK: Addison Wesley Longman.
Trzeciak, J., & McKay, S. (1994). Study skills for academic writing. Hemel Hempstead, UK: Phoenix
ELT.
Weissberg, R., & Buker, S. (1990). Writing up research: Experimental research report writing for
students of English. Englewood Cliffs, NJ: Prentice Hall Regents.
Language Learning & Technology
http://llt.msu.edu/vol5num3/curado/
September 2001, Vol. 5, Num. 3
pp. 106-129
LEXICAL BEHAVIOUR IN ACADEMIC AND TECHNICAL CORPORA:
IMPLICATIONS FOR ESP DEVELOPMENT
Alejandro Curado Fuentes
University of Extremadura, Spain
ABSTRACT
Lexical approaches to Academic and Technical English have been well documented by
scholars from as early as Cowie (1978). More recent work demonstrates how computer
technology can assist in the effective analysis of corpus-based data (Cowie, 1998;
Pedersen, 1995; Scott, 2000). For teaching purposes, this recent research has shown that
the distinction between common coreness and diversity is a crucial issue. This paper
outlines a way of dealing with vocabulary in English for Academic Purposes (EAP)
instruction in the light of insights provided by empirical observation. Focusing mainly on
collocation in the context of English for Specific Purposes (ESP), and, more precisely,
within English for Information Science and Technology, we show how the results of the
contrastive study of lexical items in small specific corpora can become the basis for
teaching / learning ESP at the tertiary level. In the process of this study, an account is
given of the functions of academic and technical lexis, aspects of keywords and word
frequency are defined, and the value of corpus-derived collocation information is
demonstrated for the specific textual environment.
INTRODUCTION
The areas of English for Specific Purposes (ESP) and corpus-based lexical studies seem to converge in
the study of terminology (cf. Pedersen, 1995). The main aim in terminology studies is to create
specialised dictionaries that reflect knowledge fields and concepts where these are related to the property
of lexical use restriction.1 In the textual collections, collocations play an essential role in the description
of this specific language usage (Pedersen, 1995, p. 61). In this sense, word combinations work as building
blocks that increase the learner's potential to command special languages.
However, the results of technical collocation studies have little to offer students for academic
performance and achievement: that is, they do not help learners meet the "stylistic expectations of the
academic community" (Cowie, 1998, p. 12). This is because, in addition to the specialised
terminology, other types of combinations greatly influence the ESP learning context: for
example, seek the objective, consider my suggestion, the theory is canvassed, argue rather less
vehemently, and many other examples of academic discourse (Cowie, 1978, p. 132).
Our approach is based precisely on the distinction between technical and academic word behaviour. We
are influenced by lexicographic work in which this double perspective is exploited (e.g., Lozano Palacios,
1999), and in which general academic vocabulary is distinguished from more specific word use.
Lexical levels or categories are fostered and described through the application of corpus-based studies.
The design of a suitable corpus is of prime importance so that lexical profiles can be developed effectively.
This means that aspects such as size, type, balance, and integration of texts must be defined within the
scope of ESP. In this line of work, small representative corpora are favoured for specific purposes
(Tribble, 1997, p. 116).
Copyright © 2001, ISSN 1094-3501
In addition, an electronic concordancer such as WordSmith Tools (Scott, 1996) is useful for handling
small text collections (Tribble, 2000). This includes dealing with differences between one given genre
and the reference corpus, or between one specific theme and the overall body of subject texts (Scott,
1997). The results obtained are Keywords, which signal the "aboutness" of the texts (Scott, 2000), and
are therefore the primary focus when measuring restricted language. General word usage, in contrast, is
derived from lexical surveys across subject boundaries. These are examined through critical concordance
data, also known as KWIC -- Key Word In Context.
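To make the KWIC format concrete, the following is a minimal sketch of a concordance routine. It is not how WordSmith Tools is implemented; the function name `kwic`, the `width` parameter, and the two-sentence sample text are assumptions introduced purely for illustration.

```python
import re

def kwic(text, node, width=30):
    """Return one concordance line per occurrence of `node` in `text`.

    `node` is matched as a whole word or phrase, ignoring case.
    `width` is the number of characters of context kept on each side
    (a hypothetical default, chosen only for readability).
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the node lines up vertically.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented sample text, not taken from the corpus described in the article.
sample = ("Digital libraries provide access to documents. "
          "Servers also provide access to remote databases.")

for line in kwic(sample, "provide access to"):
    print(line)
```

Aligning the node in a fixed column is what lets a reader scan the left and right contexts for recurrent patterns, which is the purpose of the concordance data discussed above.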
With these notions in mind, particular subject areas are represented by specific corpora. The size and type
of the sources can vary, depending on how similar or different the topics are. For instance, related
disciplines within the broad domain of Health Sciences can be grouped together (e.g., Nursing,
Occupational Therapy, Medicine), because they share knowledge fields. Yet, their organisation and
distribution in a specific corpus may present thematic variables due to emphasis on a given branch alone
(e.g., Sports Medicine).
These selection principles are conceived according to interests and priorities in university programs and
syllabi. In this respect, the domain selected in our research includes some current Information Science and
Technology areas, such as Computer Science and Engineering, Optical and Radio Communications,
Librarianship and Information Management, and Audio-visual Communications. These degrees are the
main headings of our subject area sub-corpora; they are also majors that have been recently incorporated
at our university (1995 - 2001).2
Because changes take place very rapidly in these disciplines, the texts in the corpus should be
regularly updated. A five-year time margin is recommended by some of our colleagues as a suitable
renewal period. This suggests that we select, for instance, academic textbooks and research articles that
have been published recently. In addition, information obtained from the Internet is favoured, since such
material also tends to be up-to-date. This technical material is assessed for its usefulness, not only for
university studies, but also for future careers where instructions are mostly read in English. As a result,
the selection of the sources is made according to two chief principles: the importance of academic
readings for tertiary level education and the consideration of technical material for both college and work
situations.
The principal objective of this paper is the classification of different lexical categories in English for
Information Science and Technology. In this respect, the basis or point of departure is a lexical common
core, described in contrast with the diversity of word use. Keywords and word frequency constitute the
basic tools for working with this language variation. Collocation information is the main means for
observing these linguistic traits in our context. The notion of collocation pervades this analysis of
technical and academic constructions in ESP development.
METHODOLOGICAL ISSUES
From our viewpoint, the examination of lexical data in small corpora is related to the analysis of specific
purpose languages. This relationship motivates our selection and arrangement of sources according to two
main factors: ESP focal points (internal) and contextual conditions (external).
Under the first parameter, texts are updated in terms of subject matter. Dudley-Evans and St John (1998,
p. 99) claim that this search for novelty is crucial in ESP; the aim is that language reflects current issues
in Science and Technology, where the tendency is for "carrier content" to "date rapidly" (p. 174).
A second internal factor is that material be authentic. This means that texts should be required or
recommended in university courses (James, 1994), and that different genres should be included (Conrad,
1996). For instance, a primary or introductory stage involves textbook discourse -- aimed at fulfilling
learning demands in first and second year university studies (Johns, 1997, p. 46). Then, technical writing
in reports contains appropriate language for intermediate levels (Bergenholtz & Tarp, 1995, p. 19).
Likewise, research papers and articles tend to meet the advanced needs of research students (Brennan &
van Naerssen, 1989, p. 202).
The third internal point in our methodology considers textual availability. A priority is that texts are
managed electronically. Documentation in electronic format is needed for concordance procedures.
Therefore, with the increased production of texts in this manner, students can become genuine users of
corpus resources for learning purposes (Johns, 1993). ESP instructors can then work as supervisors of
learner-centred reading tasks. The design of the corpus can be carried out by both instructor and learner
alike; the former directs operations according to language interests, and the latter contributes special
interest topics in the subject area.
Criteria outside the ESP learning context are also applied. These external conditions influence corpus
selection because study programs and syllabi must be accounted for as relevant to subject matter. In our
institution and similar centres in Spain, they offer guidance for the arrangement of the sources. A
contrastive examination of university curricula is encouraged to identify common subjects, taught in more
than one of the four disciplines mentioned: Computer Science and Engineering, Optical and Radio
Communications, Librarianship and Information Management, and Audio-visual Communication. Shared
fields are labeled as subject categories in Table 1, according to the data derived from Information Science
and Technology study programs.3
Table 1. Subjects Shared by Disciplines
A    Computer Science/Engineering and Optical/Radio Communications
A1   History of computers, Hardware, Software
A2   Computer engineering and architecture, Data communications and Client-server architecture
B    Librarianship/Information Management, Computer Science/Engineering and Optical/Radio Communications
B1   Information units management
B2   Online database systems, Computer systems
B3   Automated Knowledge-based systems
C    Librarianship/Information Management and Audio-Visual Communication
C1   Content analysis
C2   Media documentation
C3   Documentation Legislation
D    Optical/Radio Communications and Audio-Visual Communication
D1   Media technology
D2   Media theory
E    Librarianship/Information Management, Optical/Radio Communications, and Audio-Visual Communication
E1   Communication Theory
F    All Four Disciplines
F1   Perspectives on Information
F2   UNIX / Internet
F3   HTML, SGML, TEI
F4   Hypertext technology
F5   Electronic publishing
F6   Information infrastructure
Texts are selected according to their relevance in the subjects -- A1 to F6 (Table 1). The reading material
is either offered in the courses, or recommended by content instructors. For example, textbook chapters
on the history of computers, hardware, and software (label A1) are part of the book Computer Language
(Díaz & Jones, 1999), suggested as further reading in the named introductory course for Optical/Radio
Communication students. In contrast, a research article like "The Audience as Reader" (Callev, 2000)
serves as reference material in Content Analysis (subject C1); it provides technical reading that is
helpful for project reports in both Audio-Visual Communication and Librarianship/Information
Management studies.
The inclusion of different academic genres balances the corpus. The goal is to provide a representative
collection of the subject areas in the learning context of our institution, where specific language
competence is mainly demanded for reading and writing. Thus, questions about lexical features are
addressed by using the right corpus (Biber, Conrad, & Reppen, 1998). Figure 1 illustrates how our
sources can be balanced according to genre and subject area synchronisation.
Figure 1. Distribution of sources in corpus
C.S. = Computer Science / Engineering
I.S. = Information Science (Librarianship / Information management)
Tel = Telecommunications (Optical / Radio Communications)
A.Com = Audio-visual Communication
All = All four disciplines
RAs = Research Articles
TXs = Textbooks
RPs = Technical Reports
The disciplines serve as reference for textual selection. According to this notion, each of the four areas
includes an equal number of sources in each genre. Ten research articles, for example, deal with
Computer Science / Engineering topics, drawn from bibliography lists in this discipline. However, the
concepts are also examined in other study programs, such as Optical / Radio Communications. The same
applies to the other cases, where university curricula provide feedback about reading requirements; these
are double-checked by following programs and consulting colleagues in the subject areas.4
Figure 1 presents an additional set of sources: five textbook excerpts and six technical reports. These deal
with the field of Business Technology, which appears as common core in all the subject areas. It cannot
be distinguished as predominant within one single domain; on the contrary, it is a complementary
part of all the different areas. Its importance derives not only from study programs, but also from the current
Spanish job market. For instance, a report entitled "The Do's and Don'ts of Technology Planning" (FECT
& NECC Conference, 1999) summarises Information Infrastructure issues (category F6, Table 1), which
are commonplace in careers related to Information Science and Technology.5
The overall corpus does not exceed one million words. The purpose of this limit is to attain specificity. In
this sense, a reduced size demands a precise representation of the specialised language. Figure 2 shows
the number of running words (tokens) and distinctive items (types) for each of our genre sub-corpora.
Standardised ratios (types per 1,000 words) are also contrasted.
Figure 2. Word distribution
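The standardised ratio mentioned here can be illustrated with a short sketch. It assumes that standardisation means counting the distinct types in each consecutive block of 1,000 running words and averaging over the full blocks; that reading, the function name `standardised_types`, and the toy token list are assumptions for illustration, not a description of the figures actually reported.

```python
def standardised_types(tokens, chunk_size=1000):
    """Mean number of distinct types per full chunk of `chunk_size` tokens.

    Any incomplete final chunk is ignored, so texts of different lengths
    are compared on equal terms.
    """
    chunks = [
        tokens[start:start + chunk_size]
        for start in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not chunks:
        raise ValueError("the text is shorter than one full chunk")
    return sum(len(set(chunk)) for chunk in chunks) / len(chunks)

# Toy data with a small chunk size so that the arithmetic is easy to follow:
# chunk 1 = a b a c (3 types), chunk 2 = d d d d (1 type), leftover "x" ignored.
toy_tokens = ["a", "b", "a", "c", "d", "d", "d", "d", "x"]
print(standardised_types(toy_tokens, chunk_size=4))  # 2.0
```

Averaging over equal-sized blocks matters because a raw type/token ratio falls as a text grows, which would make sub-corpora of different sizes hard to compare.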
WordSmith Tools provides the basic functions -- Keywords and Collocates -- which perform likelihood
tests and Mutual Information measurements. These are made on the corpus to generate a quantitative view
of lexical behaviour (cf. Ooi, 1998). Wordlists, another main feature, constitute the cornerstone on which
to start the gathering of data. By cross-tabulating wordlists, keywords are obtained. A given sub-corpus
(e.g., a subject category in Table 1) is contrasted with the overall reference corpus. The resulting group of
words tends to be rather descriptive of the context aimed at. In this respect, the relationship between
lexical items and text seems to be bi-directional, as words serve to identify context, and this, in turn,
influences the particular bonding of elements.
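The cross-tabulation of wordlists can be sketched as follows. The sketch uses the log-likelihood statistic in its common two-corpus form; whether this matches the exact settings used in WordSmith Tools for this study is an assumption. The function names, the 3.84 threshold (the chi-square critical value for p < 0.05 with one degree of freedom), and the toy word counts are likewise introduced for illustration only.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring `a` times in a study corpus of
    `c` tokens and `b` times in a reference corpus of `d` tokens."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_study)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total

def positive_keywords(study_tokens, reference_tokens, min_ll=3.84):
    """Words relatively more frequent in the study corpus, ranked by score."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = sum(study.values()), sum(ref.values())
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:  # keep only words over-represented in the study corpus
            score = log_likelihood(a, b, c, d)
            if score >= min_ll:
                scored.append((word, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented counts: a tiny "subject" wordlist against a larger reference list.
study = ["segment"] * 20 + ["the"] * 80
reference = ["the"] * 900 + ["library"] * 100
print(positive_keywords(study, reference))
```

In this toy case only segment is returned: the is frequent in both lists but is not over-represented in the study list, which is the sense in which keywords describe the context rather than simply listing the most frequent words.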
The results derived from this type of analysis are offered in the following section. The measurement of
the data is carried out to observe lexical patterns, and, thus, a convenient classification of words can be
made. Then, in the discussion, the significance of the data is assessed for ESP development.
LEXICAL RESULTS
Lexical findings are examined in context. This means that linguistic input is obtained by observing word
combinations that are meaningful in the subject and genre domains.
We use concordances to reflect the significance of lexical patterns in specific contexts (Firth, 1957;
Halliday, 1966); this implementation constitutes the basis of our work. The contrastive view of the data
provides the necessary conditions to check lexical diversity and uniformity in the corpus. The aim is to
describe genre and subject matter variables. For this analysis, the operations are ordered as follows:
observing, measuring, and classifying lexical data.
To illustrate this analytical procedure, consider the cluster provide access to, which is used
extensively throughout the corpus. We observe its presence in all the genres and in several subjects, and
thus measure its frequency and dispersion in the whole corpus. This is done to make sure that it occurs
significantly as a general expression. In this respect, the assertion made about its classification is based on
empirical factors.
Unbiased data are those based on lexical behaviour in context (KWIC), which can reveal how common a
given expression really is. We determine that the relationship between the number of concordance lines
and the number of different sources contained in those lines indicates the type of lexical item. In this sense, the
example provide access to is analysed according to a cut-off ratio of 0.3 (different sources per concordance
line), meaning that, at or above that level, its occurrence is considered common core in our texts, and,
consequently, free and general. In other words, for every 10 lines of
concordance text, at least three different sources must be involved.
In addition, we consider that the three genres should be present in the total concordance, and that at least
six different subjects must be included. These numbers are considered reliable, since our corpus is not
very large. We believe, in fact, that 95 texts, 17 subjects, and three genres are low numbers in comparison
with bigger corpora, and that 0.3 is an appropriate threshold as a result.
As an example of this computation, consider the collocation directory service operations. It is recorded
as key within subject domain A -- belonging to the areas of Computer Science and Engineering and
Optical/Radio Communication. It appears 70 times but only in three texts, yielding a ratio of about 0.04
(3/70). We thus regard this lexical manifestation as specific and restricted, in contrast to the free and
general case of provide access to. Directory service operations actually behaves as a specialised
collocation, in agreement with Pedersen (1995), and, as such, tends to form complex nominal compounds
(see also Varantola, 1984).
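The computation described above amounts to dividing the number of different sources by the number of concordance lines: three sources in ten lines gives 0.3, and three sources in 70 lines gives about 0.04. A minimal sketch follows. The function names are hypothetical, treating the cut-off as "at or above" follows the "at least three different sources" wording, and the genre and subject counts passed in the last call are invented because the text does not report them for this collocation.

```python
def dispersion_ratio(n_sources, n_lines):
    """Different sources per concordance line."""
    if n_lines <= 0:
        raise ValueError("at least one concordance line is required")
    return n_sources / n_lines

def is_common_core(n_sources, n_lines, n_genres, n_subjects,
                   min_ratio=0.3, min_genres=3, min_subjects=6):
    """Apply the three conditions stated in the text: source/line ratio,
    presence in all three genres, and at least six subjects."""
    return (dispersion_ratio(n_sources, n_lines) >= min_ratio
            and n_genres >= min_genres
            and n_subjects >= min_subjects)

# directory service operations: 70 occurrences in three texts (from the text).
print(round(dispersion_ratio(3, 70), 2))  # 0.04

# Boundary case: three sources in ten lines, three genres, six subjects.
print(is_common_core(3, 10, n_genres=3, n_subjects=6))  # True

# Genre and subject counts here are hypothetical placeholders.
print(is_common_core(3, 70, n_genres=1, n_subjects=1))  # False
```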
Finally, lexical elements that have a high frequency in the corpus, but are predominant within one single
genre, also deserve attention. They tend to operate as restricted word combinations, but do not denote
technical or specialised meanings. Instead, they form compounds of a semi-technical type. An example is
your program directory, appearing in subjects A1, A2, B1, D1, F1, and F4 (see Table 1), and in 14
different sources. However, only the genre of technical reports contains these instances. This specificity
makes these constructions genre-based.
Three main lexical sets thus constitute the object of our study in the results: general elements, specific
items in defined contexts, and genre-based constructions.
General Elements
Detailed Consistency Lists (DCL) are made available through the concordancer. These are wordlists
arranged according to the contrast of frequencies in different domains (e.g., in genres). For the listing of
general academic items, they prove to be rather useful. In our corpus of Information Science and
Technology sources, the DCL is considered an academic word list. It is similar to Coxhead's (1998), since
it presents input that can become quite relevant for English for General Academic Purposes (EGAP).
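A list of this kind can be approximated with a small sketch that tabulates each word's frequency per genre and ranks words by how many genres they occur in. The genre labels RAs, TXs, and RPs follow the legend of Figure 1; the function name, the ranking rule, and the toy token lists are assumptions for illustration and do not reproduce WordSmith's own output format.

```python
from collections import Counter

def consistency_list(corpora):
    """Build rows of (word, total frequency, number of genres, per-genre counts).

    `corpora` maps a genre label to a list of tokens. Rows are sorted by the
    number of genres a word appears in, then by total frequency, then by word.
    """
    counts = {genre: Counter(tokens) for genre, tokens in corpora.items()}
    vocabulary = set().union(*counts.values())
    rows = []
    for word in vocabulary:
        per_genre = {genre: counts[genre][word] for genre in corpora}
        n_genres = sum(1 for freq in per_genre.values() if freq > 0)
        rows.append((word, sum(per_genre.values()), n_genres, per_genre))
    rows.sort(key=lambda row: (-row[2], -row[1], row[0]))
    return rows

# Invented token lists standing in for the three genre sub-corpora.
toy_corpora = {
    "RAs": ["defined", "data", "data"],
    "TXs": ["defined", "program"],
    "RPs": ["defined", "data"],
}
for word, total, n_genres, per_genre in consistency_list(toy_corpora):
    print(word, total, n_genres, per_genre)
```

Ranking by the number of genres first reflects the emphasis on dispersion: a word found in all three genres is listed above a more frequent word confined to one.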
Most of the lexical data in the DCL consists of verbs and nouns, followed, to a lesser degree, by adjectives
and adverbs. An important feature of academic language is that there are more verbs in the past tense and
past participle (e.g., defined, conceived, designed) than there are present or gerund forms. The same
happens, in fact, in Coxhead (1998). In the case of nouns, many correspond to common scientific-technical instances, such as information, data, Web, HTML, and computer.
Free word combinations result from examining the DCL. For example, the forms associated with the noun
information are widespread throughout the corpus. They are analysed as free collocations, appearing in
contexts that vary significantly in terms of subject matter. They are thus considered semi-technical
elements. Some examples are information system, information technology, digital information, and
information about.
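The strength of combinations such as information system can be gauged with the Mutual Information measurement mentioned in the methodology. The sketch below uses the standard pointwise formula on adjacent word pairs only; the concordancer's own calculation (for example, its collocation span) may differ, and the function name and the toy sentence are assumptions for illustration.

```python
import math
from collections import Counter

def mutual_information(tokens, node, collocate):
    """Pointwise mutual information (base 2) for `collocate` occurring
    immediately after `node` in `tokens`."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    joint = bigrams[(node, collocate)]
    if joint == 0:
        return float("-inf")  # the pair never occurs together
    return math.log2(joint * n / (unigrams[node] * unigrams[collocate]))

# Invented token sequence built from the combinations named in the text.
toy = ("information system and information technology use "
       "digital information about the system").split()

print(mutual_information(toy, "information", "system"))
print(mutual_information(toy, "system", "digital"))
```

A positive score means the pair co-occurs more often than the individual frequencies of the two words would predict. Mutual information is known to inflate scores for rare words, so it is usually read alongside raw frequency and dispersion.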
In addition to these frequent items, lexical elements found at the bottom of the DCL are likewise
important. Despite having a low frequency, they exhibit contextual significance in their behaviour. An
example is the inflected form coined. All six occurrences of this item denote academic use. The pattern of
Verb + Noun in the expression coin + the term surfaces in this sense. It is declared academic because of
its high degree of dispersion, as it shows up in different genres and subjects. These contexts are a
textbook and a research article on information management units, another article on perspectives on
information, and two technical reports on information infrastructure.
As a result, the term coined is judged important in the DCL, not only because of its wide range of use, but
also because of its collocational strength, which denotes a great degree of idiomaticity (Stubbs, 1995). The diversity of
contexts in which it is included makes it idiomatic. In addition, it is the only form in its lexical family
appearing in all three genres and in more than one subject area. From this perspective, general academic
terms can be either frequent or sparse, but they must always present a noteworthy dispersion.
Other examples of low frequency items are the following: where this technology excels, imported into,
select / edit / paste, cable hooked into the, was instrumental in + verb-ing, first and foremost, diskette
drive, compounded by the, relative autonomy, and ticket booth.
They all share the property of general academic vocabulary. A string like was instrumental in + verb-ing,
or the cluster compounded by the, functions as common core in our setting. The same applies to noun
collocations such as diskette drive or ticket booth. They receive the same treatment, in this respect, as
frequent word combinations, and match in importance the ones mentioned above, for example,
information system, information technology, digital information, and so forth.
Among the examples of low frequency words, however, a distinct type of item emerges, and an
alternative approach is needed for its classification within the general academic vocabulary. These are the
so-called lexical phrases -- for example, first and foremost. They tend to behave as procedural items in
our context, being closely related to academic use (Stotsky, 1983; Thurstun & Candlin, 1998).6 They also
appear in a wide variety of contexts, functioning as grammatical and discourse markers. Their procedural
status derives from the signposting effect which they demonstrate in the texts. This characteristic is
analysed as a rhetorical marking of functions and techniques. For example, they may indicate interaction
with the reader, a reference to the text itself or to the investigation carried out.
How these items manifest different rhetorical uses can be checked, for instance, with the behaviour of the
preposition by. Its variation is made plain by a contrastive view of the corpus. The preposition is seen to
mark the conventional agent in many passive clauses, for example, claims made by the text, but it
can also serve as a highly frequent instrumental device, for example, by means of. In addition, it is
commonly used in classification statements such as used by location and used by subject. Finally, it is
often included in descriptive phrases like characterized by and defined by.
This wide range of rhetorical expressions also affects content words. Nouns, for instance, are used in
common clauses like make use of. These noun expressions appear in all three genres. Adverbs can also
function in this way. Some examples are more likely and more appropriately, extensively produced in our
corpus. Despite this inclusion of content words, grammatical items, such as the combinations with by
mentioned above, prove to be the most widespread type of rhetorical device.
Specific Items in Defined Contexts
The function of Keywords in the electronic concordancer provides the means needed to describe
terminology according to specific textual segments. This procedure is carried out with a given group of
sources selected by subject. For example, topic A1 (History of Computers, Hardware and Software; see
Table 1) is compared with the entire corpus, and word frequencies in both subject and reference
collections are cross-tabulated. The resulting keywords are relevant not only in terms of frequency, but
also in terms of textual dispersion. Thus, items like Multics, segment, Minix, bit, ring, segments, and ATM capture
the thematic essence of category A1.
This list of keywords demonstrates the importance of "key-ness" scores in WordSmith Tools. The
percentages in this measurement indicate positive and negative keywords. A "key-ness" level above 25%,
in this sense, contains words that are pivotal as subject items ("positive keywords" [Scott, 1997]). Their
identification is made in defined contexts such as thematic sub-corpora, as these terms concentrate the
"aboutness of the texts" (Scott, 2000). Most of these words are nouns, combining as compounds that
weigh heavily on field specialism, and they operate in restricted domains as subject descriptors.
Reference to specific notions is observed in system project manager revision, automation project
manager acceptance, or project manager report. They are examples of key combinations derived from
the collocation project manager, which has a high "key-ness" score in the topic of Automated
Knowledge-Based Systems (heading B3 in Table 1). There are several instances of these long nominal
compounds in our subject texts, which leads us to consider that the longer the noun compound, the more
restrictively it tends to operate in the subject area (Pedersen, 1995).
Specific lexical structures thus reflect technical use, although this is not clear in all cases. For instance,
the noun library is the top keyword in texts about Librarianship. It collocates with nouns that do not
present any semantic complexity, as the instances virtual library staff, connectivity on the library, and
public library community prove. Within subject F6 on Information Infrastructure, in fact, these elements
specify the procedure of electronic information organisation, but do not offer much comprehension
difficulty.
In general, most keywords tend to sum up the thematic content of the texts. Several are actually
quite descriptive at first sight, such as images and media in the area of Document Content Analysis (C1),
or copyright and contractor in Document Legislation (C3).
Keywords can also be obtained in the contrastive analysis of two or more disciplines. In this case, they
originate from two lists of subjects, for example, A1 (History of Computers, Hardware and Software) and
A2 (Computer Engineering and Architecture, Data Communications, and Client-server Architecture). The
findings are then identified as broader in scope (for example, applicable to both Computer Science and
Optical / Radio Communications), presenting, as a result, a less restricted subject-based pattern. Some
examples in this thematic group (A1, A2) include bits, hardware, directory, IP, software, and PC. These
items result from contrasting the smaller, theme-restricted context with the overall corpus, as pointed out
above.
The data prove to be crucial for the constitution of lexical profiles in the texts. A similar deduction is
made by working with the four separate areas, that is, Computer Science / Engineering, Optical / Radio
Communications, Librarianship / Information Management, and Audio-visual Communication. The
lexical information analysed in this case can be valuable as a guide to specialised language, much like
dictionaries and other lexicographic material are (e.g., Collin's Dictionary of Computing, 1999, or the
TERMITE Database of Telecommunications, 1999).
In this respect, our data can be contrasted with authoritative sources to check similarity / variation
features. For instance, in the case of the form abandoned, examined in Collin's Dictionary of Information
Technology (1997), the phrase abandoned the spreadsheet is given. In our sources, the clause the code
had to be abandoned is similar to that example. This contrastive view is highly recommended from our
perspective, due to the fact that the dictionaries and glossaries handled are recent. They therefore provide
updated material for linguistic analysis.
Table 2 displays the top three words surveyed in this way. The feedback conveys meaningful discipline-based
content, but also diversity of lexical use.
Table 2. Main Items in Discipline-Based Sub-Corpora

Computer Science   Librarianship/Information   Optical/Radio     Audio-Visual
                   Management                  Communications    Communication
Program            Library                     Network           Digital
Data               Information                 System            Media
System             Use                         Data              Server

Two large corpora are included in this cross-examination: The first two columns (from left to right)
correspond to James (1994) and Lozano Palacios (1999). The two sources offer ranked positions of
words, which is quite useful for academic purposes, since, as in Coxhead (1998) above, the most frequent
verbs and nouns combine critically.
For instance, Lozano Palacios (1999) reports on Verb + Noun and Noun + Noun patterns. The items are
deemed as essential academic data. The clusters provide + access to, and data collection + techniques,
instruments, methods are two relevant examples. The former is considered a general academic expression
in our corpus, according to the description of the section "General Elements." The latter is specific and
subject-based, appearing more frequently within the F domain. However, the compound data collection +
Noun is not evaluated as strictly technical, mainly due to the common coreness aspect that characterises
setting F (see Table 1).
The two main aspects revised, academic and subject-based language, thus seem to merge in discipline-driven
vocabulary. The effect produced by such words seems neither common nor restricted. These items
would be found at a middle position between general and specialised vocabulary in our study.
Genre-Based Constructions
Constructions that occur across different subjects, but only in one single genre, are also accounted for.
These elements are rather frequent in various texts, much like the discipline-based items examined above.
Nevertheless, these genre-based combinations are mainly treated as specific academic language. Some
examples common in technical reports include information object, networked information services, and
information on the Web.
As mentioned in General Elements, the DCL forms the bulk of contrasted vocabulary. The three genres
are compared, and their word frequencies serve to establish measurement references. The items identified
as relevant in these word lists have a high frequency in one genre alone, and, in contrast, very low usage
in the other two contexts. In addition, a Keyword analysis is carried out on the top words of the genre lists
in order to check that the lexical items are actually distinctive in their genre categories. The results are
classified from most to least typical in terms of genre description; the former operate as positive
keywords, and the latter as negative.
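The keyword check described above can be illustrated with a minimal sketch. It assumes the log-likelihood statistic, one of the measures commonly used for keyness; the article does not state which statistic was applied, and the function names and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a study corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens):
    """Return (word, signed score) pairs, most positive first.

    A positive score marks a word that is relatively more frequent in the
    study corpus (a positive keyword); a negative score marks a word that
    is relatively less frequent there (a negative keyword).
    """
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word in set(study) | set(ref):
        a, b = study[word], ref[word]
        score = log_likelihood(a, b, c, d)
        if a / c < b / d:
            score = -score
        scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): one genre list against the other genres.
textbooks = "the requirement is that each requirement is met".split()
other_genres = "the library serves the library users and the project".split()
scores = dict(keywords(textbooks, other_genres))
```

On these toy lists, requirement receives a positive score and library a negative one, which mirrors the positive/negative distinction drawn in the text.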
An example of a highly positive keyword in the genre of textbooks is requirement. Another is Semiotics.
They represent the two chief types of keywords in this environment: Widely extended across subject areas
in the first case (for example, the following requirement), and restricted to particular subjects in the case
of Semiotics within Audio-Visual Communication.
In a different genre, technical reports, the top keyword library appears quite frequently. It combines
within clusters and compounds more or less familiar, as observed in units like Cable Book Library, the
library's clientele, library program, networking the library, inter-library lending, inter-library loan, and
so forth. In contrast, the noun protection illustrates low frequency items in these reports. Despite its fewer
occurrences, important grammatical forms such as protection from and protection for can be pinpointed.
Other significant lexical items which occur with protection include fire protection, copyright protection,
protection criteria, and protection levels.
In research articles, the top item is project, apparently uncharacteristic and yet intimately linked to the
research activity. Some frequent collocates show endemic traits in this respect: project deadline, project
work, project milestone, project manager, project revision, and so on.
With the analysis of this data, effective samples are drawn to back up our claims for the next section.
Lexical information is then assessed, and implications for ESP development are reflected upon.
DISCUSSION
In the survey of results, three main divisions of lexical behaviour have been found: General academic
vocabulary that occurs widely across the corpus, with presence in a diversity of topics; elements drawn
from subject matter scrutiny, considered specialised or technical; and genre-driven findings occurring
restrictively, thus characteristic of one single genre.
In this section, our main aim is to assess this lexical co-occurrence in its context to determine validity for
teaching ESP. The process of language acquisition, in this respect, should be evaluated according to the
contextual variables analysed. We approach the description of academic and technical constructions in
our environment by evaluating their use in either a great or small number of texts. In both cases, lexical
units are influenced by subject matter and academic discourse.
Word combination significance is mainly determined through language task application. This means that
effectiveness of data is judged by specific language instruction: "To teach language for the subject
specialism," and "teaching tasks based on the specialized content" (Edwards, 1996, p. 13). This evaluation
leads to categorising lexical data as priority items for ESP and EAP courses (Jordan, 1997).
Eight types of lexical units are consequently devised. They result from the detailed revision of the data in
the previous section, and from how such items serve to fulfil specific language learning demands in our
context. They are classified as follows: common core collocations, rhetorical academic elements,
technical collocations, thematic combinations, area-based general words, area-based specific words,
genre-based academic vocabulary, and genre-based thematic words.
Common Core Collocations
A main group of lexical elements is first inferred by focusing on those words that occur commonly. This
level is measured across subject areas, which constitute a common core foreground where constructions
are used by "authors writing on similar topics" (Stotsky, 1983, p. 438). The items receive a semi-technical
treatment primarily because they are content words conceived, in agreement with Ewer (1983, p. 10), as a
"number of language items which are common to the subjects," or as the "core language."
In our scientific-technical context, this semi-technical degree derives from word behaviour registered at a
general academic stage. The elements become core combinations related to the academic context. In this
sense, they are viewed as
formal, context-independent words with a high frequency and/or wide range of
occurrence across scientific disciplines, not usually found in basic general English
courses; words with high frequency across scientific disciplines. (Farrell, 1990, p. 11)
Academic combinations function as lexical extensions of General English vocabulary in our specialised
corpus. In other words, their meaning is familiar in academic discourse and common in the Information
Science and Technology domain, since the expressions denote events and concepts that characterise this
area. This language is more general than specific because it describes notions and ideas that are
customary in the whole corpus.
As described in the Results above, common core academic elements have high frequency and dispersion
rates across sources, such as in the case of the collocations information technology and digital
information. In contrast, lexical items can show low frequencies, but the number of texts included then is
also high by comparison, and the offer of topics is diverse, for example, the aforementioned combination
coined the term. The cut-off point that distinguishes both planes is 0.3 %, as mentioned previously (three
texts for every 10 concordance lines).
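One reading of this cut-off is a ratio of distinct texts to concordance lines. The sketch below computes that ratio for a phrase; the interpretation, the name range_ratio, and the sample texts are assumptions made for illustration only.

```python
import re

# Three texts for every 10 concordance lines, expressed as a proportion
# (an assumed reading of the cut-off mentioned in the text).
CUT_OFF = 3 / 10


def range_ratio(phrase, texts):
    """texts maps a text id to its running text.

    Returns (hits, n_texts, ratio), where ratio is the number of distinct
    texts containing the phrase divided by the number of concordance lines.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = 0
    n_texts = 0
    for text in texts.values():
        count = len(pattern.findall(text))
        if count:
            hits += count
            n_texts += 1
    ratio = n_texts / hits if hits else 0.0
    return hits, n_texts, ratio


# Hypothetical mini-corpus.
sample = {
    "t1": "He coined the term hypertext.",
    "t2": "She coined the term and later coined the term again.",
    "t3": "No match here.",
}
hits, n_texts, ratio = range_ratio("coined the term", sample)
widely_dispersed = ratio >= CUT_OFF
```

Under this reading, a low-frequency item such as coined the term still counts as common core when its few occurrences are spread over many texts.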
Table 3 is an example of a general academic entry. The lemma is address (representing both verb and
noun), and its derived forms are addressed and addresses, which are also considered common academic
words in our corpus. The number of instances is provided near the entry and divided into the three genre
sub-corpus frequencies. Table 3 is organised according to frequency, from highest repetition to least; the
most repeated combination is labelled with the times it occurs (shown in brackets).
Table 3. Example of General Academic Entry in our Corpus

Entry       TXs   RPs   RAs   Combinations
ADDRESS     128   317    63   ___ space (137); the same ___; whose ___; must ___; does not ___;
                              the network ___; to ___ this; ___ the issue
ADDRESSED    47    15    36   to be ___ (14); ___ in; is ___ by; should be ___; ___ here; has ___;
                              can be ___
ADDRESSES    42    33    14   IP ___ (17)

TXs = Textbooks; RPs = Technical reports; RAs = Research articles
BOLD = lemma (most frequent item in its lexical family)
UNDERLINE = word forms (less frequent) derived from the lemma

As can be deduced, the collocations in Table 3 are based on verb forms (e.g., address + the issue) and
nouns (e.g., address + space). These are content word associations, similar to the ones that the BBI
Dictionary of English Word Combinations (Benson, Benson, & Ilson, 1997) describes. This source
actually includes common academic items from the world of Information Technology: access data,
browse the web, and so forth (Benson, Benson, & Ilson, 1997, p. vii).
The combination of grammar items and verb forms is also evaluated at this level of general academic use.
Some examples are shown in the section of addressed (Table 3), for example, addressed in and
addressed by. Since they are found in many different texts, these are regarded as general
collocations. Grammar constructions are academic collocations in this respect. The BBI Dictionary of
English Word Combinations (Benson et al., 1997) distinguishes grammar from content collocations, but,
in our case, this is not so. In our analysis, grammar combinations work at the same common core plane as
academic collocations.
We deduce our claim from the management of word lists as academic input for ESP development.
Learners are encouraged to build lexical profiles when coping with academic reading. Such a task
implies, as a matter of fact, coming to terms with the DCL in the different genres, pinpointing
constructions that are common core and typical in the texts. The collocations examined in Table 3 are
classified in the form of lexical charts, where the most frequent items are contrasted with less common
constructions.
In this line of work, most learners do not differentiate grammar from lexical combinations. This turns out
positively in our context of science and technology, since undergraduates are not used to making syntactic
observations. Actually, students tend to carry out an integrated approach, in which content items function
as main units that may or may not keep company with grammar words.
For example, given a common collocation such as address the issue (Table 3), the lack of a preposition is
learned by building collocation charts. These are exploited from both the readings and the DCL, taking
the whole corpus as reference. In a related exercise, synonyms are explored. For instance, the academic
verb cope with is examined in combination with the noun the issue. In such a comparison, students value
the collocational strength of the preposition with for the construction, in opposition to address, which
does not demand the colligation.
This drill is performed similarly with low frequency items. The main difference is that patterns are
recognised by working with a small amount of lexical information. Useful combinations are then easily
perceived, as different subjects are encompassed by that occurrence. A previous example, coined the term,
illustrates this case. Its free distribution is observed when we can detect that the phrase actually occurs in
various texts. In addition, its common coreness is reinforced by contrasting synonyms such as built the
term or constructed the term, since these are used in fewer contexts. Students can then be guided to value
the more fixed meaning of the verb coin, given the fact that synonymous combinations show a lower
dispersion rate.
Rhetorical Academic Elements
Rhetorical items also demonstrate common core relevance due to their high frequencies and distribution.
They are used as markers of cohesion in the texts, according to the Results section. They tend to convey
procedural usage, a feature that relates them to academic elements. Some of the examples mentioned are
by means of, indicating instrumentalization, and more likely, operating as a token of clarification in the
sentences.
These constructions are classified at the same level as general academic language. Their procedural status
defines them as common core, in agreement with McCarthy (1990, p. 51), and with Hutchinson and
Waters (1981, p. 65): They serve as instruments of coherence and cohesion throughout discourse. Some
procedural nouns functioning this way are the use of and the device which.
This language is analysed under the EAP umbrella, which includes EST (English for Science and
Technology). In this learning framework, comprehension activities are favoured, as they challenge
learners to cope with markers of discourse structure (Flowerdew & Miller, 1997). For academic lectures,
in fact, main and secondary ideas are discerned in the texts by exploiting these markers appropriately. For
Science and Technology discourse, learners demonstrate their comprehension of content by conceiving
appropriate rhetorical boxes (e.g., classifications, explanations, descriptions; Bygate, 1987). These are
often built as a result of the adequate interpretation of rhetorical elements.
A suitable exercise is based on the search for lexical formations containing a common grammar word.
This type of work allows for the exploitation of procedural language in our corpus. It aims to identify
vocabulary that co-occurs typically at the general academic plane. For example, measuring the
occurrences of the preposition by, as examined in the Results above, provides different semantic features
of scientific-technical discourse, for example, denoting functions as agent, instrument, and so forth.
Figure 3 demonstrates another example with the preposition within. The word is analysed in different
contexts so that learners can contrast its different meanings. This task of inducing sense depends on the
main contextual conditions found; some authors refer to this activity as semantic prosody analysis
(Stubbs, 1995). It results from the qualitative observation of common core collocations. In this case, the
expressions and word combinations convey a strong procedural meaning, since they signal the type of
discourse function being used.
WITHIN
Procedural elements in use

ELEMENTS                                                     SEMANTIC PROSODY
___ information                                              WITHIN + CATEGORY
___ software / ___ the project / ___ a commercial context    WITHIN + LOCATION
___ ( + a scheduled time)                                    WITHIN + TIME
___ headings / ___ ( + document )                            WITHIN (INSIDE)

Figure 3. Inferring procedural meaning in discourse

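The kind of search behind this exercise can be sketched as a simple concordance routine. This is an illustrative sketch, not the concordancer used in the study; the functions kwic and next_words and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def kwic(text, node, width=30):
    """Return concordance lines as (left context, node, right context)."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    return [
        (text[max(0, m.start() - width):m.start()],
         m.group(0),
         text[m.end():m.end() + width])
        for m in pattern.finditer(text)
    ]


def next_words(text, node):
    """Count the word that immediately follows each occurrence of the node."""
    pattern = re.compile(r"\b" + re.escape(node) + r"\s+(\w+)", re.IGNORECASE)
    return Counter(word.lower() for word in pattern.findall(text))


# Hypothetical sample text.
sample = ("The task must be finished within two weeks. "
          "Data stored within the project are listed within headings.")
lines = kwic(sample, "within")
followers = next_words(sample, "within")
```

Learners could then sort the concordance lines by right-hand context and group them under labels such as time, location, or category, as in Figure 3.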
Technical Collocations
The level of technicality in word behaviour is closely related to subject domain. The salient condition is
that elements function uniquely in their corresponding field, describing the restricted setting. An example
is the range of specific combinations identified with the noun network in U-network, access network,
local area network, and so forth. This is examined within the subject of Client-Server Communication
(category A2 in Table 1). The items thus allude to concepts and developments in specialised areas, and
their interpretation demands conceptual knowledge. In addition, abbreviations are often key in this
context, which is also evidence of the specific understanding that is required at this learning stage, for
example, bit ASCII, LAN distribution, and GIF and JPEG files.
Conceptually restrained, technical vocabulary is formed by collocations that introduce specialised
knowledge in ESP. The identification of this special language is made by inferring idiomatic
constructions from concordance samples. The aim is to perceive the fixation of long compounds, and to
appreciate the value of this lexical restriction in the subjects.
Figure 4 displays the technical collocations of object, a critical noun in the setting of Data
Communications (category A2 in Table 1). An important collocation like object-oriented is first
underlined by focusing on the most frequent word that goes with object in this context. Then, object-oriented
features is also marked as important, given its high co-occurrence probability. Finally, according
to the Mutual Information scores, the phrase use of object-oriented features is recorded in this technical
scope.
Figure 4. Concordance sample for the noun object in a technical setting
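As an illustration of how a Mutual Information score can be obtained, the sketch below uses a common formulation, log2 of the observed pair frequency times the corpus size over the product of the two word frequencies. The article does not give its formula or span settings, so the adjacency assumption, the function name, and the toy tokens are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(tokens, x, y):
    """Pointwise MI for y immediately following x.

    MI = log2( f(x, y) * N / (f(x) * f(y)) ), with N the number of tokens.
    Returns negative infinity when the pair never occurs.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    joint = bigrams[(x, y)]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * n / (unigrams[x] * unigrams[y]))


# Hypothetical token sequence.
tokens = ("use of object-oriented features and object-oriented design "
          "in the object model").split()
score = mutual_information(tokens, "object-oriented", "features")
```

Higher scores indicate that the two words occur together more often than their separate frequencies would predict. MI is known to favour rare pairs, so it is usually read alongside raw frequency.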
We can guide learners to explore technical terminology by encouraging this data classification. One
option is to arrange the collocations hierarchically, as follows:
OBJECT (subject: Data Communications)
Object
Object-oriented
Object-oriented features
Use of object-oriented features
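A hierarchy of this kind can be approximated automatically by growing a cluster one word at a time. The sketch below is one possible procedure, not the method used in the study: it extends the node to the left or right with whichever neighbour yields the most frequent longer cluster. The name grow_cluster, the thresholds, and the toy tokens are hypothetical, and hyphenated forms are assumed to be single tokens, so the search starts from object-oriented.

```python
from collections import Counter


def grow_cluster(tokens, node, max_len=4, min_freq=2):
    """Return [(cluster, frequency), ...] from the node to longer clusters."""
    cluster = (node,)
    levels = [(node, tokens.count(node))]
    while len(cluster) < max_len:
        n = len(cluster)
        extensions = Counter()
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == cluster:
                if i > 0:  # extend with the word on the left
                    extensions[(tokens[i - 1],) + cluster] += 1
                if i + n < len(tokens):  # extend with the word on the right
                    extensions[cluster + (tokens[i + n],)] += 1
        if not extensions:
            break
        best, freq = extensions.most_common(1)[0]
        if freq < min_freq:
            break
        cluster = best
        levels.append((" ".join(cluster), freq))
    return levels


# Hypothetical token sequence; ";" stands for a sentence boundary.
tokens = ("the use of object-oriented features helps ; "
          "the use of object-oriented features grows ; "
          "object-oriented features matter ; "
          "an object-oriented design").split()
levels = grow_cluster(tokens, "object-oriented")
```

On the toy sequence, the cluster grows from object-oriented to object-oriented features and ends at use of object-oriented features, which parallels the chart above.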
In order to determine which combinations are productive, students operate with restrictive lexical charts.
For instance, a useful type of activity is a filling-in-the-gap exercise where the ability to specify technical
items is fostered. Learners pinpoint the restricted elements of subject texts by building tables such as the
one shown in Figure 5. In this case, coming to terms with the central collocate in four different
combinations demonstrates technical language command.
Figure 5. Fill-in-the-gap exercise with technical collocations
In Figure 5, learners take subject B3 as reference for lexical work (texts about Automated Knowledge-Based
Systems). The answer to the central node can be found by working with the language of these
sources. This means that students must review concordance material and context as indicated in Figure 4.
The word in the blank, management, can be identified once the key technical input is correctly sifted.
Thematic Combinations
Semantic features are examined in technical words by inspecting the subject context. However, exploring
the field of knowledge does not always lead to the description of specialised combinations, according to
our data. There are forms of lexical behaviour, in fact, which occur critically in the thematic environment
but do not classify as technical collocations. These are content words with a less complex level of
comprehension, mainly due to their greater familiarity in the world of Information Science and
Technology. Some examples given in the Results are virtual library staff, connectivity on the library, and
public library community, included in subject F6 (Information Infrastructure).
Other examples reflect the register of a subject in a clear way. For instance, the legislation language of
category C3 is clearly revealed through key clauses such as the contractor shall and copyright law (see
Results). The constructions are either specific clusters or multi-word units that identify the subject under
analysis. Other elements can be located in the area of Content Analysis (C1), where mass media and of
the mass media operate as typical constructions within their thematic setting. In addition, like technical
collocations, these items are almost exclusive of their domain and thus seldom found in any other part of
our corpus.
At this level of thematic combinations, we also find lexical data that is characteristic of a related group of
subjects, that is, within a major heading from A to F in Table 1. As a result, the language items described
in this case are not as precise as technical collocations. For example, key combinations are analysed in the
space where Computer Science and Optical/Radio Communications meet (category A). The results refer
to computer and network issues, but do not pose much technical difficulty. Some of these items are
computer program, hardware and software, bits per second, and the interface shall provide.
The analysis is therefore based on language that is segmented according to main subject categories. The
keywords that emerge from this task are evaluated in terms of technical use (as was done in the section
Technical Collocations). In this observation, we notice that restrictive collocates are not detected. In
contrast, a wider possibility of combinations is offered. For example, the synonym computer application
is found alongside computer program in subject domain A, although the former is used less frequently.
This aspect of thematic combinations suggests activities where learners can make lexical decisions. These
are based on choices by which synonymous thematic word combinations are explored. Students are given
the freedom to decide which structures best fit the topic areas. Some possibilities are offered in Table 4
for the shared background of Computer Science and Optical/Radio Communications (letter A in Table 1).
Table 4. Investigating Synonymous Thematic Combinations

COMPUTER & TELECOMMUNICATIONS WORDS
Computer program              =
The interface shall provide   =
A string of bits              =
Possible links                =
Multiple processes            =
Piece of software             =

The method presented in Table 4 is a contrastive view with synonyms located anywhere in the corpus.
Thematic combinations are thus distinguished from common core items, such as computer program
versus computer application or the interface shall provide versus the interface provides. Lexical
evaluation is then possible by contrasting thematic constructions with general use. In this manner, the
level of specification of the former can be appreciated.
The same is applied to other textual segments, such as subject items that are not technical. The F6
category examples (virtual library staff or connectivity on the library) are assessed in this
manner, being replaced by common core options like library personnel and connecting virtual libraries.
The purpose of this work with thematic vocabulary is to value concordance data in different textual
positions. In other words, the goal is to train learners in the aspect of lexical variation, which encourages
operation according to context; this is a consistent position from our viewpoint at all levels.
Area-Based General Words
The goal at this stage is to describe how language develops within a single discipline of Information
Science and Technology. The items are familiar in all four areas, but take on a characteristic tone in a
particular one. An example is provided by the cluster provide + access to. This structure appears freely
throughout the whole corpus, but it receives greater emphasis in Librarianship and Information
Management, where its semantic prosody is revealed as provide + access to + documentation; this
behaviour is confirmed in Lozano Palacios (1999).
Area-based lexical features contribute to enhancing academic word usage. As described in the section
Common Core Collocations, common core academic elements are exploited in EGAP (English for
General Academic Purposes). The same can be done in the case of area-based general constructions, since
this input may be similarly used in EGAP courses. Such similarities at both common core and area-based
levels are contrasted by means of specialised dictionaries, glossaries, and corpora about the academic
disciplines. These sources can supply linguistic information that allows learners to pinpoint similarities
and differences.
Managing and making sense of dispersion plots, available through WordSmith Tools (Figure 6), can also
be enriching for learners. The plots signal where certain items crop up in the texts, thus challenging
students to cope with visual data. For example, the noun access manifests a high concentration in sources
dealing with the area of Librarianship. Figure 6 shows this lexical clustering in rows 1, 2, and 3,
corresponding to text files of reports (row 1) and textbooks (rows 2 and 3). The concordancer can then
disclose whether access is, in fact, used as a noun in these contexts.
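The idea behind a dispersion plot can be shown with a text-only sketch. It is not a reproduction of the WordSmith Tools display: each text is divided into equal segments, and a mark is drawn wherever the word occurs. The function name, the segment count, and the toy tokens are hypothetical.

```python
def dispersion_row(tokens, word, bins=40):
    """Return a one-line plot: '|' in segments where the word occurs,
    '.' in segments where it does not."""
    row = ["."] * bins
    for i, token in enumerate(tokens):
        if token.lower() == word:
            # Map the token position onto one of the equal-sized segments.
            row[i * bins // len(tokens)] = "|"
    return "".join(row)


# Hypothetical text in which the word clusters at the start and the end.
tokens = ["access"] + ["x"] * 8 + ["access"]
plot = dispersion_row(tokens, "access", bins=10)
```

Printing one such row per text file, one under another, gives learners a rough visual impression of where an item such as access concentrates.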
This activity should enable students to check, on their own, the high frequency of provide access to in our
corpus. The dispersion plots help them to clarify that it is actually emphasised in Librarianship;
concordance feedback does the same by allowing learners to examine the semantic prosody +
documentation. The expression is consequently conceived as a general academic expression due to its
common coreness; yet, students notice that it is more heavily used in the context of Librarianship Studies,
denoting a special meaning.
The DCL (Detailed Consistency List) of the four disciplines included in our corpus also makes area-focused
lexical use easy to perceive. Through this list, the frequency of access is seen as higher in
Librarianship / Information Management texts (see Table 5).
Figure 6. Use of dispersion plots by learners for visualisation of lexical concentration
Table 5. Availability of Frequencies in Discipline-Based DCL

N    Word     Files   Audio   Computer   Library   Telecom
68   Access   4       158     145        415       345

N = position occupied by word in DCL according to frequency and distribution in files

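A list with the same columns as Table 5 can be assembled from separate word lists. The sketch below is a simplified approximation of a detailed consistency list, not the WordSmith Tools implementation; the function name, the sort order, and the toy sub-corpora are assumptions.

```python
from collections import Counter


def detailed_consistency(subcorpora):
    """subcorpora maps a name (e.g., a discipline) to a list of tokens.

    Returns rows of (word, files, total, per-corpus counts), sorted by the
    number of sub-corpora containing the word, then by total frequency.
    """
    counts = {name: Counter(tokens) for name, tokens in subcorpora.items()}
    vocabulary = set().union(*counts.values())
    rows = []
    for word in vocabulary:
        per_corpus = {name: c[word] for name, c in counts.items()}
        files = sum(1 for freq in per_corpus.values() if freq > 0)
        rows.append((word, files, sum(per_corpus.values()), per_corpus))
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows


# Hypothetical sub-corpora.
sample = {
    "Library": "access access library".split(),
    "Computer": "access program".split(),
}
rows = detailed_consistency(sample)
```

Reading across a row shows at a glance in which discipline a word is concentrated, as with access in the Library column of Table 5.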
Area-Based Specific Words
The approach described in the section Area-Based General Words is geared towards practical work at the
academic level of ESP. The goal is to foster guidance through particular lexical fields in area-based
language. This turns out to be particularly useful in EGAP, where getting familiarised with reading
material, for instance, becomes a powerful resource. Inspection of word use at this level also demands
knowledge of specific concepts in the areas. In this sense, our study includes attention to particular lexical
groups in the areas, where learners should thus specialise. An example of these items, examined in the
Results, is the unit data collection, co-occurring with specific nouns such as techniques, instruments, and
methods. These prove to be fixed word combinations, given their high co-occurrence rate.
The focus is then placed on English for Specific Academic Purposes (ESAP) learning (cf. Jordan, 1997).
From this perspective, working with constructions for ESP development involves looking into the
concepts they name. The approach is
made as a response to specific queries regarding a subject area. For example, data collection techniques
refers to the standard means of gathering data in Librarianship.
The underlying fact is that we investigate context, in this respect, to explore concepts. The activity
requires learners to examine conceptual paragraphs (cf. Trimble, 1985) that explain notions and clarify
technicalities alluded to by the terminology. This contextual information can be exploited for task
development; it constitutes support material, for example, for research preparation, that is, doing project
reports in English.
Figure 7 presents a set of conceptual paragraphs taken from our corpus. Learners may use them for task-based
research. The excerpts are assessed according to specific learning needs. They can then serve as
complementary or illustrative material for project reports (e.g., as examples/passages to give in oral
presentations).
A range of methods were employed to analyze data from the various data collection instruments.
Quantitative data from the questionnaires, logs, training assessment, etc. were coded and entered in a
spreadsheet for analysis. The techniques used to analyze these data relied primarily on computing
averages and frequencies.
Develop and test a range of data collection instruments related to measuring the impact of Internet
connectivity. Ultimately, the evaluation aspect of the project became the means by which a final report
was developed for use by other public librarians and policymakers.
Phase 2 was intended to direct the development and administration of the various data collection
instruments: What is the value of network connectivity for rural libraries? How does the installation and
use of a network connection have impact on library staff, organization, and service provision? What
groups in the user community benefit from the network connection?
Figure 7. Instances of specific concept development in an area
Genre-Based Academic Vocabulary
This group is determined from academic discourse study. However, unlike general words in Common
Core Collocations, or area-related elements in Area-Based Specific Words, the identity of this level is
based on the conception of genre. Awareness of genre features, in this respect (cf. Jordan, 1997), is the
prerogative. This is confirmed, for instance, in the case of advanced learners who are required to perform
well in writing assignments.
An example of a lexical item in the genre focus is project, as mentioned in Results. This noun is often
used in research papers to refer to the investigation being described. In a Computer Science setting, for
instance, project is stressed in project deadline, project manager, project work, and so forth (see Results).
The items become common in the genre of research articles. A relevant activity is to contrast genre-based
instances such as these with general academic elements. The comparison aims to refine the view into
genre-focused words, while general academic items are explored in the whole corpus. Table 6 provides an
illustration of this comparative task in research articles.
GENRE-BASED (ARTICLES)    GENERAL ACADEMIC
project members           for the project
design process            the design of
search test               of the search
Table 6. Contrastive View of Research Article Items with Common Core Elements in Our Corpus
Lexical variation is thus made visible within the genre. Recognising it is beneficial for learners aiming to
develop effective writing skills. In particular, technical composition within the genre is enhanced by
means of specific genre features. Mastering these features should enable learners to adapt to the conventions of
an area like Information Science and Technology. In this regard, we favour an ESAP methodology for
genre-based academic items, while the focus is also placed on scientific-technical writing.
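The contrastive task illustrated in Table 6 can be approximated computationally. The following is a minimal sketch, not the procedure used in this study: it counts two-word sequences containing a node word in a genre subcorpus and in the whole corpus, and reports what share of each sequence falls within the genre. The function names (tokenize, bigrams_with, genre_share) and the toy sentences are invented for illustration.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bigrams_with(node, texts):
    """Count adjacent word pairs that contain the node word."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for left, right in zip(tokens, tokens[1:]):
            if node in (left, right):
                counts[(left, right)] += 1
    return counts

def genre_share(genre_counts, corpus_counts):
    """Proportion of each pair's corpus occurrences found in the genre."""
    return {pair: genre_counts[pair] / corpus_counts[pair]
            for pair in genre_counts}

# Toy data (hypothetical): research articles versus textbook chapters.
articles = [
    "The project members met the project deadline.",
    "Project members reviewed the design process.",
]
textbooks = [
    "Funding for the project was approved.",
    "Time was set aside for the project.",
]

genre_counts = bigrams_with("project", articles)
corpus_counts = bigrams_with("project", articles + textbooks)
shares = genre_share(genre_counts, corpus_counts)

for pair, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(" ".join(pair), round(share, 2))
```

In this toy sample, project members occurs only in the articles, whereas the project is spread evenly over both genres, which mirrors the kind of contrast the table presents. With real data, the thresholds for calling a sequence genre-based would have to be set by the analyst.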
Genre-Based Thematic Words
This last category also takes genre awareness as its main focus. The procedure by which this set of
items is established falls under the ESAP application. This means that specific language is exploited in
tasks designed to make genre features familiar. In addition, thematic influence reinforces the genre-based
lexical focus on academic and technical purposes.
An example mentioned in the Results is the noun Semiotics. It surfaced in textbook chapters about
Content Analysis (heading C1). Academic lectures on this subject offer language greatly influenced by
theme. A course in our institution integrates these lessons on Semiotics in Audio-visual Communication
and Librarianship/Information Management studies. The lectures encourage the elaboration of summaries
and reports, for which familiarisation with typical collocations and structures in the setting becomes
beneficial.
Learners apply their note-taking skills to listening and writing activities derived from the lectures. Figure
8 reproduces a short extract of a lecture on semiotic elements, given by an American visiting professor at
our school in 1997. Content comprehension is then tested in activities (Figure 9).
Today's topic deals with the fundamentals of all visual communication …. These are basic elements,
[pointing to the slide] … these are the compositional source of all kinds of visual materials, … for
example, the messages, the objects and … the experiences as well … In this way, … we have that the
most basic element is the dot, … which can be defined as a pointer, a marker … a marker of space … the
other element is the line …. This is an articulator of form, … that is, a design item for making a technical
plan, … so it designs the form intended, … ok; … another element that we can think of is the shape,
which is the basic outline, ….
Figure 8. Lecture excerpt on semiotic elements
1st element:
Definition:
3rd element:
Examples:
4th element:
Example:
5th element:
Contrast:
6th element:
Reference:
8th element:
Classification:
11th element:
Exemplification:
General Field of elements
General function of elements in field
Concept of understanding elements
Example
Figure 9. Example of an activity with specific subject lecture
The textbook and lecture genres are thus exploited in this course of second year Audio-visual
Communication and Librarianship students. Key lexical items are pinpointed as traits of genre-based
thematic language. Learners have the option to experience this language by both textbook reading and
lecture note-taking. Some examples are visual elements, Semiotics components, signs and codes, and the
basic element (see Figure 9).
We must clarify that these items are not restricted to one given genre. In other words, not only general
elements but also specific items may appear in other genres. However, the words are more descriptive of
the context being dealt with. For instance, the data explored in the course (Figures 8 and 9) reflects the
typical language of the Semiotics subject, expounded through lectures and textbooks. The concepts
needed in that content lead to seeking these specific items, belonging to the mentioned topic of Semiotics
and to no other.
The integration of genre and subject fosters content-based instruction, an important point in ESP learning.
The approach focuses on corpus material, developed with different educational levels in mind. An
example is that of learners in second year courses of Librarianship and Audio-visual Communication
having to cope with the mentioned genres of textbooks and lectures.
The assessment of our data is proposed as a practical view of ESP from an academic and subject scope. In
this sense, it is not an exhaustive view of word behaviour, but an applied one for subject area courses. In
the following section, we review this and other relevant claims made.
CONCLUSIONS
The principal aim in this paper has been to provide evidence that supports the distinction of common core
language from restricted lexical behaviour. A central assumption is that two separate levels exist in our
sources: academic and technical. Nevertheless, the lexical classification of our specific corpus indicates
that both planes divide into further categories of word use. These encompass regions of lexical use where
academic and technical elements appear to coalesce, though only in appearance, since, as has been
observed, specialised use remains finely differentiated.
In a corpus that is representative of both academic and technical material in our selected areas, seeking
lexical behaviour patterns is primarily done according to contextual parameters. This is achieved by
applying genre and subject variables. The aid of study programs and university curricula becomes
essential in this respect, while applying ESP principles is required for consistency. The chief purpose is to
collect texts that meet language and content demands in our setting.
Our approach to the data includes empirical observation, classification, and assessment of lexical patterns.
In this process, measurement is carried out quantitatively, that is, in the form of absolute and relative
word frequencies. These are essential reference statistics used for contrastive analysis: They serve as
the point of departure in the contextual study. Keywords then play a decisive role in lexical profiles, which
demand a qualitative treatment of the data. This means classification of patterns based on frequency and
dispersion.
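The measures named here can be illustrated with a short sketch. It assumes that dispersion is approximated by range, that is, the number of texts in which a word occurs; the article does not specify the dispersion measure, so this choice, the function name frequency_profile, and the toy texts are all assumptions made for illustration.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_profile(word, texts, per=1000):
    """Absolute frequency, relative frequency (per `per` tokens), and range."""
    counts = [Counter(tokenize(t)) for t in texts]
    absolute = sum(c[word] for c in counts)
    total_tokens = sum(sum(c.values()) for c in counts)
    relative = absolute / total_tokens * per if total_tokens else 0.0
    # Range: in how many texts does the word occur at least once?
    text_range = sum(1 for c in counts if c[word] > 0)
    return {
        "absolute": absolute,
        "relative": relative,
        "range": text_range,
        "texts": len(texts),
    }

# Toy data (hypothetical): three very short "texts".
toy_texts = ["the project plan", "the project team met", "a signal was sent"]
profile = frequency_profile("project", toy_texts)
print(profile)
```

Relative frequency makes counts comparable across subcorpora of different sizes, while range separates a word that is frequent everywhere from one that is frequent in a single source.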
As a result, the analysis assesses data as occurring either broadly across texts or more narrowly
within certain sources. On this basis, the results suggest three main types of lexical behaviour: common
core elements in the whole collection, specific words in themes and topics, and elements that are
characteristic of only one genre. The three are surveyed through analytical steps related to ESP notions:
Settings are defined and described according to specific learning needs.
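The three types of lexical behaviour named above can be pictured as a rough sorting rule over texts labelled by subject and genre. The sketch below is a deliberate simplification and not the classification procedure of this study, which relies on keywords and concordance evidence: the labels, the decision order, the function name classify, and the toy corpus are all hypothetical.

```python
import re

def word_set(text):
    """Return the set of lowercase alphabetic word types in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(word, corpus):
    """Classify a word by its spread over (subject, genre, text) entries."""
    subjects = {subject for subject, _, _ in corpus}
    genres = {genre for _, genre, _ in corpus}
    hits = [(s, g) for s, g, text in corpus if word in word_set(text)]
    hit_subjects = {s for s, _ in hits}
    hit_genres = {g for _, g in hits}
    if not hits:
        return "absent"
    if len(hit_genres) == 1 and len(genres) > 1:
        return "genre-based"
    if len(hit_subjects) == 1 and len(subjects) > 1:
        return "subject-specific"
    if hit_subjects == subjects and hit_genres == genres:
        return "common core"
    return "intermediate"

# Toy corpus (hypothetical): two subjects, two genres.
toy_corpus = [
    ("computing", "article", "the project uses a compiler"),
    ("computing", "textbook", "the compiler reads the design"),
    ("librarianship", "article", "the project covers the design of a catalogue"),
    ("librarianship", "textbook", "the design of the catalogue"),
]

for item in ("design", "project", "compiler"):
    print(item, "->", classify(item, toy_corpus))
```

A rule of this kind only records presence or absence; in practice frequency thresholds and inspection of concordance lines would be needed before assigning an item to a category.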
In the evaluation of lexical information, academic and technical word behaviour is discussed. Eight
categories are derived by investigating the relationship between concordance data and context. The way
in which these language peculiarities are developed affects our approach to ESP courses.
Common core elements are divided into general academic items and procedural words. Both demonstrate
a widespread distribution throughout the corpus, and are subject-independent. This makes us consider
them as semi-technical vocabulary. They include content and grammar items that have either a high or
low frequency in the corpus. Their function is inferred for EGAP (English for General Academic
Purposes) teaching, mainly through the application of academic tasks, for example, using wordlists to
point out lexical data in readings. EAP (English for Academic Purposes) thus motivates our work with
EST (English for Science and Technology).
Procedural items are common core constructions that mark cohesion in academic discourse. This is a
main characteristic in general academic writing as well as in lectures. Their organisation in discourse
facilitates comprehension. In contrast with general academic collocations, procedural elements include
grammar combinations that have a semantic prosody related to the organisation of discourse.
Regarding subject-based formations, the degree of restriction in the collocations influences the lexical
divisions made. In the case of technical vocabulary, combinations are quite fixed. The elements are highly
restricted in their behaviour, meaning that they exhibit a consolidated use in the thematic setting where
they are identified as key. Through a detailed review of concordance lines, technical compounds are
examined within longer phrases. This description is done in a manner resembling specialised
dictionary-making, where key constructions function as descriptors for the subject area.
Concordance observation is also useful for underlining thematic influence on those collocations that are
not strictly technical. These are valued as significant feedback in the subject area, but denote a less fixed
behaviour. This means that they can be replaced by synonymous expressions without making a significant
change. However, their use is characteristic in certain subjects, and not in others. In this sense, even
though they tend to be easy to understand, they are also considered specific to the subject area.
Discipline-based elements are also distinctive in the subject area. They would be found in the middle
ground between general academic expressions and specific language. In this respect, they are treated as
common lexical items, identified in different areas, but prevailing in only one. They are conventional
within the discipline, referring to aspects that are frequent and widespread. In EGAP development, work
with these elements enhances the use of academic language for particular areas.
A different case is the lexical data that refers to concepts exclusive to one discipline. In that
situation, ESAP is favoured: Tasks challenge learners to cope with specific content in their studies.
Knowledge of technical issues is fostered through activities that demand exploitation of conceptual
paragraphs, for example, by elaborating oral reports.
Finally, lexical features are analysed in the genre context. In addition, as emphasis is placed on subjects,
the genre setting includes thematic items. Both subject and academic elements raise genre awareness in
this context. This is especially useful for ESAP writing performance. Genre-based items can present
restricted patterns of lexical behaviour, developed within one single genre, or even topic. The elements
behave as descriptive items, but the difference is that they may do so in the overall genre, regardless of
topic influence, or in a specific subject conveyed through particular genre conventions.
The information obtained and described in this article is therefore assessed for ESP development.
However, it is not intended as a theory of lexical behaviour in academic and technical contexts. On the
contrary, its validity depends heavily on practical factors which lead to the design of specific corpora.
Large textual collections can serve as reference for the analysis of our data, but do not meet specific
learning demands as fittingly as one's own corpus can. In fact, we believe that only material that is
representative of the teaching environment can fully meet specific language requirements.
NOTES
1. The sources may either include one major discipline, such as the Dictionary of Computing (Collin,
1999), or more than one area, as is the case with the Dictionary of New Media: Film, Television,
Print, Digital, Internet, Multimedia (1999).
2. The Spanish titles are "Informática técnica" and "Ingeniería Informática," "Sonido e Imagen,"
"Biblioteconomía y Documentación," and "Comunicación Audio-visual" (see University of
Extremadura Web page at http://www.unex.es/).
3. The university curricula consulted in Spain (in addition to our own institution) are as follows: For
Computer Science and Optical and Radio Communications, Facultad de Informática, Universidad
Politécnica (Madrid), Universidad Politécnica de Valencia, and Universidad de Vigo (Departamento
de Teoría de la Señal y Comunicaciones). For Librarianship and Information Management, Facultad
de Biblioteconomía y Documentación (Universidad de Granada). For Audio-visual Communication,
Instituto Universitario del Audio-visual (Universitat Pompeu Fabra, Barcelona) and Facultad de
Ciencias de la Información (Universidad Complutense de Madrid).
4. Guidance offered by content instructors is highly valued in the process of textual selection. In
addition, as mentioned above, advanced learners' knowledge can produce positive results. Internal
(ESP) approaches can thus benefit from these external factors provided by the institution.
5. In fact, the elaboration of a broader corpus that incorporates business texts leads us in such a
direction: to integrate material that is generally useful for information technology majors as well as
business students, as they cope with common issues and concepts.
6. Stotsky (1983, p. 438) refers to "words that contribute to cohesive ties in academic discourse ...
usually the content words generated by authors writing on similar topics." These words are also
common core, offering greater difficulty to non-native or overseas students because they are "often
abstract and / or complex."
ABOUT THE AUTHOR
Alejandro Curado teaches English for Computer Science and Telecommunications at the Polytechnic
School at the University of Extremadura (Spain). His doctoral thesis (2000) presents lexical findings
according to genre and subject in specific settings. His research aims to integrate both discourse and
corpus-based lexical approaches to teaching ESP.
E-mail: acurado@unex.es
REFERENCES
Benson, M., Benson, E., & Ilson, R. (1997). The BBI dictionary of English word combinations.
Amsterdam: John Benjamins.
Bergenholtz, H., & Tarp, S. (1995). Manual of specialised lexicography. Amsterdam: John Benjamins.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use.
Cambridge, UK: Cambridge University Press.
Brennan, M., & van Naerssen, M. (1989). Language and content in ESP. ELT Journal, 43(3), 196-205.
Bygate, M. (1987). Speaking. Oxford, UK: Oxford University Press.
Callev, H. (2000). The stream of consciousness. Film-Philosophy, 4(11). Retrieved August 15, 2001,
from the World Wide Web: http://www.film-philosophy.com/vol4-2000/n11callev.
Collin, S. (1997). Dictionary of information technology. London: HarperCollins.
Collin, S. (1999). Dictionary of computing. London: HarperCollins.
Conrad, S. (1996). Investigating academic texts with corpus-based techniques: An example from
biology. Linguistics and Education, 8, 299-326.
Cowie, A. P. (1978). The place of illustrative material and collocations in the design of a learner's
dictionary. In honour of A.S. Hornby. Oxford, UK: Oxford University Press.
Cowie, A. P. (1998). Introduction. In A. P. Cowie (Ed.), Phraseology: Theory, analysis and applications
(pp. 1-38). Oxford, UK: Clarendon Press.
Coxhead, A. (1998). An academic word list. English Language Institute Occasional Publication No 18.
New Zealand: Victoria University of Wellington.
Díaz, J. C., & Jones, M. (1999). Computer language. Madrid: UNED.
Dictionary of new media: Film, television, print, digital, Internet, multimedia. (1999). New York:
Readfilm.
Dudley-Evans, T., & St John, M. J. (1998). Developments in ESP: A multidisciplinary approach.
Cambridge, UK: Cambridge University Press.
Edwards, P. (1996). The LSP teacher: To be or not to be? That is the question. AELFE (Asociación
española de lenguas para fines específicos), 9-25.
Ewer, J. (1983). Teacher training for EST: Problems and methods. The ESP Journal, 2, 9-31.
Farrell, P. (1990). A lexical analysis of the English of electronics and a study of semi-technical
vocabulary. Dublin: Trinity College.
FETC & NECC Conference. (1999). Excerpts of paper “The Do’s and Dont’s of Technology Planning.”
Retrieved August 15, 2001 from the World Wide Web: http://fetc.state.fl.us/.
Firth, J. R. (1957). A synopsis of linguistic theory. 1930-1955. In J. R. Firth (Ed.), Studies in linguistic
analysis (pp. 1-55). Oxford, UK: Basil Blackwell.
Flowerdew, J., & Miller, L. (1997). The teaching of academic listening comprehension and the question
of authenticity. English for Specific Purposes, 16(1), 27-46.
Halliday, M. A. K. (1966). Lexis as a linguistic level. In C. E. Bazell, J. C. Catford, M. A. K. Halliday, &
R. H. Robins (Eds.), In memory of J. R. Firth (pp. 148-162). London: Longman.
Hutchinson, T., & Waters, A. (1981). Performance and competence in ESP. Applied Linguistics, 2(1), 56-69.
James, G. (1994). English in computer science. A corpus-based lexical analysis. Hong Kong: Longman.
Johns, A. M. (1997). Text, role and context. Cambridge, UK: Cambridge University Press.
Johns, T. (1993). Data-driven learning: An update. TELL & CALL, 3, 23-32.
Jordan, R. R. (1997). English for academic purposes. Cambridge, UK: Cambridge University Press.
Lozano Palacios, A. (1999). Vocabulario para los estudios de Biblio-documentación [Vocabulary for
library science and documentation studies]. Granada: Servicio de publicaciones, Universidad de Granada,
Facultad de Biblioteconomía y Documentación.
McCarthy, M. (1990). Vocabulary. Oxford, UK: Oxford University Press.
Ooi, V. B. Y. (1998). Computer corpus lexicography. Edinburgh: Edinburgh University Press.
Pedersen, J. (1995). The identification and selection of collocations in technical dictionaries.
Lexicographia, 11, 60-73.
Scott, M. (1996). WordSmith. Oxford, UK: Oxford University Press.
Scott, M. (1997). PC analysis of key words and key key words. System, 25(1), 1-13.
Scott, M. (2000). Reverberations of an echo. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.),
Practical applications in language corpora. Frankfurt: Peter Lang.
Stotsky, S. (1983). Types of lexical cohesion in expository writing: Implications for developing the
vocabulary of academic discourse. College Composition and Communication, 34(4), 430-446.
Stubbs, M. (1995). Collocations and semantic profiles: On the cause of the trouble with quantitative
studies. Functions of Language, 2, 23-55.
Termite Database (1999). ITU Global Directory Telecommunication Terminology.
Thurstun, J., & Candlin, C. (1998). Concordancing and the teaching of the vocabulary of academic
English. English for Specific Purposes, 17(3), 267-280.
Tribble, C. (1997). Improvising corpora for ELT: Quick and dirty ways of developing corpora for
language teaching. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in
language corpora (pp. 106-118). Lodz, Poland: Lodz University Press.
Tribble, C. (2000). Genres, keywords, teaching: towards a pedagogic account of the language of project
proposals. In L. Burnard & T. McEnery (Eds.), Rethinking language pedagogy from a corpus perspective
(pp. 75-90). Frankfurt: Peter Lang. Retrieved August 15, 2001 from the World Wide Web:
http://ourworld.compuserve.com/homepages/Christopher_Tribble/Genre.htm.
Trimble, L. (1985). English for science and technology: A discourse approach. Cambridge, UK:
Cambridge University Press.
Varantola, K. (1984). On noun phrase structures in engineering English. Turku: Annales Universitatis
Turkuensis.
Language Learning & Technology
http://llt.msu.edu/vol5num3/mollering/
September 2001, Vol. 5, Num. 3
pp. 130-151
TEACHING GERMAN MODAL PARTICLES:
A CORPUS-BASED APPROACH
Martina Möllering
Macquarie University, Sydney
ABSTRACT
The comprehension and correct use of German modal particles pose manifold problems
for learners of German as a foreign language since the meaning of these particles is
complex and highly dependent on contextual features which can be linguistic as well as
situational. Following the premise that German modal particles occur with greater
frequency in the spoken language, the article presents an analysis which is based on
corpora representing spoken German. The concept "spoken language" is discussed
critically with regard to the corpora chosen for analysis and narrowed down in relation to
the use of modal particles. The analysis is based on the following corpora: Freiburger
Korpus, Dialogstrukturenkorpus, and Pfeffer-Korpus. In addition, a collection of
telephone conversations (Brons-Albert, 1984) was scanned into computer-readable files
and analysed using MicroConcord (Scott & Johns, 1993). A quantitative analysis was
carried out on all corpora. The qualitative analysis was limited to the telephone
conversations and examined the constraints on and functions of the different occurrences of
the form eben.
INTRODUCTION
Discourse particles occur in a variety of languages and have been analysed in great detail for the English
language by Schiffrin (1987). Particles of the modal particle type are prevalent in West-Germanic
languages: Dutch, Frisian, and German (e.g., de Vriendt, Vandeweghe, & Van de Craen, 1991; Abraham,
1991a for the link between German, Frisian, and Dutch; Aijmer, 1997, for Swedish). Research interest in
German modal particles arose in the late 1960s with the advent of a more pragmatically oriented approach
to linguistics. They started to shed their image as superfluous, stylistically dubious "fillers" that had to be
avoided in "proper German" (Busse, 1992). Since Kriwonossow's (1963, first published in 1977) and
Weydt's (1969) seminal studies on German modal particles, a large body of work on the subject has
emerged. In those publications, different terms are used for the words that are here described as "modal
particles." Thus, we find for example, "flavouring words" [Würzwörter] (Paneth, 1981), "intentional
particles" [Intentionale Partikeln] (Rall, 1981), "pragmatic particles" (Held, 1983), "discourse particles"
(Abraham, 1991b) and "toning particles" [Abtönungspartikeln] (Helbig, 1994), the term which together
with the German "Modalpartikel" (Thurmair, 1989) is the most commonly used. In a number of
publications (Dalmas, 1990, 1992; Rudolph, 1991), however, the word particle is used without further
specification.
The term particle stems from a structural approach to categorising the various parts of speech into word
classes based on the inflexional properties of words. In accordance with this morphological criterion, the
term particle is often used to refer to "non-declinables," that is, in German, the large group of words that
cannot be considered as part of the word classes noun, adjective, verb, article, or pronoun. In this sense,
particles may be adverbs, conjunctions, prepositions, interjections (Helbig, 1994), sentence adverbs
(Thurmair, 1989), and particles in a narrower sense:
Copyright 2001, ISSN 1094-3501
Particles as Word Class
A word like aber, for example, which is a particle in the broader sense as it cannot be inflected, can be
categorised as a member of the word class conjunction as well as of the class particles in a narrower
sense, specifically, as modal particle (e.g., Bublitz, 1977) depending on the linguistic context in which it
occurs. Thus, in a word class definition, the words considered as modal particles all have at least one
homonym in another class or subclass, depending on the model of categorisation (for a critical discussion
see, e.g., Helbig, 1989). In the research literature the term particle is commonly used in its narrower
sense, excluding the other groups of non-declinables. The word class particle in the narrower sense is
then seen to include subcategories, modal particles being one of them. The following subcategories have
been described (Helbig, 1994, p.31):
A plethora of publications within different theoretical frameworks have dealt with the pragmatic and
discursive functions fulfilled by modal particles. These functions are described, for example, in terms of
the management of interaction (Franck, 1979), as constituting consensus (Lütten, 1979), as a guidance for
the hearer (Rehbein, 1979) and as playing a part in establishing text coherence (Rudolph, 1989). There is
agreement, though, on the fact that the function of German modal particles is illocutionary and
interpersonal rather than propositional. In very general terms, modal particles indicate the speaker's
attitude towards the utterance as well as the intended perception on the part of the hearer. Modal particles
may point to the interlocutors' common knowledge, to the speaker's or listener's suppositions and
expectations, and they may create cohesion with previous utterances or mark the speaker's evaluation of
the importance of an utterance (e.g., Abraham, 1991a, 1991b; Helbig, 1994; Thurmair, 1989). However,
foreign language learners of German do not properly understand modal particles and rarely use them
(Möllering & Nunan, 1995). This reflects a lack of sensitivity to an important feature of German
communication, which might lead to misunderstandings and/or misinterpretations.
Modal Particles in Second Language Acquisition
Research findings (Husso, 1981; Rall, 1981; Steinmüller, 1981; Weydt, 1981) provide an ambiguous
picture of the relationship between language acquisition in general and the acquisition of modal particles,
but there is agreement on a much lower frequency of use by non-native speakers. Learners who received
instruction in German as a foreign language did not perceive the communicative value of particles as very
high (Harden & Rösler, 1981; Möllering & Nunan, 1995). Research findings on the acquisition of modal
particles in uninstructed contexts (Kutsch, 1985; Cheon-Kostrzewa & Kostrzewa, 1997a, 1997b) have
shown that the acquisition process is influenced by the fact that each particle is used in a variety of
functions. Particle functions are acquired in an accumulative manner over a long period of time. The
distinction between modal particles and their homonyms is therefore a major teaching objective (see also
Busse, 1992). Research findings on the teaching of pragmatic language features in general (see Kasper,
2000, for an overview) have provided promising results which allow for the hypothesis that explicit
instruction of different particle functions could accelerate and enhance the acquisition process. The
approach to teaching modal particles I would like to propose here is concerned with learners'
comprehension of modal particle meanings in context. Research in interlanguage pragmatics has shown
that teaching pragmatic features of language is facilitative and necessary when input is lacking or less
salient and that explicit instruction is particularly effective in the area of consciousness raising (Kasper &
Rose, 1999, pp. 96-97). The concept of "consciousness-raising" (e.g., Rutherford, 1986) refers to the
refinement of learners' metacommunicative awareness, that is, their ability to judge the relationship
between a form and its meaning in context. It is this type of awareness that needs to be honed for a learner
to comprehend the intricacies of particle meanings. With McCarthy and Carter (1994), I would like to
argue that language awareness is not necessarily best taught by direct input language teaching:
That is to say the normal presentation-practise-production cycles should not be seen as
binding for all features of discourse, and in the case of [discourse] markers, these would
seem to be a feature best handled by other types of activity: language-observation
activities, problem-solving, perhaps cross-linguistic comparisons. (p. 68)
The approach I would like to propose is based on authentic language data as collected in a number of
corpora of spoken German. Rather than providing the learner with a list of grammatical particle functions
supplemented by examples on the sentence level (e.g., Helbig & Helbig, 1995), an analysis of such
corpora yields examples of particles in context. With the use of concordancing procedures, patterns of
collocation can be established and made salient for learners of German.
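The concordancing procedure mentioned here can be illustrated with a minimal keyword-in-context (KWIC) sketch. It is not a reimplementation of MicroConcord or of any other concordancer; the function name kwic, the window size, and the sample text are assumptions made for illustration.

```python
import re

def kwic(node, text, width=3):
    """Return (left context, node, right context) for each hit of the node word."""
    # \w matches letters such as ä, ö, ü, and ß in Python 3 strings.
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Sample text (hypothetical), built from two short German sentences.
sample = "Das wird aber ziehen. Aber das wundert mich."
concordance = kwic("aber", sample, width=2)

for left, node, right in concordance:
    # Align the node word in a central column, as concordancers do.
    print(f"{left:>20}  {node}  {right}")
```

Aligning the node word in a central column makes recurrent neighbours on either side easy to see, which is what allows collocational patterns to be made salient for learners. Because punctuation is discarded in this sketch, context windows may run across sentence boundaries.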
Non-native speakers might perceive German speech acts such as "request" or "voicing of opinion" as very
direct (Rall, 1981) if they merely look at the syntactic mode of the encoding of a particular speech act
without perceiving the modifications brought about by the use of modal particles (House & Kasper,
1981). The following example might illustrate this:
a) Es ist nicht einfach, dieses Problem zu lösen.
[It is not easy this problem to solve]
This problem is not easily solved.
b) Es ist ja nicht einfach, dieses Problem zu lösen.
[It is (ja) not easy this problem to solve]
This problem is not easily solved (as you know).
c) Es ist doch nicht einfach, dieses Problem zu lösen.
[It is (doch) not easy this problem to solve]
(But you will agree that) this problem is not easily solved.
Whereas native speakers might perceive (a), as a turn in a discussion, to be quite abrupt, (b) and (c) involve
the hearer's anticipated point of view. In (b), a shared opinion is assumed, while (c) expresses the wish to
overcome a perceived difference of opinion (Weydt, 1983). Modal particles create "conversational
cohesion" (Schiffrin, 1987), in the case of doch and ja by reference to shared knowledge.
One reason why the comprehension of modal particles is difficult for non-native speakers is the fact that
all modal particles have at least one homonym. As many particles occur in a variety of functions, criteria
such as position within the sentence play a role in determining whether a particle occurs as modal particle,
as connective, adverb of time, and so forth. The following sample of natural language data, which is an
excerpt from a discussion between secondary students and a well known German author, illustrates the
point. It provides an example of particles in use in authentic spoken German.1 Amongst others, the
particle aber occurs frequently:
A:
Ich nehm' Ihnen das ehrlich gesagt gar nich' ab. Ich hab' den Verdacht, ich meine, natürlich werd'
ich mich wahrscheinlich sogar irren, ABER (1) daß Sie die Sache so geschrieben haben, daß Sie
eben sagen "na schön," dann haben Sie sich das überlegt, und dann haben Sie die Stelle gelesen und
haben sich gesagt "na Donnerwetter, das wird ABER (2) ziehen, die werden ABER (3) staunen,
was ich mich so, was ich mir so alles traue..."
B:
ja ja, . . .
(students laughing)
wenn für mich als Autor der Begriff 'lieber Gott' etwas genau so Banales und Liebenswertes und
Unbestimmtes ist wie der Begriff 'Mädchen' (...) dann kann ich das ohne weiteres in einer Reihe
nennen. ABER (4) daß sie den lieben Gott für so leicht zu beleidigen halten, also das wundert
mich.
In (1) and (4) aber is used as a connective. It connects the clause it appears in to the preceding one and
thus creates cohesion (Halliday & Hasan, 1976) at the textual level. This function can be
realised in English by using the conjunction "but".
ABER (1)
ABER daß Sie die Sache so geschrieben haben ....
BUT [the fact] that you've written it in that particular way...
ABER (4)
ABER daß Sie den lieben Gott für so leicht zu beleidigen halten...
BUT [the fact] that you think our Lord could be insulted as easily as that...
As a connective, aber occurs mainly at the beginning of a clause. Its reference is anaphoric; it expresses
contrast in its immediate context, that is, to the preceding proposition or propositions.
In (2) and (3) aber appears as a modal particle. Here, it is not as easily translated into English.
ABER (2)
... das wird ABER ziehen...
that will [ABER] be a success
ABER (3)
...die werden ABER staunen...
they will [ABER] be surprised
In these instances, aber expresses surprise and an approximation would be the following translations:
ABER (2)
... das wird ABER ziehen...
boy, what a success that is going to be
boy, that'll / will that ever go down well
ABER (3)
...die werden ABER staunen...
they're going to be surprised, I can tell you
they're gonna be absolutely baffled/astonished
Language learners are regularly faced with the task of distinguishing between the different meanings of a
particle like aber. It is the contention of this paper that they may be aided in this by an analysis of real-language data which unveils structures, patterns, and predictable features regarding a particle's different
usages. The exploitation of language corpora is proposed here in order to arrive at authentic teaching
materials which facilitate the comprehension of German modal particles. The association patterns which
were of particular interest in this investigation are linguistic features in terms of lexical and grammatical
associations (Biber, Conrad, & Reppen, 1998, p. 6). Non-linguistic associations like the distribution of
modal particles across registers have been dealt with to some degree through the selection of corpora for
the analysis, while distribution across dialects or across time periods was not examined.
Occurrences of Modal Particles in Different Text Types
Following the definition that a text is "either spoken or written discourse, so that for example the words
used in a conversation (or their written transcription) constitute a text" (Fairclough, 1995, p. 4), modal
particles occur more frequently in spoken than in written texts. Rudolph (1991) found that in
conversation, particles and conjunctions are used almost three times as frequently as in journalistic and
literary texts, but she does not provide a specific analysis of words in modal particle function, as her
definition of particles is a very wide one. She classified text types according to the supposed dichotomies
of oral/written and fictional/non-fictional and investigated the text types everyday conversations
(oral/non-fictional), newspaper articles (written/non-fictional), and (sections from) narrative texts
(written/fictional) for the occurrence of particles.
The assumption of a distinction between spoken and written texts as a dichotomy has been challenged.
Biber (1988), for instance, proposes no such dichotomy of dimensions across texts, no clear-cut
distinction between spoken and written texts, but multidimensional distinctions. McCarthy (1993) uses
the terminology "spoken and written medium" but also describes complexities and mixing. He proposes
as a useful distinction the terminology of medium which "is concerned with how the message is
transmitted to its receivers" and mode which "is concerned with how it is composed stylistically, that is,
with reference to sociolinguistically grounded norms of archetypical speech and archetypical writing.
These norms are norms of appropriacy, culturally conditioned on a cline of 'writtenness' and
'spokenness'." (McCarthy, 1993, p. 171)
Following this distinction, the database chosen for this study consists of four corpora of spoken German
in the sense of "medium: spoken." Three of the corpora are held at the German Language Institute
(Institut für Deutsche Sprache, IDS), namely the "Freiburger Korpus (FKO)," "Dialogstrukturenkorpus
(DSK)," and "PFEFFER-Korpus (PFE)." The fourth corpus consists of a collection of telephone
conversations published by Brons-Albert (1984).
Freiburger Korpus (FKO). The corpus consists of 224 texts with a total of 700,000 words. It was
compiled mainly between 1966 and 1972 as part of a project at the IDS that aimed at describing
"grammatical and stylistic" features of spoken German. Audio recordings from radio and television
broadcasts as well as other recordings of private and public speech events were collected. Speakers were
either not aware of being recorded or recording was a natural part of the speech event (as in the radio and
television broadcasts), and they did not know that their productions were to be linguistically analysed.
The recordings have been transcribed and categorised into discussions, interviews, talks, reports, and
narrations.
Dialogstrukturenkorpus (DSK). This corpus contains 72 texts with about 200,000 words. It was
compiled by a group of researchers of the German department at Freiburg University in conjunction with
the IDS in the periods 1968 - 1972 and 1974 - 1977 in order to further analyse the organisation of natural
conversation (see FKO). It consists mainly of interviews (radio and television broadcasts) and
discussions.
Pfeffer-Korpus (PFE). Compiled by A. Pfeffer and W. Lohnes at Stanford University, California, in the
early 1960s, the corpus comprises 398 texts with a total of 650,000 words. Recordings were made in 56
different areas of Germany, Austria, and Switzerland with a total of 400 different speakers. Each
recording is about 12 minutes in length (about 1,500 words) on 1 of 25 topics. The subjects (with a spread
of age, sex, education, and profession following a statistical analysis) were interviewed on those topics in
397 of the texts; text 398 is a group discussion between four speakers.
All three corpora can be accessed via a data retrieval system, COSMAS (Institut für Deutsche Sprache,
1999), developed at the IDS. It allows an analysis of the data through frequency counts and
concordancing procedures which makes it possible to search all three corpora of transcribed spoken
German -- with a total of about 1.5 million words -- for occurrences of particles in context. An update of
the PFEFFER-Korpus (Jones, 1997) was not yet accessible (personal communication with Jones) at the
time of data analysis.
Telephone conversations (BRO; Brons-Albert, 1984). This collection is made up of 35 texts and includes
a total of about 44,000 words. The data were arrived at by recording telephone conversations which the
researcher, Brons-Albert, had on her private phone over a period of 10 months. Callers were unaware of
being recorded. With permission of the individual speakers, a selection of conversations was transcribed
and published. For each dialogue, information on the speakers' age, profession and/or education, dialect,
and the relationship between the speakers is provided. For the purpose of the present study, the printed
texts were scanned into computer-readable files to make them accessible for concordancing.
QUANTITATIVE ANALYSIS
The first step in the process of data analysis was to establish the frequency of particles which could
potentially function as modal particles2 in the four corpora. Frequency of occurrence has been advanced
as one grading criterion (Busse, 1992; Vorderwülbecke, 1981) for the teaching of modal particles. Taking
into account the multifunctionality of particles and learners' difficulties with distinguishing different
particle functions, the term particle frequency can be seen as ambivalent. The term frequency might, on
the one hand, refer to the occurrence of a word in modal particle function, or it might refer to all
occurrences of a word, of which only some might be occurrences in modal particle function. In the
present study, particle frequency is addressed in two steps: first, the overall frequency of particles in the
corpora of spoken German is established in order to determine how salient each particle would be for a
learner of German. A subset of the occurrences of the particle eben is then analysed qualitatively. The
qualitative analysis provides a distinction between frequency of occurrences in modal particle function
and other functions.
The three corpora held at the IDS (DSK, FKO, PFE) were searched with the help of COSMAS (Institut
für Deutsche Sprache, 1999); the fourth corpus (BRO) was searched using MicroConcord (Scott & Johns,
1993). The total number of occurrences of each word in each of the corpora was established. As the
different corpora vary considerably in size, raw counts of frequency were normalised to make counts
comparable. Frequency per 1,000 words of text was chosen as a basis of comparison.3 The following table
provides an overview of particle frequency in all four corpora, that is, over a total of nearly 1,600,000
words:
Table 1. Frequency of Word Occurrence per 1,000 Words in the Four Corpora
Most striking is the frequency of ja with 19.5 occurrences per 1,000 words overall, which is more than
double the frequency of the next word in line auch with 8.9 occurrences, followed by aber with 5.9
occurrences per 1,000 words. Then follow mal with 4.4 occurrences down to eben with 2.1. More than
half the particles analysed occur with an average frequency of less than 2 (vielleicht 1.5, down to eh and
ruhig with 0.1).
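The normalisation behind these figures (raw counts scaled to occurrences per 1,000 words) can be illustrated with a minimal Python sketch. The corpus sizes are the approximate word counts given in the corpus descriptions; the raw counts and the names per_thousand, CORPUS_WORDS, and hypothetical_counts are assumptions introduced purely for illustration and are not data from the study.

```python
def per_thousand(raw_count: int, corpus_words: int) -> float:
    """Return occurrences per 1,000 words, rounded to one decimal place."""
    if corpus_words <= 0:
        raise ValueError("corpus size must be positive")
    return round(raw_count * 1000 / corpus_words, 1)


# Approximate corpus sizes as reported in the corpus descriptions.
CORPUS_WORDS = {"FKO": 700_000, "DSK": 200_000, "PFE": 650_000, "BRO": 44_000}

# Hypothetical raw counts for a single word, invented for illustration only.
hypothetical_counts = {"FKO": 1_400, "DSK": 500, "PFE": 650, "BRO": 88}

# Normalised frequencies make corpora of very different sizes comparable.
normalised = {
    name: per_thousand(hypothetical_counts[name], size)
    for name, size in CORPUS_WORDS.items()
}
print(normalised)
```

Without this step, a word with 1,400 hits in FKO would look far more salient than one with 88 hits in BRO, although both amount to the same rate per 1,000 words.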
The following table presents the frequency of occurrence per 1,000 words in the four different corpora:
Table 2. Frequency per 1,000 Words, All
Both tables show clearly that ja, auch, and aber are the most salient, followed by a second group made up
of mal, doch, schon, denn, nur, and eben. Again ja provides the most striking pattern with an enormous
variation of frequency between the four corpora. It is most frequent in the BRO corpus with 33.7
occurrences per 1,000 words, followed by 19.8 in DSK, 13.4 in FKO, and 10.9 in PFE. The most
frequently occurring words with a potential for modal particle function (ja down to denn) occur with a
particularly high frequency in BRO.
The existing corpora of spoken German are relatively small in comparison to the corpora available for
spoken English, for example, the British National Corpus with a spoken component of about 10 million
words (see Berglund, 1999). The composition of the different corpora indicates that although they can be
broadly classified as "spoken German," there are significant differences with regard to "mode"
(McCarthy, 1993). German modal particles have been found to occur most frequently in texts which are
informal, personal, associative, and with a high level of familiarity (Hentschel, 1986). In particular, the
level of informality and familiarity of speakers with one another varies considerably between the four
corpora. The predefined corpus text categories provided in the description of the corpora held at the IDS
are rather broad. Although all the corpora comprise dialogues, the nature of these dialogues in FKO and
DSK is rarely personal. The dialogues in PFE are determined by the method of data collection: an
interviewer talking to a person s/he is not familiar with. The ensuing dialogues are in fact largely
monologic as the interviewer's brief questions prompt long stretches of narrative on the part of the person
interviewed. The highest level of informality and familiarity between speakers can be found in the
compilation of texts by Brons-Albert (1984) which, for this study, led to the decision to concentrate on
those texts in the qualitative analysis of the data (for a discussion of text categories with regard to
formality, see Sigley, 1997).
Qualitative Data Analysis: EBEN
The second stage of the analysis investigated modal particles in context in order to establish patterns of
collocation in terms of lexical co-occurrence as well as co-occurrence with certain grammatical choices
(Sinclair, 1991). To this end KWIC (Key Word In Context) concordances were compiled of the BRO data
using the concordancing software package MicroConcord (Scott & Johns, 1993).
The concordancing software used in analysing the corpus data lists the occurrences of the word under
investigation in context, but is not able to distinguish between different functions of the word in question.
"Tagging," where researchers have marked words in a corpus as belonging to categories like verb, noun,
subjunctor (for a more detailed discussion of tagging see Biber, Conrad, & Reppen, 1998, p. 261f) is not
available for particle functions (Jones in Wichmann, Fligelstone, McEnery, & Knowles, 1997, p. 152) and
a qualitative analysis was necessary to distinguish between modal particle function and others.
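What a KWIC concordancer does, and why it cannot separate particle functions on its own, can be sketched in a few lines of Python. This is not MicroConcord's implementation; the function name kwic, the context width, and the toy sample string (assembled from fragments quoted in this article) are assumptions made for illustration.

```python
import re


def kwic(text: str, keyword: str, width: int = 36) -> list[tuple[str, str, str]]:
    """Return (left context, match, right context) for each whole-word hit."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


# Toy sample stitched together from fragments quoted in this article.
sample = "Das is ja eben dat Doofe. Eben! Da muß ich ma eben nachgucken."

for left, hit, right in kwic(sample, "eben", width=20):
    # Right-align the left context so that the keyword forms a column.
    print(f"{left:>20} {hit} {right}")
```

The search matches the word form only: all three hits are returned alike, and assigning each one to a function remains a manual step.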
The BRO corpus was searched for the word in question and the ensuing concordances were categorised
by making use of the program's classification feature. Moving the cursor to the concordance line to be
categorised and entering a number allows subsequent sorting of lines according to categories (Witton,
1994). The categorisation of occurrences was in the first instance based on native speaker intuition. It had
to be carried out in many instances by looking at larger stretches of the text, as the information provided
in the KWIC concordance was often not sufficient to distinguish between different usages of the word
under investigation. In order to distinguish use in modal particle function from other possible functions of
the words in question, it was necessary to manually disambiguate each occurrence of the word to establish
patterns which language learners could be made aware of to help them distinguish modal particle
functions from others.
As one example of the qualitative analysis, an investigation of the concordance data on eben is detailed
below. Eben was chosen as it belongs to the group of more frequently occurring particles (see
Quantitative Analysis) without yielding too many occurrences for the scope of this article.
The particle eben occurs in 20 different texts in BRO, in the functions of modal particle, answering
particle and adverb of time.4
Answering Particle
As an answering particle (27 occurrences), eben is easily recognised within the concordance data, as it
can be found in the initial position of an utterance. In some instances, it appears as a complete utterance
and its representation is capitalised. This is, of course, a channel-specific measure by which the
orthographic realisation in the transcription tries to interpret the intonation patterns of the original spoken
text:5
1   höne Bücher, die man lesen kann. A:  Eben!  B: und so viele schöne Sachen, di
2    kann man so nie wissen. B: (lacht)  Eben!  C: Ja, aber hättsde das direkt ge
3   sen müssen, was wer noch kaufen. A:  Eben.  B: Würd ich sagen, dann geh ich d
It can function as the opening of an utterance, but separate from the following proposition:
4   Doppelte als die ganze Zeit, ne. B:  Eben.  Is auch schön. A: Undie Arbeit ma
5    ja nichts Schlimmes! A: Ja, ne? B:  Eben.  Solang se sich dabei wohlfühlt, si
6   alles, wozu man jetz nich kommt! B:  Eben.  Du, meine Mutter, die hatte ne gan
7   n die elf Kilo abgenommen hätte! D:  Eben,  ich denk, die is doch gar ni mehr
8   er Frau auch Frau Doktor Sounso. B:  Eben,  dann bisde ooch Herr Dokta! A: Ri
9   ort "werden" nich, anscheinend . B:  Eben,  siehsde, un, stimmt auch wirklich,
It also occurs in combination with a second answering particle "ja" (yes), "nein" (here pronounced and
transcribed as "nee"; no) or "hm":
11  , ich helf ihr, soweit's geht B: ja,  eben   A: und son bißchen Telefongespräch
12   uns ja nun wieder auch nich B: Nee,  eben   ( ) Hast du mal deinen Pullover aus
13  m Telefon merkt es ja keiner. B: Ja,  eben!  A: Und dann kriegt der hinterher
14  .. Hauptsache, es klappt! B: Och ja,  eben,  wenn et so janz jut weiterläuft, s
15  ann alles, wenn ich will, ne. D: Ja,  eben,  eben. Versteh ich. B: Naja! Zwei
16  h kann Sie nur beglückwünschen. Nee,  eben,  das war, wie C, ich war der Meinun
17  tung, das Rauchen einstellen! D: Ja,  eben,  nee / B: Das versucht manch einer
18   der Effekt ja oh ni mehr da! A: Ja,  eben,  genau! Vor allen Dingen, es geht j
19  eder mal die Eßbremse ziehen. D: Ja,  eben,  kann ich verstehn, dat kann ich ve
20  alle 8 Tage da . losjehn, ne. A: Ja,  eben.  B: Das geht nich. Könn Se noch ma
21    einer allein schuld is, ne. B: Ja,  eben.  A: Irgendwie en ganz kleinen Grun
22   irekt am 31. feiern, abends. B: Ja,  eben.  dass ja wirklich Klasse! A: Hm. J
23  les, wenn ich will, ne. D: Ja, eben,  eben.  Versteh ich. B: Naja! Zwei Kilo n
24    jetz niet, wann se kommt. A: Jaja,  eben.  So lange dauert die Fahrt ja nich.
25  esigen Vertrag beim Notar ab. B: Hm,  eben.  Nee, ganz davon abgesehen, nem. I
26   , ne? A: Jo, is ja ejal, ne. B: Ja,  eben.  (lacht) A: (lacht) B: Bis ja noc
27   uern zu sparen, zu heiraten. B: Jo,  eben.  klar, und außerdem ist das total
28   wann un wie oft er Lust hat, ne? A:  Eben,  ja. B: Paar Würste dazu oder irge
In all these occurrences it serves to confirm the previous speaker's contribution.
Adverb of Time
In its occurrences as an adverb of time (13), eben is a short form of soeben (just, a moment ago). In this
particular use it is harder to distinguish from modal particle function as its position within the clause is
similar to that of modal particles.
1    ja B: Telefon so anders. Em, ich hab  eben   mit der Frau X von der Verwaltung
2      C: C A: Guten Tag, Herr C, ich hab  eben   mit einem Kollegen von Ihnen gespr
3    Ja, hörma, wat sachsde dazu, was ich  eben.  der A erzählt hab? B: Nee, was
4    n dann? . Meine Mutter hat dich zwar  eben   schon den ganzen Quark gefragt, abe
5   u, ich wollt dir nur sagen, der Z war  eben   hier, die Schreibmaschine is also
6   a, ich hab der A eben gesacht, daß se  eben   vorbeigebracht wurde B: Ach so! C:
7    . den Abend ruhig gestalten, die war  eben   einkaufen und mußte sich danach hi
8    onsequent wär, ich war zwar sachtich  eben   noch zu C, ich bin jetz noch stolz,
9    hen das über'n DeAEs, und der meinte  eben,  ja, ich solle auf jeden Fall nen
10   ppelt belegt hat, ich seh nämlich da  eben,  Fenelon, Lettres a l'Academie hab
11    erzählt hab? B: Nee, was hasde denn  eben,  ich . hab das nich / C: Ja, von w
12  a, der X hat sich gewundert, weil ihr  eben,  als ihr ihn aus demAuto ließt, g
13  acht? Mit der A? C: Ja, ich hab der A  eben   gesacht, daß se eben vorbeigebracht
A contextual clue, however, is its collocation with one of the German tenses expressing reference to the
past: Simple Past, Present Perfect, Past Perfect. Investigating a larger stretch of the dialogue reveals that
this is the case in nearly all occurrences:
line 1:   hab...gesprochen         (Perfekt "hab" = habe)
line 2:   hab...gesprochen         (Perfekt)
line 3:   hab...erzählt            (Perfekt)
line 4:   hat...gefragt            (Perfekt)
line 5:   war                      (Imperfekt)
line 6:   wurde...vorbeigebracht   (Imperfekt, passive voice)
line 7:   war (einkaufen)          (Imperfekt)
line 8:   sachtich                 (sagte ich, Imperfekt)
line 9:   meinte                   (Imperfekt)
line 11:  hasde...                 (Perfekt: hast du ...ellipsis of past participle)
line 12:  habt gesagt              (Perfekt)
line 13:  hab...gesacht            (habe gesagt; Perfekt)
What can be established from the evidence is a strong correlation between eben in its function as soeben
(just, a moment ago) and verb forms expressing the past. For a native speaker familiar with all the
functions of eben this is quite obvious but for a learner of German recognizing this collocational pattern is
helpful in distinguishing the different meanings of the word.
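The strength of this pattern can be made explicit by tallying the tense labels from the list above. The following Python sketch only re-counts the labels given there; the variable names are assumptions for illustration, and line 10 is left out because it does not appear in the list of past-tense forms.

```python
from collections import Counter

# Tense label per concordance line, copied from the list above.
# Line 10 is absent from that list and is therefore not included here.
tense_by_line = {
    1: "Perfekt", 2: "Perfekt", 3: "Perfekt", 4: "Perfekt",
    5: "Imperfekt", 6: "Imperfekt", 7: "Imperfekt", 8: "Imperfekt",
    9: "Imperfekt", 11: "Perfekt", 12: "Perfekt", 13: "Perfekt",
}

tally = Counter(tense_by_line.values())

# 13 occurrences of eben as an adverb of time were reported in total.
total_lines = 13
past_share = sum(tally.values()) / total_lines

print(tally)
print(f"{past_share:.0%} of the lines co-occur with a past-tense form")
```

On these labels, 12 of the 13 lines co-occur with a past-tense form (7 Perfekt, 5 Imperfekt), which is what "nearly all occurrences" amounts to in numbers.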
A particular meaning of eben in its temporal function comes about when it collocates with ma(l) (12
occurrences):
1    eben / B: Ja, Augenblick, ich hör   ma eben,  Frau A: Hm. B: Ja? ((Stimme im Hin
2   onntag oder bis Montag, Momentchen   ma eben,  ja? ((20s)) Ne, das is bis zum 9. A
3    rade, das könnt nich sein, Moment   ma eben!  (lacht) Ich gebn dir ma. D: Ja, Mom
4     r Messe! A: Ah! 69 B: Da müßtich   ma eben   nachgucken, das is entweder nur bis
5    ame) C: (Straßenname)? Da muß ich   ma eben   nachguggen, nech. A: ja. ((59s)) C:
6    her ein Bier getrunken/ B: Moment   ma eben!  (zu ihrer Mutter) Ja, ich komm gleic
7             Ich mein, wenn der schon  mal eben   . dieses Knöllchen da ausgestellt ha
8      kommen? B: Ja, kommen Se morgen  mal eben,  ja? A: Is gut. Hm, danke. B: Ne? B
9     llt mir grade ein, kannst du mir  mal eben   mit kurzen Worten sagen, wie man ein
10    orz. B: Warte mal, kann ich noch  mal eben   sehen? Das is Porz, ja achthundertzw
11    Sie vielleicht freundlicherweise  mal eben   so durchrufen, wann der Herr U da Fr
12    ng an / einschalten, daß er dann  mal eben   so tickt, das hat ja nicht zu sagen,
In these instances, eben does not refer to the past, but together with ma(l) functions to point to the short
duration of an event. This is particularly apparent in lines 2, 3, and 6:
2       onntag oder bis Montag,  Momentchen ma eben,  ja? ((20s)) Ne, das is bis zum 9. A
3   erade, das könnt nich sein,      Moment ma eben!  (lacht) Ich gebn dir ma. D: Ja, Mom
6    her ein Bier getrunken/ B:      Moment ma eben!  (zu ihrer Mutter) Ja, ich komm gleic
The collocation with "Moment" and especially with its diminutive form "Momentchen" (just a
moment/wait a minute) stresses the temporal aspect as well as the short duration of the wait.
In a number of instances there is a further aspect to the combination of mal and eben:
1    eben / B: Ja, Augenblick, ich hör   ma eben,  Frau A: Hm. B: Ja? ((Stimme im Hin
4     r Messe! A: Ah! 69 B: Da müßtich   ma eben   nachgucken, das is entweder nur bis
5    ame) C: (Straßenname)? Da muß ich   ma eben   nachguggen, nech. A: ja. ((59s)) C:
8      kommen? B: Ja, kommen Se morgen  mal eben,  ja? A: Is gut. Hm, danke. B: Ne? B
9     llt mir grade ein, kannst du mir  mal eben   mit kurzen Worten sagen, wie man ein
10    orz. B: Warte mal, kann ich noch  mal eben   sehen? Das is Porz, ja achthundertzw
11    Sie vielleicht freundlicherweise  mal eben   so durchrufen, wann der Herr U da Fr
Here, the temporal aspect "it doesn't take long" also has a pragmatic function: If something does not take
long to do, then it is not much of an imposition to ask for it to be done. In lines 1, 4, and 5 the speaker
wants to assure his/her interlocutor that what is being done for him/her is not too much of an
inconvenience:
1   ich hör ma eben, Frau
    [I'll quickly find out]
4   Da müßt ich ma eben nachgucken
    [I would have to have a quick look]
5   Da muß ich ma eben nachguggen
    [I'll have to have a quick look]
In 8, 9, 10, and 11 the interlocutor is being assured that the imposition posed on him/her is minor:
8   kommen Se morgen mal eben, ja
    [why don't you quickly come by tomorrow -> why don't you drop round tomorrow]
9   kannst du mir mal eben mit kurzen Worten sagen
    [could you quickly tell me in a few words]
10  kann ich noch mal eben sehen?
    [could I have another quick look]
11  könnten Sie vielleicht freundlicherweise mal eben so durchrufen
    [would you be so kind as to give us a quick call]
Modal Particle
In the following 32 occurrences eben functions as a modal particle:
1   agst, wie das gewesen wär, du hättest  eben   damals die Bankgeschichte nich wei
2    ich. Und . da muß ich jetz am Montag  eben   . meinen Widerspruch begründen. A:
3       ne, bloß / bloß B: ah! A: muß sie  eben   für die Doktorarbeit muß sie das Ga
4    und so weiter un alles dafür. Und da  eben   möglichst, öh, viel von, ne, damit
5       zwanzig, jo. A: Aber montags geht  eben   / B: Ja, Augenblick, ich hör ma ebe
6    ch schicken will, dann schicke ik se  eben   nich. A: Doch, schick sie ruhig! I
7   dagegen Widerspruch eingelegt und muß  eben   jetz vor's Amtsgericht. A: Ja, un
8                    A: Ich mein, sie hat  eben   bloß bessere Changsen, da weiterzu
9   möglichst, öh, viel von, ne, damit se  eben   auch sagen kann, das sind nich blo
10   jetz von der Schule bringen, weil er  eben   zu autoritär is, ne, auf der Schule
11   Verrücktheit und so. Aber as Eine is  eben   doch ne Geschichte, die e
12   albe Stelle abtreten und inoffiziell  eben   nur ne Viertel. B: nur ne Viertel!
13  r du tust Milch und Zucker rein, mußt  eben   auch wieder Süßstoff nehmen und au
14   ht möglich sein, ja, weil 8.000 Mark  eben   viel Geld sind, dann, em, müßten w
15   n auch zu uns kommen, A. Ihr könnt .  eben   . einfach ma nur so vorbei / brauch
16   st nie damit gerechnet, daß die Bank  eben   so'n Mist macht oder die . entsprec
17    machen. A: Jaja, klar. B: Das is ja  eben   dat Doofe. ((Räuspern)) A: Sicher.
18   en, mit/mit Ananas drin, so un/ mußt  eben   Süßstoff nehmen, darfs kein/ keine
19     Wanzen kommen? A: Ja, das weiß ich  eben   auch nich! B: Meistens sitzen die
20   es aus B: mal alles aus, weil wer ja  eben   wissen müssen, was wer noch kaufen
21    ißig Parteien A: hmhm B: un wenn da  eben   größere, öh, Reparaturen notwendig
22    besichtigt, und so, ne. Bloß, es is  eben   / du kanns schlecht en Fenster aufm
23   ? A: An un für sich, ja, bloß, es is  eben,  . daß ich doch son bißchen / also
24   B: un Schulen, und, em, na, Bücherei  eben,  die Ärzte befinden sich alle in d
25   viduelle . Vergütung und ä, das wird  eben   dadurch erleichtert, daß es nur hal
26   ätten wer hinterher aufgegeben, weil  eben,  öh, sie mit ihrem Bekannten dann.
27     Ah so! B: da is eine Kiesgrube und  eben,  öh, A: ja B: sons nix, ne.. Un b
28  Na, ich ja im Grunde auch. Da ham wer  eben/  C: Ja, ich hab da bestimmt en unh
29   n, ich hatte aber . grad zu der Zeit  eben   keine Zeit, öh, "lernschwache Milli
30   Vielleicht wollt ich mir auch selbst  eben   bloß beweisen, ich kann alles, wen
31  das nur sagen. B: Hm C: Die wurde mir  eben   als etwas . unterkühlt vorbeigebrac
32   , verzichten müssen. Nein, ich hatte  eben   gedacht, ö, daß . er . offiziell .
In the vast majority of occurrences in modal particle function (26 of 32), eben collocates with a verb in
the present tense, as can be seen from the concordance data provided here. The verb forms which are not
included in the concordance lines shown here have been established by investigating larger stretches of
the respective dialogue.
In two instances (lines 1 and 5) there is a subjunctive form, and in only four instances (lines 29 to 32)
does eben in its modal particle function collocate with a verb form indicating past. The concordance data
show that eben as a modal particle occurs only in statements; there are no occurrences in interrogatives or
imperatives. This is in line with its meaning.
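Expressed as shares of the 32 modal particle occurrences, the verb-form counts reported above work out as in the following short Python sketch. The counts are those stated in the text; the dictionary and variable names are assumptions for illustration.

```python
# Verb-form counts for eben in modal particle function, as reported above.
verb_form_counts = {"present": 26, "subjunctive": 2, "past": 4}

total = sum(verb_form_counts.values())

# Percentage share of each verb form among all modal particle occurrences.
shares = {
    form: round(count / total * 100, 2)
    for form, count in verb_form_counts.items()
}

for form, share in shares.items():
    print(f"{form:<12} {verb_form_counts[form]:>2} of {total}  ({share}%)")
```

Roughly four in five occurrences thus combine with a present-tense verb, the mirror image of the past-tense pattern found for eben as an adverb of time.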
As a modal particle eben expresses "unchangeability," "unavoidability," or "irrevocable fact" as a detailed
analysis of the following instances will show. The most obvious examples of stating a "given fact" are
those where eben appears as part of an existential clause (e.g., Halliday, 1994), that is, where the main
verb is "sein" (to be):
22   besichtigt, und so, ne. Bloß, es is  eben   / du kanns schlecht en Fenster aufm
23  ? A: An un für sich, ja, bloß, es is  eben,  . daß ich doch son bißchen / also
Eben functions interpersonally, expressing that a fact is evident and undeniable.
There are two instances of relational clauses (Halliday, 1994) with sein occurring as dependent clauses
introduced by weil:
10  jetz von der Schule bringen,  weil  er          eben  zu autoritär is, ne, auf der Schule
14          ht möglich sein, ja,  weil  8.000 Mark  eben  viel Geld sind, dann, em, müßten w
Here, eben works in conjunction with weil to create the impression of uttering an irrevocable fact: The
relational clause is posited as a valid argument introduced by weil.
The following two excerpts exemplify this in the context of larger stretches of text:
Context: line 10
B: ..., un wenn se frech waren, oder irgendwie was nich
   richtig gemacht ham, mußten die vor die Klasse, oder aus
   der Klasse un in der Ecke stehn, un so, under muß so
   ungefähr ., öh, der hatte also sein erstes Referendarjahr ,
   alson ganz junger noch, ne.
   [...and when they were cheeky or somehow did something wrong, they had to stand in front of the
   class, or leave the classroom and stand in a corner, and the like, and he must roughly. , er, he was
   doing his first year of teaching, so one of the really young ones still, you know.]
A: Das gibt's gar nich!
   [You don't say!]
B: Hm, und . da . ham sich aber die ganzen, öh, öh, Eltern
   wahnsinnig beschwert, un wollen den jetz von der Schule
   bringen, weil er eben zu autoritär is, ne, auf der Schule
   jedenfalls.
   [Hm, and . then . all the, er, er, parents complained like mad,
   and now they want to get him out of the school, because he is
   simply too authoritarian, you know, at school at least]
A: Ah so!
   [I see!]
B: Das is also völlig unnormal, daß sich da einer so benehmen
   würde, erzählte die Y mir/
   [It's really not normal for somebody to behave like that, Y
   told me]
By using eben, speaker B stresses the unavoidability of the parents' actions: they had to act like they did,
because the teacher's behaviour lay outside of what is considered normal behaviour, an argument which is
expressed again explicitly in B's next turn.
Context: line 14
B: Ja, die Garage hat uns damals 8.000 Mark extra gekostet und
   sehr viel drunter wollten wer se auch nich verkaufen, ne.
   [Yes, the garage cost us an extra 8000 Mark then and we didn't want to sell it for much less, you
   know]
A: Ja. Is ja unverschämt, was die für Einstellplätze nehmen!
   [Yes. It's outrageous how much they charge for car spaces]
B: Ja, leider. Aber wir warn damals laut Vertrag an den . Kauf
   der Garage gebunden und müssen auch laut Vertrag die
   Garage auch mit verkaufen, wenn wer die Wohnung
   verkaufen. Ich mein, sollte das um alles in der Welt nicht
   möglich sein, ja, weil 8.000 Mark eben viel Geld sind, dann,
   em, müßten wer uns versuchen, da ne andere Lösung
   einfallen zu lassen
   [Yes, unfortunately. But at the time we were bound by the contract to buy the garage and according
   to the contract we also have to sell it when we sell the apartment. I mean, if that's not at all possible,
   yes, because 8000 Mark simply IS a lot of money, then, er, we would have to try to find some other
   solution]
Using eben, speaker B presents the proposition "8000 Mark simply is a lot of money" as an irrevocable
fact, common knowledge that is generally agreed upon. Within the larger argument "the garage may be
difficult to sell" the phrase containing eben is a supportive move, eben providing the necessary emphasis.
In a fairly large proportion of occurrences, eben as a modal particle collocates with modal verbs, namely
müssen (have to; lines 2, 3, 7, 13, 18), können (be able to; lines 9, 15), and wollen (want to; lines 26, 30):
2    ich. Und . da muß ich jetz am Montag  eben   . meinen Widerspruch begründen. A:
3       ne, bloß / bloß B: ah! A: muß sie  eben   für die Doktorarbeit muß sie das Ga
7   dagegen Widerspruch eingelegt und muß  eben   jetz vor's Amtsgericht. A: Ja, un
13    du tust Milch und Zucker rein, mußt  eben   auch wieder Süßstoff nehmen und au
18   en, mit/mit Ananas drin, so un/ mußt  eben   Süßstoff nehmen, darfs kein/ keine
9   möglichst, öh, viel von, ne, damit se  eben   auch sagen kann, das sind nich blo
15   n auch zu uns kommen, A. Ihr könnt .  eben   . einfach ma nur so vorbei / brauch
26   ätten wer hinterher aufgegeben, weil  eben,  öh, sie mit ihrem Bekannten dann.6
30   Vielleicht wollt ich mir auch selbst  eben   bloß beweisen, ich kann alles, wen
In collocation with a form of müssen, eben lends emphasis to the obligation of carrying out a particular
act. In these clauses, eben serves to express the unavoidability of the obligation as the following example
shows in more detail.
Context: line 3
A: Gut, wenn das dann alles ma fertig ist, les ich's dir ma vor!
   Wie sich das anhört. Die schreibt nämlich auch Dialekt un
   sowas genau wortwörtlich ab, da.
   [Right, when it's all ready at some stage, I'll read it to you. The way it sounds. You see, she also
   copies out the dialect and things like that word for word.]
B: Ja?
   [Does she?]
A: Ja, in ihrer Examensarbeit hatte se sowas ähnliches gemacht,
   ne, bloß / bloß
   [Yes, in her dissertation she did something similar, you know, but]
B: ah!
A: muß sie eben für die Doktorarbeit muß sie das Ganze en
   bißchen ausweiten, noch
   [for her doctoral thesis she'll simply have to expand the whole thing a bit, still]
B: Hmhm
A: un noch / noch mehr bringen, ne.
   [and produce some more, you know. ....]
Language Learning & Technology
143
Martina Mollering
Teaching German Modal Particles...
The English "simply" could here be expressed as "it's as simple as that," that is, no discussion about it is
necessary.
The results of the qualitative analysis carried out on eben can be summarised as follows:
Table 3. Summary of Results: EBEN
position in clause  grammatical co-occurrence    lexical collocation   category             meaning
initial             -                            -                     answering particle   "exactly"
central/final       indication of past           -                     adverb of time       "just" / "a moment ago"
central/final       -                            mal                   adverb of time       "quickly"
central/final       tendency for present tense   form of sein          modal particle       "simply" (irrevocable fact)
central/final       tendency for present tense   form of müssen        modal particle       "simply" (unavoidability of action)
Application of the Corpus-Based Analysis to Language Teaching
Over the past decade, corpus-based research has had an increasing influence on language
pedagogy, with regard to linguistic content as well as to teaching methodology (Kennedy, 1998). While
the majority of studies reporting on corpus-based teaching approaches refer to English (e.g., Biber,
Conrad, & Reppen, 1994; Conrad, 2000; Fligelstone, 1993; Wichmann et al., 1997), a number of studies
have discussed German (Dodd, 1997, 2000; Jones, 1997). In general terms, Leech (1997) distinguishes
between the direct use of corpora in teaching and the use of corpora indirectly applied to teaching.
Teaching about corpora, teaching the exploitation of corpora, and exploiting corpora to teach are said to
represent a direct use of corpora, whereas reference publishing, materials development, and language
testing are indirect applications (Leech, 1997, pp. 6-7). The approach proposed here is direct in that it
exploits the corpora of spoken German described above to arrive at relevant data. It is
indirect, though, in the sense that the concordance data are not compiled by the language learners
themselves but developed into worksheets that confront the learner with the task of distinguishing
particle meanings in context.
The adaptation of concordances for language teaching is described clearly by Tribble
and Jones (1990) for English in general and by Thurstun and Candlin (1997) for academic English. The
concordance-based creation of teaching materials presented here follows the approaches outlined in those
publications. Concordance data are used to help learners deduce the meaning of words in context
(Tribble & Jones, 1990, pp. 35ff.). How those teaching materials are structured and what type of
activities they encourage will depend on the learners' proficiency, learning styles, and so
forth, but the sample worksheet contained in the Appendix illustrates how the topic investigated here
could be approached. For less advanced learners, samples of larger stretches of dialogue could be provided
to aid understanding.
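The keyword-in-context display that underlies such worksheets can be approximated in a few lines of code. The following Python sketch is an illustration only and is not the concordancing software used in this study; the function names kwic and format_kwic, the window width, and the sample string (pieced together from fragments quoted in this article) are assumptions made for the example.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left, match, right) triples for each occurrence of keyword."""
    # \b keeps the search to whole words; IGNORECASE also finds "Eben".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


def format_kwic(rows, width=30):
    """Right-align the left context so that the keyword forms a column."""
    return [f"{left:>{width}}{kw}{right}" for left, kw, right in rows]


# Sample assembled from fragments quoted in the article (illustrative only).
sample = ("Und da muß ich jetz am Montag eben meinen Widerspruch begründen. "
          "A: Eben! B: Der Z war eben hier.")
concordance = kwic(sample, "eben", width=20)
for line in format_kwic(concordance, width=20):
    print(line)
```

Such a listing only retrieves the word form; deciding whether a given line shows eben as answering particle, adverb of time, or modal particle remains a manual step, as in the analysis above.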
CONCLUSION
The limited ranges of speech events which learners are exposed to in classroom discourse do not provide
enough input on modal particles to lead to an understanding of their meaning. An important factor in
teaching modal particles is therefore the exposure of learners to particles in various contexts and the
focusing of learners' attention on their meaning in those contexts. Corpus examples are extremely
effective as they expose learners to the type of language they will encounter in real communicative
situations (McEnery & Wilson, 1996, p. 120). Collocations, involving both grammar and lexis, have an
important place in language pedagogy as they can be identified empirically by the methodologies
developed in corpus analysis (Kennedy, 1998, p. 289). The quantitative analysis of the German corpora
described above has shown which particles occur most frequently in spoken German and are therefore
most salient for a learner of German. A manual disambiguation of particle meaning was carried out on
concordance data for the particle eben. Its meaning in modal particle function was differentiated from its
meanings in other functions, namely as answering particle and as adverb of time. The analysis of
real-language data unveiled structures, patterns and predictable features relating to the various usages of eben
and formed the basis for a sample worksheet for learners of German. Similar worksheets aimed at
intermediate to advanced learners of German will be developed for the more frequently occurring
particles ja, auch, aber, mal, doch, schon, denn, and nur. It is hoped that they will provide a useful
extension to the existing teaching materials on modal particles.
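A frequency count of the kind summarised above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure or software used for the corpora described in this article: the function names, the tokenisation by a simple regular expression, and the sample text are all hypothetical. A count of word forms also cannot separate modal particle function from other functions, which is why the manual disambiguation reported for eben was needed.

```python
import re
from collections import Counter

# Word forms named in the article. A raw count cannot tell modal-particle
# use apart from other functions (e.g. "eben" as an adverb of time).
PARTICLE_FORMS = ("ja", "auch", "aber", "mal", "doch", "schon", "denn", "nur", "eben")


def count_particle_forms(text, forms=PARTICLE_FORMS):
    """Return (counts, n_tokens) for the given word forms in a transcript."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = set(forms)
    counts = Counter(tok for tok in tokens if tok in wanted)
    return counts, len(tokens)


def per_thousand(count, n_tokens):
    """Normalise a raw count to occurrences per 1,000 tokens."""
    return 1000 * count / n_tokens if n_tokens else 0.0


# Invented sample text (hypothetical, not taken from the corpora).
sample = "Ja, das ist eben so. Komm doch mal vorbei, ja? Das ist ja schon gut."
counts, n_tokens = count_particle_forms(sample)
for form, n in counts.most_common():
    print(form, n, round(per_thousand(n, n_tokens), 1))
```

Normalising per 1,000 tokens makes counts comparable across corpora of different sizes, such as those listed in Note 3.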
APPENDIX
SAMPLE WORK SHEET: EBEN
1. The word EBEN has different meanings which depend on the context of use. Can you find out by
looking at the following groups of examples which of the translations given below best reflects the
meaning of EBEN in each group?
simply
a moment ago/just
exactly
quickly
group 1 _____________________
group 2 _____________________
group 3 _____________________
group 4 _____________________
GROUP 1
1 schöne Bücher, die man lesen kann. A: Eben! B: und so viele schöne Sachen, di
2 doppelte als die ganze Zeit, ne. B: Eben. Is auch schön. A: Undie Arbeit ma
3 alles, wozu man jetz nich kommt! B: Eben. Du, meine Mutter, die hatte ne gan
4 n die elf Kilo abgenommen hätte! D: Eben, ich denk, die is doch gar ni mehr
5 er Frau auch Frau Doktor Sounso. B: Eben, dann bisde ooch Herr Dokta! A: Ri
6 ort "werden" nich, anscheinend . B: Eben, siehsde, un, stimmt auch wirklich,
7 der Effekt ja oh ni mehr da! A: Ja, eben, genau! Vor allen Dingen, es geht j
8 , ich helf ihr, soweit's geht B: ja, eben A: und son bißchen Telefongespräch
9 uns ja nun wieder auch nich B: Nee, eben ( ) Hast du mal deinen Pullover aus
10 h kann Sie nur beglückwünschen. Nee, eben, das war, wie C, ich war der Meinun
GROUP 2
1 Ja, hörma, wat sachsde dazu, was ich eben . der A erzählt hab? B: Nee, was
2 n dann? . Meine Mutter hat dich zwar eben schon den ganzen Quark gefragt, abe
3 u, ich wollt dir nur sagen, der Z war eben hier, die Schreibmaschine is also
4 a, ich hab der A eben gesacht, daß se eben vorbeigebracht wurde B: Ach so! C:
5 . den Abend ruhig gestalten, die war eben einkaufen und mußte sich danach hi
6 onsequent wär, ich war zwar sachtich eben noch zu C, ich bin jetz noch stolz,
7 hen das über'n DeAEs, und der meinte eben, ja, ich solle auf jeden Fall nen
8 a, der X hat sich gewundert, weil ihr eben, als ihr ihn aus dem Auto ließt, g
9 acht? Mit der A? C: Ja, ich hab der A eben gesacht, daß se eben vorbeigebracht
GROUP 3
1 eben / B: Ja, Augenblick, ich hör ma eben, Frau A: Hm. B: Ja? ((Stimme im Hin
2 onntag oder bis Montag, Momentchen ma eben, ja? ((20s)) Ne, das is bis zum 9. A
3 rade, das könnt nich sein, Moment ma eben! (lacht) Ich gebn dir ma. D: Ja, Mom
4 r Messe! A: Ah! B: Da müßtich ma eben nachgucken, das is entweder nur bis
5 ame) C: (Straßenname)? Da muß ich ma eben nachguggen, nech. A: ja. ((59s)) C:
6 her ein Bier getrunken/ B: Moment ma eben! (zu ihrer Mutter) Ja, ich komm gleic
7 Ich mein, wenn der schon mal eben . dieses Knöllchen da ausgestellt ha
8 kommen? B: Ja, kommen Se morgen mal eben, ja? A: Is gut. Hm, danke. B: Ne? B
9 llt mir grade ein, kannst du mir mal eben mit kurzen Worten sagen, wie man ein
10 orz. B: Warte mal, kann ich noch mal eben sehen? Das is Porz, ja achthundertzw
11 Sie vielleicht freundlicherweise mal eben so durchrufen, wann der Herr U da Fr
12 ng an / einschalten, daß er dann mal eben so tickt, das hat ja nicht zu sagen,
GROUP 4
1 besichtigt, und so, ne. Bloß, es is eben / du kanns schlecht en Fenster aufm
2 A: An un für sich, ja, bloß, es is eben, . daß ich doch son bißchen / also
3 jetz von der Schule bringen, weil er eben zu autoritär is, ne, auf der Schule
4 ht möglich sein, ja, weil 8.000 Mark eben viel Geld sind, dann, em, müßten w
5 ich. Und . da muß ich jetz am Montag eben . meinen Widerspruch begründen. A:
6 ne, bloß / bloß B: ah! A: muß sie eben für die Doktorarbeit muß sie das Ga
7 dagegen Widerspruch eingelegt und muß eben jetz vor's Amtsgericht. A: Ja, un
8 du tust Milch und Zucker rein, mußt eben auch wieder Süßstoff nehmen und au
9 en, mit/mit Ananas drin, so un/ mußt eben Süßstoff nehmen, darfs kein/ keine
2. What is the position of EBEN in the clause? Please circle the correct answer.
group 1    initial position    middle/end position
group 2    initial position    middle/end position
group 3    initial position    middle/end position
group 4    initial position    middle/end position
3. Now look at group 2 again and identify the verb forms in the clauses with EBEN.
Write down the verb forms and their tenses.
line 1______________________
line 2______________________
line 3______________________
line 4______________________
line 5______________________
line 6______________________
line 7______________________
line 8______________________
line 9______________________
4. Which word appears in front of EBEN in group 3? ______________________
5. Examine group 4 again. Write down the verb forms.
line 1_____________________
line 2______________________
line 3_____________________
line 4______________________
line 5_____________________
line 6______________________
line 7_____________________
line 8______________________
line 9_____________________
Which two verbs do you find in these clauses?
verb 1: ___________________
verb 2: ___________________
Which tense is used in these clauses? _________________________
6. Please supply the appropriate translation for EBEN.
Position in clause  Reference to time  Collocation       Type of word        Translation
initial             -                  -                 answering particle  ____________
central/final       Past               -                 adverb of time      ____________
central/final       -                  MAL               adverb of time      ____________
central/final       Present            form of "sein"    modal particle      ____________
central/final       Present            form of "müssen"  modal particle      ____________
NOTES
1. Freiburger Korpus, Schulklassengespräch mit Günter Grass (FKO/XAM.00000); transcription has
been modified to facilitate reading comprehension.
2. The list represents the core particles considered to occur in modal particle function and is based on an
evaluation of a substantial part of the literature on modal particles (Helbig, 1994; Thurmair 1989;
Weydt, 1979, 1981, 1983, 1989).
3. DSK: 200,000 words; 70 texts; average length of text, 2857 words
FKO: 700,000 words; 220 texts; average length of text, 3182 words
PFE: 650,000 words; 386 texts; average length of text, 1684 words
BRO: 44,000 words; 35 texts; average length of text, 1257 words
4. These categories are based on an evaluation of the literature on eben in different function categories
(Hartmann, 1979; Helbig, 1994; Hentschel, 1986; Lütten, 1979; Thurmair, 1989; Trömel-Plötz 1979).
5. The analysis presented here is based on transcripts of spoken language and therefore does not refer to
phonological features of the data.
6. The text continues as follows: "...schon auf die Bekanntgabe der Ergebnisse warten wollte."
ACKNOWLEDGEMENTS
I would like to thank Nic Witton and three anonymous reviewers for their helpful comments on a
previous draft of this article.
ABOUT THE AUTHOR
Martina Möllering is Head of German Studies in the Department of European Languages at Macquarie
University, Sydney, Australia. She is involved in language teaching and teacher training in German as a
Foreign Language. Her research interests include the application of computers in language teaching,
particularly the use of corpora and on-line communication facilities.
E-mail: martina.mollering@mq.edu.au
REFERENCES
Abraham, W. (1991a). Discourse particles in German: How does their illocutionary force come about? In
W. Abraham (Ed.), Discourse particles. Amsterdam: Benjamins.
Abraham, W. (1991b). Modal particle research. The state of the art. Multilingua, 10, 1-2.
Aijmer, K. (1997). I think - an English modal particle. In T. Swan & O. Westvik (Eds.), Modality in
Germanic languages (pp. 1-47). Berlin: de Gruyter.
Berglund, Y. (1999). Exploiting a large spoken corpus: an end-user's way to the BNC. International
Journal of Corpus Linguistics, 4(1), 29-52.
Biber, D. (1988). Variation across speech and writing. Cambridge, UK: Cambridge University Press.
Biber, D., Conrad, S., & Reppen, R. (1994). Corpus-based approaches to issues in applied linguistics.
Applied Linguistics, 15, 169-189.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use.
Cambridge, UK: Cambridge University Press.
Brons-Albert, R. (1984). Gesprochenes Standarddeutsch: Telefondialoge [Spoken standard German:
Telephone conversations]. Tübingen, Germany: Günter Narr.
Bublitz, W. (1978). Ausdrucksweisen der Sprechereinstellung [Ways of expressing speaker attitude].
Tübingen, Germany: Niemeyer.
Busse, D. (1992). Partikeln im Unterricht Deutsch als Fremdsprache [Particles in teaching German as a
foreign language]. Muttersprache, 102(1), 37-59.
Cheon-Kostrzewa, B. J., & Kostrzewa, F. (1997a). Der Erwerb der deutschen Modalpartikeln. Ergebnisse
aus einer Longitudinalstudie (I) [The acquisition of German modal particles. Results from a longitudinal
study]. Deutsch als Fremdsprache, 2, 86-92.
Cheon-Kostrzewa, B. J., & Kostrzewa, F. (1997b). Der Erwerb der deutschen Modalpartikeln. Ergebnisse
aus einer Longitudinalstudie (II). Deutsch als Fremdsprache, 3, 150-155.
Conrad, S. (2000). Will corpus linguistics revolutionize grammar teaching in the 21st century? TESOL
Quarterly, 3, 548-560.
Dalmas, M. (1990). Partikelforschung "konkret" [Research into particles: “concrete”]. Deutsch als
Fremdsprache, 27(5), 285-289.
de Vriendt, S., Vandeweghe, W., & Van de Craen, P. (1991). Combinatorial aspects of modal particles in
Dutch. Multilingua, 10(1/2), 43-59.
Dodd, B. (Ed.). (2000). Working with German corpora. Birmingham, UK: Birmingham University Press.
Fairclough, N. (1995). Critical discourse analysis. New York: Longman.
Fligelstone, S. (1993). Some reflections on the question of teaching, from a corpus linguistics perspective.
ICAME Journal, 17, 97-109.
Franck, D. (1979). Abtönungspartikel und Interaktionsmanagement [Toning particles and the management
of an interaction]. In H. Weydt (Ed.), Die Partikeln der deutschen Sprache [The particles of the German
language] (pp. 3-13). Berlin: de Gruyter.
Halliday, M. A. K. (1994). An introduction to functional grammar (2nd ed.). New York: Arnold.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Harden, T., & Rösler, D. (1981). Partikeln und Emotionen - zwei vernachlässigte Aspekte des gesteuerten
Fremdsprachenerwerbs [Particles and emotions - two areas neglected in foreign language instruction]. In
H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the teaching of German] (pp. 67-80).
Heidelberg, Germany: Groos.
Hartmann, D. (1979). Syntaktische Funktionen der Partikeln eben, eigentlich, einfach, nämlich, ruhig,
vielleicht und wohl [Syntactic functions of the particles “eben,” “eigentlich,” “einfach,” “nämlich,”
“ruhig,” “vielleicht” and “wohl”]. Zur Grundlegung einer diachronischen Untersuchung von Satzpartikeln
im Deutschen. In H. Weydt (Ed.), Die Partikeln der deutschen Sprache [The particles of the German
language] (pp. 121-138). Berlin: de Gruyter.
Helbig, G. (1989). Die Partikeln - keine Wortklasse, eine Wortklasse oder mehrere Wortklassen? [The
particles - no word class, one word class or several word classes]. Germanistisches Jahrbuch DDR-UVR,
8, 194-209.
Helbig, G. (1994). Lexikon deutscher Partikeln (2nd ed.). Berlin: Langenscheidt.
Helbig, G., & Helbig, A. (1995). Deutsche Partikeln - richtig gebraucht? [German particles – used
correctly?]. Berlin: Langenscheidt.
Held, G. (1983). "Kommen Sie doch" oder "Venga pure." Bemerkungen zu den pragmatischen Partikeln
im Deutschen und Italienischen am Beispiel auffordernder Sprechakte [“Kommen Sie doch” or “Venga
pure”. Remarks on pragmatic particles in requests in German and Italian.]. In M. Dardano, W. V.
Dressler, & G. Held (Eds.), Parallela (pp. 316-336). Tübingen, Germany: Narr.
Hentschel, E. (1986). Funktion und Geschichte deutscher Partikeln. Ja, doch, halt und eben [The function
and history of German particles. “Ja,” “doch,” “halt” and “eben”]. Tübingen, Germany: Niemeyer.
House, J., & Kasper, G. (1981). Politeness markers in English and German. In F. Coulmas (Ed.),
Conversational routine: Explorations in standardized communication situations and prepatterned speech
(pp. 157-186). New York: Mouton.
Husso, A. (1981). Zum Gebrauch von Abtönungspartikeln bei Ausländern [On the use of toning particles
by non-native speakers]. In H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the teaching
of German] (pp. 81-99). Heidelberg, Germany: Groos.
Institut für deutsche Sprache. (1999). COSMAS.
Jones, R. (1997). Creating and using a corpus of spoken German. In A. Wichmann, S. Fligelstone, T.
McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 146-156). New York: Longman.
Kasper, G. (2000, March). Four perspectives on L2 pragmatic development. Revised version of a plenary
given at the annual AAAL conference, Vancouver.
Kasper, G., & Rose, K. (1999). Pragmatics and SLA. Annual Review of Applied Linguistics, 19, 81-104.
Kennedy, G. (1998). An introduction to corpus linguistics. New York: Longman.
Kriwonossow, A. (1977). Die modalen Partikeln in der deutschen Gegenwartssprache [Modal particles in
contemporary German]. Göppingen, Germany: Kümmerle.
Kutsch, S. (1985). Zur Entwicklung des deutschen Partikelsystems im ungesteuerten Zweitspracherwerb
ausländischer Kinder [On the development of the German particle system in children’s uninstructed
second language acquisition]. Deutsche Sprache, 3, 230-257.
Leech, G. (1997). Teaching and language corpora - a convergence. In A. Wichmann, S. Fligelstone, T.
McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 1-23). New York: Longman.
Lütten, J. (1979). Die Rolle der Partikeln doch, eben und ja als Konsensus-Konstitutiva in gesprochener
Sprache [The role of the particles “doch,” “eben” and “ja” in creating consensus in spoken language]. In
H. Weydt (Ed.), Die Partikeln der deutschen Sprache [The particles of the German language]
(pp. 30-38). Berlin: de Gruyter.
McCarthy, M. (1993). Spoken discourse markers in written text. In J. Sinclair, M. Hoey, & G. Fox (Eds.),
Techniques of description. Spoken and written discourse (pp. 170-182). New York: Routledge.
McCarthy, M., & Carter, R. (1994). Language as discourse. New York: Longman.
McEnery, T., & Wilson, A. (1996). Corpus linguistics. Edinburgh: Edinburgh University Press.
Möllering, M., & Nunan, D. (1995). Pragmatics in interlanguage: German modal particles. Applied
Language Learning, 6(1/2), 41-64.
Paneth, E. (1981). Partikeln im Unterricht - Erfahrungen mit englischen Studenten [Particles in teaching -
experiences with English students]. In H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the
teaching of German] (pp. 101-110). Heidelberg, Germany: Groos.
Rall, M. (1981). "¿Se puede ensenar la necesidad de emplear particulas intencionales?" Ein Experiment
mit spanischen Studenten [Is it possible to teach the necessity of using intentional particles? An
experiment with Spanish students]. In H. Weydt (Ed.), Partikeln und Deutschunterricht [Particles and the
teaching of German] (pp. 123-136). Heidelberg, Germany: Groos.
Rehbein, J. (1979). Sprechhandlungsaugmente. Zur Organisation der Hörersteuerung [Speech act
enhancers. On the organisation of hearer guidance]. In H. Weydt (Ed.), Die Partikeln der deutschen
Sprache [The particles of the German language] (pp. 58-74). Berlin: de Gruyter.
Rudolph, E. (1989). Partikeln in der Textorganisation [Particles in the organisation of a text]. In H. Weydt
(Ed.), Sprechen mit Partikeln [Particles in talk] (pp. 498-510). Berlin: de Gruyter.
Rudolph, E. (1991). Relationships between particle occurrence and text types. Multilingua, 10(1/2), 203-223.
Rutherford, W. (1987). Second language grammar: Learning and teaching. New York: Longman.
Schiffrin, D. (1987). Discourse markers. Cambridge, UK: Cambridge University Press.
Scott, M., & Johns, T. (1993). MicroConcord. Oxford, UK: Oxford University Press.
Sigley, R. (1997). Text categories and where you can stick them: A crude formality index. International
Journal of Corpus Linguistics, 2(2), 199-237.
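Sinclair, J. (1991). Corpus, concordance, collocation. Oxford, UK: Oxford University Press.
Steinmüller, U. (1981). Akzeptabilität und Verständlichkeit - Zum Partikelgebrauch von Ausländern
[Acceptability and comprehensibility - on the use of particles by non-native speakers]. In H. Weydt (Ed.),
Partikeln und Deutschunterricht [Particles and the teaching of German] (pp. 137-148). Heidelberg, Germany: Groos.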
Thurmair, M. (1989). Modalpartikeln und ihre Kombinationen [Modal particles and their combinations].
Tübingen, Germany: Niemeyer.
Thurstun, J., & Candlin, C. N. (1997). Exploring academic English: A workbook for student essay
writing. Sydney: National Centre for English Language Teaching and Research.
Tribble, C., & Jones, G. (1990). Concordances in the classroom. Harlow, UK: Longman.
Trömel-Plötz, S. (1979). "Männer sind eben so": Eine linguistische Beschreibung von Modalpartikeln
aufgezeigt an der Analyse von dt. eben und engl. just [“Männer sind eben so”: A linguistic description of
modal particles featuring the analysis of German “eben” and English “just”]. In H. Weydt (Ed.), Die
Partikeln der deutschen Sprache [The particles of the German language] (pp. 318-334). Berlin: de
Gruyter.
Vorderwülbecke, K. (1981). Progression, Semantisierung und Übungsformen der Abtönungspartikeln im
Unterricht Deutsch als Fremdsprache [Progression, semanticization and exercise forms for teaching
toning particles in German as a foreign language]. In H. Weydt (Ed.), Partikeln und Deutschunterricht
[Particles and the teaching of German] (pp. 149-160). Heidelberg, Germany: Groos.
Weydt, H. (1969). Abtönungspartikel. Die deutschen Modalwörter und ihre französischen Entsprechungen
[Toning particles. The German modal words and their French equivalents]. Bad Homburg, Germany:
Gehlen.
Weydt, H. (Ed.). (1979). Die Partikeln der deutschen Sprache [The particles of the German language].
Berlin: de Gruyter.
Weydt, H. (Ed.). (1981). Partikeln und Deutschunterricht [Particles and the teaching of German].
Heidelberg, Germany: Groos.
Weydt, H. (Ed.). (1983). Partikeln und Interaktion [Particles and interaction]. Tübingen, Germany:
Niemeyer.
Weydt, H. (Ed.). (1989). Sprechen mit Partikeln [Particles in talk]. Berlin: de Gruyter.
Wichmann, A., Fligelstone, S., McEnery, T., & Knowles, G. (Eds.). (1997). Teaching and language
corpora. New York: Longman.
Witton, N. (1994). Micro-Concord presented, reviewed and compared with the Mini-Concordancer.
On-Call, 8(2), 33-40.
Language Learning & Technology
http://llt.msu.edu/vol5num3/murphy/
September 2001, Vol. 5, Num. 3
pp. 152-173
THE EMERGENCE OF TEXTURE: AN ANALYSIS OF THE FUNCTIONS
OF THE NOMINAL DEMONSTRATIVES IN AN ENGLISH
INTERLANGUAGE CORPUS
Terry Murphy
Yonsei University, Seoul
ABSTRACT
This study uses the concept of "emergent texture" to analyze the corpus behavior of
the four nominal demonstratives -- this, that, these, and those -- in an interlanguage
corpus created at Yonsei University in the Fall of 1999. "Emergent texture" refers to
the manner in which interlanguage texts gradually develop their use and control of
the grammatical and semantic means used to establish textual cohesion. The corpus
consists of 109 single paragraphs. The concept of markedness is emphasized as a way of mediating the debate
over the issue of interlanguage development, linking this to the extensive description
of inter-sentential cohesive relations in Halliday and Hasan's 1976 study, Cohesion in
English. The investigation proper begins with the analysis of a single sample
paragraph of low-level interlanguage taken from the corpus in order to establish a
frame of reference for what follows. It then examines various aspects of
interlanguage cohesion within the corpus as a whole, including reiteration, synonyms
and near-synonyms, the behavior of the nominal group, and cataphoric reference.
The paper concludes with a discussion of future research possibilities in the area of
interlanguage cohesion.
THE INVESTIGATION OF SECOND LANGUAGE WRITING
The investigation of the written compositions of second language learners has been a central issue for
applied linguists since the mid-1960s. Although the school of contrastive rhetoric (the study of the
cross-cultural aspects of second language writing) remains highly influential, there has been a recent
and growing interest in using corpus analysis to understand this area of second language learning
(Beaugrande, 1997; Connor, 1996; Freedman, Pringle, & Yalden, 1979; Kaplan, 1966; Kroll, 1990).
One central concern for applied linguists interested in corpus analysis has been the problem of how to
measure the learner's growing second language sophistication (Laufer & Nation, 1998; Shaw & Liu,
1998). A majority of the applied linguists who have investigated this issue take some definition of
lexical richness to be central in any adequate account of measurement. In other words, when
approaching the issue of the development of second language writing, applied linguists draw a sharp
line between the categories of lexis and grammar in order to focus their attention on the
development of lexis. This decision is reflected in the fact that virtually all such measures, including
those used by the major available software, rely on the notion of a stable grammatical denominator
in their calculations of lexical richness. This paper marks a departure in suggesting that the concept
of "emergent texture," which offers itself as a measure of the development of interlanguage grammar
and semantics, may prove to be useful for analyzing some central aspects of the development of
second language writing. Utilizing the basic framework of the work of Halliday and Hasan on first
language textual cohesion, the present study demonstrates its usefulness in a detailed analysis of an
interlanguage corpus created at Yonsei University in Seoul, Korea.
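For readers unfamiliar with lexical-richness measures, the simplest member of the family is the type-token ratio, the number of distinct word forms divided by the number of running words. The Python sketch below is offered only as an illustration of that general idea; it is not one of the specific measures used in the studies cited above, and the function name, the tokenisation rule, and the sample sentence are assumptions made for the example.

```python
import re


def type_token_ratio(text):
    """Number of distinct word forms divided by the number of running words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented sample sentence (hypothetical, not from the Yonsei corpus).
sample = "The cat saw the dog and the dog saw the cat"
ttr = type_token_ratio(sample)
print(f"{ttr:.2f}")
```

A ratio of this kind counts word forms only and says nothing about how sentences are tied together, which is the aspect of development that the present study addresses through cohesion.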
Copyright 2001, ISSN 1094-3501
152
Terry Murphy
The Emergence of Texture: An Analysis of the Functions…
AIMS OF THE STUDY
This study situates itself within the emerging schools of corpus and textual linguistics. The research
was carried out on an interlanguage corpus created during the Fall 1999 semester, assembled from the
various genres of single paragraph compositions written by two undergraduate writing classes at
Yonsei University. Utilizing the basic framework of textual cohesion outlined in Halliday and Hasan's
Cohesion in English (1976), the study analyzes the manner in which certain basic grammatical units,
the nominal demonstratives, become progressively integrated into second language writing. An
underlying assumption of the study is that the concept of markedness, associated with functional
grammar and text linguistics, might be used to shed light on this process of integration (Greenberg,
1966; Halliday, 1994; Jakobson, 1957; Rutherford, 1982). The degree of interlanguage cohesion is a
useful measure of the writer's ability to make significant choices among grammatical and semantic
elements. The basic approach adopted here to the issue of interlanguage development is dialectical
and qualitative. In the words of Lucien Goldmann (1964),
the only possible starting-point for research lies in isolated abstract empirical facts;
the only valid criterion for deciding on the value of a critical method lies in the
possibility which each may offer of understanding these facts, of bringing out their
significance and the laws governing their development. … The advance of knowledge
is thus to be considered as a perpetual movement to and fro, from the whole to the
parts and back to the whole again, a movement in the course of which the whole and
the parts throw light upon one another. (p. 5)
An initially qualitative approach is necessary to avoid the risk of a probabilistic study flattening out
what is most distinctive about interlanguage: its existence as a series of snapshots, highlighting
uneven patterns of textual sophistication. Second language corpus analysis involves the investigation
of a whole series of texts and textual component parts at various stages of development. It is neither
possible nor immediately desirable in the study of interlanguage to attempt what Halliday (1992)
elsewhere rightly suggests ought to be the approach taken to the study of the first language:
"grammar [has] to be studied quantitatively, in probabilistic terms" (p. 61).
In spite of this large caveat, this analysis does attempt to make meaningful and potentially verifiable
statements regarding interlanguage. Moreover, it accepts that the true measure of second language
textual development is what is currently known about the whole of first language textual behavior,
including the massive advances in the accuracy of judgments about the English language associated
with the development of corpus linguistics in the 1990s. Nevertheless, while corpus linguistics has
demonstrated the falseness of many previously held intuitive judgments about language, this does not
mean that linguists are free to dismiss previous work merely because that work happens to predate
the era of corpus analysis. In the first place, it is possible to make a strong case for Cohesion in
English as a significant precursor of corpus linguistic work proper. This is because the work employs
actual texts in its analysis of texture, as might be expected from Halliday's commitment to the
quantitative and probabilistic study of grammar. More importantly, recent corpus analysis has served
to extend the previous work of Halliday and Hasan rather than undermine it, most notably in the
case of the nominal demonstratives themselves (McCarthy, 1994).1
The chief merit of using the theoretical framework set out in Cohesion in English in a corpus-based
analysis of second language texture, however, is the promise that this holds out for rapid progress in a
new area of research. Naturally, if the empirical results obtained through a corpus analysis begin to
diverge widely from the work of Halliday and Hasan, this framework will need modifying or
replacing. Until such time, however, it seems safer to employ a widely known framework than to
attempt to devise a new one in the course of ongoing second language research. The study of second
language development presents such a variety of other complications that it seems wise to reduce the
linguistic difficulties where this is possible. Finally, the use of this theoretical framework has the
additional merit of encouraging contributions from other scholars, particularly those already working
in the fields of functional and textual linguistics.
EMERGENT TEXTURE AND MARKEDNESS
The concept of "emergent texture" refers to the manner in which interlanguage texts gradually
extend their use and control of the grammatical and semantic means used to establish textual
cohesion. The development of interlanguage texture encompasses the broad range of textual devices
for achieving cohesion, including the use of reiteration, synonyms and near-synonyms, the behavior
of the nominal group, and cataphoric reference. The study attempts to account for these emergent
textual patterns in terms of the concept of markedness. It argues that the concept of markedness
helps to explain why the interlanguage texts examined in this study develop in the manner they do.
Growing interlanguage textual sophistication is a function of the increased ability of the second
language learner to experiment with the marked members of sets. In other words, the emergent
texture of interlanguage texts becomes richer because of the increasing ability of the writer to make
marked, as opposed to unmarked, grammatical and semantic choices. For example, low-level
interlanguage texts tend to achieve nominal demonstrative cohesion almost exclusively by means of
the use of the definite and indefinite articles. In contrast, more sophisticated interlanguage texts
deploy a much wider range of nominal demonstrative reference. The study argues that the concept of
emergent texture has potential in the analysis of the wider variety of grammatical, semantic, and
lexical elements involved in the achievement of cohesion. With this in mind, the paper concludes
with a discussion of some possible areas for the future investigation of emergent texture in
interlanguage development.
A brief discussion of the history of markedness as a linguistic concept will serve to secure its
legitimacy for the analysis of corpus texts, including interlanguage ones. Markedness was first utilized
by N. S. Trubetzkoy of the Prague Linguistic Circle in his phonological analysis of the neutralization
of distinctive opposites in Grundzüge der Phonologie (1939) (Greenberg, 1966, p. 11). Phonological
neutralization is the process in which distinctive phonemes in given environments lose their
distinctiveness, resulting in the regular appearance of the one unmarked phoneme. Trubetzkoy was
the first linguist to note that in phonemic pairs differing only in a single feature of the same
category, such as voiced or unvoiced, aspirated or unaspirated phonemes, it was the unmarked
phoneme that regularly appears in neutralized environments. In other words, there is a hierarchical
relation between the two members of the opposition (Waugh, 1976, p. 89). For example, it is always the
unvoiced obstruent phoneme that occurs in final word or sentence position in German. Similarly, in
classical Sanskrit, when the opposition between aspirated and unaspirated stops in sentence final
position is neutralized, the unaspirated phoneme appears (Greenberg, 1966, p. 13). In German,
therefore, it is the unvoiced phoneme that is unmarked; in Sanskrit, it is the unvoiced and unaspirated
phonemes. Generally speaking, the quality of being unmarked is associated with the absence of a
given feature, while markedness is associated with the presence of that same feature.
Roman Jakobson later extended the idea of markedness to the study of grammatical categories and
semantics, drawing a basic distinction between phonological distinctive features and
lexicogrammatical conceptual features (Waugh, 1976, pp. 89-100). In a study published in 1957, he
attempted a general definition of markedness, which allowed for the incorporation of the various
levels of phonology, grammar and semantics.
The general meaning of a marked category states the presence of a certain property
A; the general meaning of the corresponding unmarked category states nothing about
the presence of A and is used chiefly but not exclusively to indicate the absence of A.
(quoted in Greenberg, 1966, p. 25)
Jakobson's definition succeeded in substantially widening the concept of markedness beyond the realm
of phonological analysis. It also allowed for the analysis of cases where more than one type of
markedness functioned simultaneously. A good example of this phenomenon is the simultaneous
operation of morphological and semantic unmarkedness in the word actor. In certain environments,
actor is to actress as "male thespian" is to "female thespian." However, actor is the semantically
unmarked of the two terms since only actor may be predicated of both male and female thespians.
Actress is neutralized by the term actor in given environments because actress can only refer to
female thespians. Actress is morphologically the more complex of the two terms, requiring the
addition of an extra morpheme. Actor is therefore also the unmarked morphological term. More
broadly, in the terms provided by Jakobson's definition, actress indicates the presence of femaleness,
while actor may be used indiscriminately in a majority of instances to refer to thespians regardless of
gender (Clark & Clark, 1978, p. 231; Greenberg, 1966, pp. 26-27). In the series of scholarly
conversations conducted with his wife, Krystyna Pomorska, first published in French in 1980 and
later translated into English as Dialogues (1983), Jakobson returned once again to this concept of
markedness, suggesting:
The conception of binary opposition at any level of the linguistic system as a
relation between a mark and the absence of this mark carries to its logical conclusion
the idea that a hierarchical order underlies the entire linguistic system in all its
ramifications. … On the phonological level, the position of the marked term in any
given opposition is determined by the relation of this opposition to the other
oppositions in the phonological system -- in other words, to the distinctive features
that are either simultaneously or temporally contiguous. In grammatical oppositions,
however, the distinction between marked and unmarked terms lies in the area of the
general meaning of each of the juxtaposed forms. The general meaning of the marked
term is characterized by the conveyance of more precise, specific and additional
information than the unmarked term. (p. 97)
The close relationship between the notion of markedness in both grammar and lexicon offers a
certain degree of assurance that it is the same phenomenon under investigation in both cases.
William Rutherford's 1982 essay, "Markedness in Second Language Acquisition," represented an
important attempt to extend the concept of markedness to the field of second language acquisition.
Although he was interested in attempting to use the concept of markedness "to elucidate essentially
two separate aspects of second language acquisition: transfer … and order of acquisition" (Rutherford,
1982, p. 98), it is only the second aspect that concerns the present study. Rutherford makes the
general case for the importance of markedness for interlanguage development in the following way:
There seems to be a lot of interlanguage data that -- whatever the original purpose of
their elicitation -- reveal a tendency on the part of all learners to impose on the
target language a certain structural clarity, transparency, or … explicitness. Such a
tendency can be adduced by the learner's preference for coordination over
subordination, by the retention of pronominal reflexes in relative clauses, and by the
apparent preference (at least in English) for constructions in which raising has not
taken place over those "equivalent" expressions in which it has. (pp. 98-99)
In his essay, Rutherford (1982) went on to suggest the importance of considering "the discourse
function of syntactic constructions" in any use of markedness in studies of interlanguage
development (p. 101). This suggestion is important because it is necessary to distinguish among
choices that are motivated by the constraints of text or discourse development and those that are
genuinely instances of interlanguage limitation. Rutherford's essay suggested in conclusion that there
was a need to use markedness theory to move beyond "the distributional characteristics of the
exponents of formal syntax [to achieve] a greater understanding of more complex language" (p.
103).
The central problem with Rutherford's subsequent study, Second Language Grammar: Learning and
Teaching (1987), is that it equivocates on the use of natural format data in order to achieve this
greater understanding. According to Rutherford, consciousness-raising, which is one of the main
themes of the book, takes place at a point between two extremes. These two "extremes" are "the
natural appearance of a grammatical phenomenon in 'authentic' text on the one hand and its
contextless explicit formulation on the other" (p. 153). In other words, Rutherford's earlier
insistence on the use of markedness theory has been compromised (p. 103). The central concern
with consciousness-raising, which was absent from the 1982 essay, implies a renewed commitment to
what Robert de Beaugrande calls "the rewriting of natural language as formal notation" (Beaugrande,
1997, p. 41). Shorn of any theory of language in which to embed markedness theory, Rutherford
abandoned the attempt to use the concept as a means to analyze interlanguage text and discourse
(personal communication, March 2, 2000).
This paper then is an attempt to complete the unfinished work of Rutherford's 1982 essay. It
attempts to do this by embracing the functional linguistic concept of markedness of Halliday and his
associates within a project committed to the investigation of actual interlanguage corpora. In this
way, it may be possible to achieve that "greater understanding of more complex language" promised
in Rutherford's essay, by means of an analysis of the function of the nominal demonstratives in the
emergence of texture.
THE CONCEPT OF TEXTURE
Interlanguage texts exhibit only an elementary or emergent texture because of the underdevelopment
of the system of directives for creating textual cohesion. Emergent texture is also therefore a
measure of the capacity of a given interlanguage text to function as a textual unity. According to
Halliday and Hasan, "A text has texture, and this is what distinguishes it from something that is not a
text. It derives this texture from the fact that it functions as a unity with respect to its environment"
(1976, p. 2). In the sense of the term put forward by Halliday and Hasan, the texts of second
language learners offer varying degrees of texture, ranging from those produced with virtually no
consideration given to the relationship among sentences or particular stretches of text to those
which are barely distinguishable from texts produced by native writers. Another way of putting this is
to say that low-level interlanguage texts are distinguished by their relative lack of cohesion; low-level
interlanguage texts demonstrate a limited range of facility and concern with the significant relations
among cohesive ties within the text. As Halliday and Hasan note,
"Cohesion" is defined as the set of possibilities that exist in the language for making
text hang together: the potential that the speaker or writer has at his disposal. …
Thus, cohesion as a process always involves one item pointing to another; whereas
the significant property of the cohesive relation … is the fact that one item provides
the source for the interpretation of another. (p. 19)
Cohesion within a text is established by means of the presence of the five major categories of
cohesive ties: ties of reference, substitution, ellipsis, conjunction, and lexis (Halliday & Hasan, 1976,
p. 4). The class of reference ties functions as directives indicating that information is to be retrieved
from elsewhere.
Demonstrative reference is reference by means of location. The writer locates this type of reference
along a scale of proximity. This scale is defined in terms of the selective participation and
circumstances that define the textual occasion (Halliday & Hasan, 1976, p. 37). Demonstrative
reference is therefore distinguished from both personal reference and comparative reference.
Personal reference is defined by its function in the speech situation; comparative reference is a form
of indirect reference that is established by means of identity (p. 31). The eight demonstratives that
together constitute the grammatical means for establishing demonstrative reference may be divided
into two basic sets. The more important of the two sets is the one that selectively locates the text
with respect to participant and number: this, that, these, those. The other set, which locates the text
with respect to time and place, is less significant: here, there, now, then. The major grammatical unit
for analysis for the investigation of this first set of demonstratives is the nominal group. As Halliday
and Hasan point out,
What distinguishes reference from other types of cohesion…is that [it] is
overwhelmingly nominal in character. With the exception of the demonstratives,
here, there, now, and then, and some comparative adverbs, all reference items are
found within the nominal group. (p. 43)
It may well be the case that the second set of demonstratives plays the greater, or at least a
significantly more prominent, role in the formation of cohesion in spoken and extemporaneous
texts. However, interlanguage composition at the university level approximates the stereotypical
model of writing outlined by Douglas Biber: students aim to create texts that are structurally
complex, unified, abstract, and free from most forms of situation-dependent reference (Biber, 1988,
p. 37). The nominal demonstratives alone will be the focus of this corpus investigation of the
emergence of cohesion and texture.
THE NOMINAL DEMONSTRATIVES AND EMERGENT TEXTURE
The basic hypothesis of this study is that interlanguage textual development is revealed in an
increasingly sophisticated deployment of the nominal demonstratives. Briefly put, the absence or
presence of the four nominal demonstratives in a given interlanguage text is a central indicator of its
emergent texture. Patterns of interlanguage cohesive development ought to be consistent with what
is known about the complexities involved in the formation of texture. The division of labor among
the nominal demonstratives in Standard English is somewhat unusual. As Halliday elaborates in the
second edition of An Introduction to Functional Grammar (1994):
Given just two demonstratives, this and that, it is usual for that to be more inclusive; it
tends to become the unmarked member of the pair. This happened in English; and in
the process a new demonstrative evolved which took over and extended the
'unmarked' feature of that – leaving this and that once more fairly evenly matched.
This is the so-called 'definite article' the. (p. 314)
In other words, the relations among the four nominal demonstratives are made somewhat complex in
Standard English by the evolution of the lexical item, the. In addition, there is a distinction to make
between the unmarked demonstrative when functioning as a Head and when functioning as a Deictic.
Historically, in fact, both it and the are reduced forms of that; and, although it now
operates in the system of personals, both can be explained as being the 'neutral' or
non-selective type of the nominal demonstratives – as essentially one and the same
element, which takes the form it when functioning as Head and the when functioning
as Deictic. (Halliday and Hasan, p. 58)
What this implies is that low-level interlanguage texts will rely heavily on the use of the definite
article to establish cohesion. The cohesion of low-level interlanguage texts will mostly take the
form of strings of anaphorically referenced lexical items introduced by the Deictic the, with further
cohesion provided by the use of it as a Head. The four marked nominal demonstratives therefore will
be conspicuous mostly by their absence.
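The expectation stated above can be checked mechanically. The following Python sketch, offered as an illustration rather than as the procedure used in this study, tallies the surface forms of the two unmarked items (the, it) and the four marked nominal demonstratives in a text. The function name demonstrative_profile and the three-sentence sample are hypothetical. A raw count of this kind cannot separate demonstrative that from relative or complementizer that, so the figures would still need checking by hand.

```python
import re
from collections import Counter

# Unmarked (non-selective) items and marked (selective) nominal demonstratives.
UNMARKED = ("the", "it")
MARKED = ("this", "that", "these", "those")


def demonstrative_profile(text):
    """Count surface occurrences of each form, ignoring case.

    Tokens are maximal runs of letters, so a contraction such as "it's"
    contributes one "it". No grammatical disambiguation is attempted.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {form: counts[form] for form in UNMARKED + MARKED}


# Hypothetical sample, invented for illustration only.
sample = ("I opened the door. The room was dark and it was cold. "
          "This place is strange.")
profile = demonstrative_profile(sample)
print(profile)
```

A profile dominated by the and it, with zeros for the four marked forms, would be consistent with the low-level pattern described above.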
In an article building on the work of Halliday and Hasan and other linguists who have examined the
functioning of it, this and that in Standard English, Michael McCarthy has suggested a slight
refinement of this basic scheme. McCarthy's work is particularly useful since it bases its conclusions
on the analysis of a large sample of genuine texts. According to McCarthy:
1. It is used for unmarked reference within a current entity or focus of attention.
2. This signals a shift of entity or focus of attention to a new focus.
3. That refers across from the current focus to entities or foci that are non-current, non-central,
marginalizable or other-attributed. (McCarthy, 1994, p. 275)
The table reproduced below helps explicate the distinction between unmarked, or non-selective, and
marked reference among the nominal demonstratives. In its stark division between non-selective and
selective reference, the table, which has been modified slightly from that presented
in Halliday and Hasan's book to stress the primacy of non-selection, highlights why the definite
article tends to predominate in low-level interlanguage texts:
Table 1. Demonstrative Reference (modified presentation from Halliday & Hasan, p. 38)

Semantic category      Non-selective   Selective
Grammatical function   Modifier        Modifier / Head   Adjunct
Class                  Determiner      Determiner        Adverb
Proximity:
  near                                 this  these       here   now
  far                                  that  those       there  then
  neutral              the
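For readers who wish to tag demonstratives in a corpus, the information in Table 1 can be held in a small lookup structure. The Python sketch that follows is one possible encoding, not part of the original analysis. The names DemonstrativeEntry, TABLE_1, and is_selective are assumptions introduced for illustration, and the assignment of now to "near" and then to "far" follows the layout of Halliday and Hasan's table.

```python
from typing import NamedTuple


class DemonstrativeEntry(NamedTuple):
    semantic_category: str  # "non-selective" or "selective"
    function: str           # grammatical function in Table 1
    word_class: str         # "determiner" or "adverb"
    proximity: str          # "near", "far", or "neutral"


# One row per item in Table 1 (illustrative encoding).
TABLE_1 = {
    "the":   DemonstrativeEntry("non-selective", "Modifier", "determiner", "neutral"),
    "this":  DemonstrativeEntry("selective", "Modifier/Head", "determiner", "near"),
    "these": DemonstrativeEntry("selective", "Modifier/Head", "determiner", "near"),
    "that":  DemonstrativeEntry("selective", "Modifier/Head", "determiner", "far"),
    "those": DemonstrativeEntry("selective", "Modifier/Head", "determiner", "far"),
    "here":  DemonstrativeEntry("selective", "Adjunct", "adverb", "near"),
    "now":   DemonstrativeEntry("selective", "Adjunct", "adverb", "near"),
    "there": DemonstrativeEntry("selective", "Adjunct", "adverb", "far"),
    "then":  DemonstrativeEntry("selective", "Adjunct", "adverb", "far"),
}


def is_selective(item: str) -> bool:
    """Return True if the item belongs to the selective (marked) column."""
    return TABLE_1[item.lower()].semantic_category == "selective"


print(is_selective("this"), is_selective("the"))
```

Keeping proximity and grammatical function as separate fields makes it straightforward to filter, for instance, only the four nominal (determiner) demonstratives that are the focus of this study.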
Low-level interlanguage texts possess only the most rudimentary system for specifying and
identifying chains of lexical items in the text, nothing more. In comparison with these two uses of
the unmarked demonstratives, each of the four forms this, that, these, and those is marked. In other
words, the theory of markedness can furnish an explanation for why low-level interlanguage texts
tend to eschew the use of the nominal demonstratives. In turn, this helps to explain the fact that
low-level interlanguage texts possess only emergent texture, the upshot of their unsophisticated
deployment of the devices for achieving suitable levels of cohesion.
The basic distinction in the deployment of the marked demonstratives is in relation to the point of
view of the writer of the text. Within the text, this is used to make anaphoric reference to something
that has just been mentioned by the writer or that is in some other way being taken as "near." The
singular demonstrative that is used anaphorically to indicate something that is being taken as "far"
from the writer's point of view (Halliday, 1994, pp. 314-315). Similarly, the nominal
demonstratives, these and those, differentiate between proximate and remote plural reference from
the point of view of the writer. Since "pro-forms save processing time by being shorter than the
expressions they replace," the greater frequency of the four demonstratives in a given interlanguage
text is usually associated with the writer's ability to create efficient texts (Beaugrande & Dressler,
1981, p. 64). The marked nominal demonstratives are thus important in establishing the coherence
or structure of a mature interlanguage text.
The distinction that Halliday and Hasan (1976) make in relation to the unmarked nominal
demonstratives the and it also applies to the marked nominal demonstratives this, that, these, and
those. In general, the demonstrative this will occur as a Modifier in sentences such as this tree is an
oak or as Head in sentences such as this is an oak. In low-level interlanguage texts, the presence of
this as a Modifier ought to set definite restrictions on the lexical sophistication of the nominal group
to which it belongs. One mark of interlanguage textual development is observed in the gradual
elaboration of the linguistic environment in which this is discovered functioning as a Modifier.
Nevertheless, the principal function of this in what appears to be a majority of extended English
language texts is as an indicator of extended reference (Halliday & Hasan, 1976, p. 66). If Halliday
and Hasan are right, growing interlanguage sophistication will be revealed in the gradual reorientation
of the demonstrative this away from its function as simple Modifier or more elaborate Head
toward its use as an indicator of extended reference within the text. In other words, this will occur
more frequently and in a wider variety of contexts in more sophisticated interlanguage texts. Its use
will gradually extend to the introduction of nominal groups used to refer to segments of texts as
linguistic acts in their own right. Sophisticated interlanguage texts will include nominal groups with
the marked demonstratives, the textual function of which is to serve as "labels for stages of an argument,
developed in and through the discourse itself as the writer presents and assesses his/her own
propositions and those of other sources" (Francis, 1994, p. 83).
Anaphoric reference tends to predominate in the interlanguage texts exhibiting the least cohesion.
One upshot of this is that sophisticated interlanguage texts will exhibit less imbalance in their ratios
of anaphoric and cataphoric reference. In other words, examples of cataphoric cohesion, which may
involve the use of either this or here, will begin to emerge at higher levels of interlanguage
development. In all likelihood, however, the emergence of cataphoric reference will consist largely in
instances of what Halliday and Hasan refer to as grammatical cataphoric reference. In other words,
the majority of instances of structural cataphora -- "the simple realization of a grammatical
relationship within the nominal group" -- will be non-cohesive, even in high-level interlanguage texts
(Halliday & Hasan, 1976, p. 68). Though highly revealing as examples of collocational fluency,
structural cataphora is not an example of a cohesive tie and does not enter into the formation of
texture. In contrast, examples of genuine cataphoric reference, though occurring with relative
infrequency, may be evidence for the relative sophistication of a given sample of interlanguage.
There is a necessary caveat, however. Particular genres appear to offer different possibilities for
actualizing lexical and grammatical arrangements. Process paragraphs, for instance, are an obvious
example of a paragraph genre that allows for the actualization of genuine cataphoric reference. In
this sense, it may prove more useful to analyze sub-corpora of particular genres in an effort to isolate
more quickly the difference between texts with developed texture and those that employ
compensatory strategies for achieving more limited forms of cohesion.
METHODS AND MATERIALS
The interlanguage corpus for this research project was created over the course of the Fall 1999
semester by the students enrolled in my Writing and Beginning Composition classes at Yonsei
University. During the first 2 weeks of the new term, diskettes were distributed to all the students
who had enrolled. The students were advised that the work that they would submit during the course
of the semester would subsequently form part of an interlanguage corpus. They were told to submit all
their work on the diskette, together with paper copies of the initial drafts of each assignment, for
collection on scheduled dates throughout the semester. By the end of the semester, 109 single
paragraphs had been collected from the students. In terms of genre representation, the corpus
consists of 38 samples of illustration, 27 samples of description, 18 samples of comparison/contrast,
11 samples of process, and 11 samples of persuasion. Sample titles from the illustration genre,
together with the word count, are as follows:
"My Family's Three Values" (253 words),
"Painful Experience Often Teaches Valuable Lessons" (301),
"Personality Through Clothes" (136),
"About My Mother I Most Admire and Love" (383),
"Buddha as a Real Egalitarian" (336), and
"Kim Ku, The Only Politician Whom I Admire" (296).
The description paragraphs include
"A Blue Man on the Rainy Day" (280 words),
"A Possession I Value" (272),
"My Crowded but Comfortable Room" (318),
"An interesting person" (245), and
"My Favorite Bar or Restaurant" (257).
The comparison and contrast paragraphs include
"My Best Friends Eun Lang and Hae Won: the N and S Poles of a Magnet" (307),
"My Personality: in Childhood and as a University Student" (285),
"The Real Face of University Life" (226),
"My Two Completely Opposite Friends" (295), and
"The Movies; The Christmas in August and A Letter" (233).
The process paragraphs include the titles
"How To Appear More Intelligent Than You Are" (235),
"How to Break up with Your Boy Friend" (434),
"How to Break Up With Your Girl Friend" (349), and
"How to Care a Hangover" (332).
Finally, the persuasion paragraphs include
"Suh Kap-sook, a Case for Censorship?" (231),
"Globalization: Ideology or Reality" (447),
"The Brain Korea 21" (410), and
"Views on the Millennium -- Korean Economy" (378).
The total running length of the corpus is 31,641 words. The paragraphs vary in length from a low of
123 words in the case of "Pablo Picasso" to a high of 603 words for "My Favorite Coffee Shops I
Highly Recommend." The paragraphs in the corpus written by the Writing class students were all
completed by the time of the mid-term examinations. These sets of three paragraphs cover a range
of basic paragraph genres including description, illustration, and comparison/contrast. The paragraphs
in the corpus written by the Beginning Composition students include all five written assignments
required for the course of the semester. The paragraphs include the genres of description, illustration,
process, and persuasion.
A SAMPLE OF LOW-LEVEL INTERLANGUAGE WRITING
It is useful to prelude an extensive analysis of the corpus with an examination of a representative
corpus sample of low-level interlanguage composition. By examining the elementary cohesive ties
within this type of composition, it will become clearer what aspects of the English cohesive system
are subject to development. The following paragraph, entitled "My Favorite Bar or Restaurant," was
written by a first-year male student in the Writing class as fulfillment of the requirement for a
descriptive paragraph. Shinchon is the area of bars and restaurants frequented by students
immediately around the front gate of Yonsei University:
1. There are a lot of places in Shinchon I often go. "Backstage" is my favorite bar
where we can enjoy music videos on screen with kinds of drink. I will introduce here
to you. Descending steep stairs, you can see a filthy door to which diverse ad-posters
attach. After opening the door, the air in the inner part is thick with tobacco smoke
and it is too dark for here to see the front for a while. To the left of the door, a well-lighted
counter is opposite to two large pillar stuck to posters of famous rock bands.
The counter is filled with many video tapes and various beverages. In front of the
door or the counter, sofas are put from left to right facing a large screen which
displays all sorts of rock music clips. On the screen you can see several genre clips
that are from USA, Japan, Europe and even the Third World. Near the screen stand
four huge speakers somewhat-broken by careless persons. In several places, there are
some TV that is for those who are far from main screen or want to appreciate video
clips in detail. Around the wall adhere some pictures, posters and scribbles on the base
of grotesque wallpapers. You can feel this place so strange if you are not accustomed
to dark atmosphere or rock music. However, Backstage will be your best friend if you
pay attention to rock or be your eccentric fellow if you have an eye for the unknown
world.
There are a number of points that can be made about the texture of this particular paragraph. The
most basic point is this: at levels of development represented by texts like this, interlanguage texts
rely almost exclusively on the neutral non-selective the to establish textual cohesion. The second
point is that interlanguage texts at this level of competence reveal a definitely limited capacity for
lexical reiteration. According to Halliday and Hasan (1976),
reiteration is a form of lexical cohesion which involves the repetition of a lexical
item, at one end of the scale; the use of a general word to refer back to a lexical item,
at the other end of the scale; and a number of things in between -- the use of a
synonym, near-synonym, or superordinate. (p. 278)
Reiteration in texts such as "My Favorite Bar or Restaurant" takes place almost exclusively at the
end of the scale marked out by repetition. In other words, reiteration as a form of lexical cohesion in
interlanguage texts like this involves simple lexical repetition and the neutral non-selective use of
the definite article as an anaphoric device. In addition, there is only one use of the word it as a Head:
"After opening the door, the air in the inner part is thick with tobacco smoke and it is too dark for
here to see the front for a while." This sentence is an example of it as a relational attributive Head, a
form of non-cohesive grammatical cataphora (Halliday, 1994, p. 143). In this clause, it could be
replaced as Subject by the circumstantial demonstrative, here. There is one other use of the
circumstantial demonstrative in the composition: I will introduce here to you. Both of these citations
precede the subsequent use of the marked nominal demonstrative, this place. There is thus only a
single citation for this. What is more, this citation is in relation to the subject of the description
itself and does not occur until the penultimate sentence: "You can feel this place so strange if you are
not accustomed to dark atmosphere or rock music." The nominal group this place is an example of
what Francis has called a "retrospective label," one that "serves to encapsulate or package a stretch
of discourse" (Francis, 1994, p. 85). As Francis suggests, the central defining quality of a
retrospective label is that "there is no single nominal group to which it refers: It is not a repetition or
a 'synonym' of any preceding element. Instead, it is presented as equivalent to the clause or clauses
it replaces, while naming them for the first time" (Francis, p. 85). It is a working hypothesis that the
first labels to emerge in low-level interlanguage texts are retrospective labels that encapsulate the
meaning of the entire text itself, echoing, if they echo anything at all, the title of the composition.
At a higher level of interlanguage, advance labels, in which "the label precedes its lexicalization"
(Francis, p. 83), will start to emerge. Once again, it might be expected that advance labels would be
used in the first place to indicate the purpose of the entire text. However, the distinction in single
paragraphs between advance and retrospective labels that encapsulate the meaning of entire texts and
those that encapsulate only a portion of them is fuzzy. In order to demonstrate the correctness or
otherwise of these more or less intuitive judgments, a contrastive analysis of the extent of cohesion
and labeling in a corpus of five-paragraph essays will be necessary.
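The citations discussed in this section are the kind of evidence that a concordancer retrieves. As a minimal sketch, assuming nothing about the software actually used for the study, the hypothetical Python function kwic returns each occurrence of a keyword together with a window of context on either side. The excerpt consists of two sentences quoted from the sample paragraph.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left context, match, right context) for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


# Two sentences taken from "My Favorite Bar or Restaurant".
excerpt = ("I will introduce here to you. "
           "You can feel this place so strange if you are not accustomed "
           "to dark atmosphere or rock music.")

for left, key, right in kwic(excerpt, "this"):
    print(f"{left:>30} [{key}] {right}")
```

Because the pattern matches whole words only, a search for this does not return these, and a search for here does not return there or where.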
"My Favorite Bar or Restaurant" contains 15 instances of the use of the non-selective definite article
the. Among this total there are 7 instances of specific anaphoric reference back to a previously
introduced noun: a well-lit counter, a large screen, and a filthy door. Moreover, in each one of these
7 instances, the repetition takes the simplest form of unadorned Modifier and Head. In other words,
no premodifying elements are realized; and the Head is a form of cohesion achieved through
repetition rather than lexical modification. In addition, a large number of references are explained by
the content-free status of the definite article. As Halliday and Hasan write, "the definite article …
merely indicates that the item in question IS specific and identifiable; that somewhere the
information necessary for identifying it is recoverable" (1976, p. 71). The fact that this text offers a
description of the interior of a favorite bar in Shinchon serves to explain the references to the air in
the inner part, the left, the front, Around the wall, and the base of grotesque wallpapers.
Nevertheless, even this sample of interlanguage bears out Halliday and Hasan's contention that
"purely anaphoric reference never accounts for a majority of instances [of cohesive textual reference
in any textual sample, written or spoken]" (1976, p. 73). Of the 15 examples of reference involving
the definite article, seven of them -- or approximately one half -- are anaphoric. One possible
implication of Halliday and Hasan's work is that samples of low-level interlanguage are characterized
by the relative absence of cataphoric reference. Further corpus analysis will reveal whether the
relative sophistication of an interlanguage text is measurable in terms of the ratio of anaphoric to
non-anaphoric reference. In this particular example of low-level interlanguage, the ratio is
approximately 50:50, which seems unusually tilted toward anaphoric reference.
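The proportion reported above can be checked with a few lines of arithmetic. What follows is a minimal sketch in Python, not part of the method of the study: the function name anaphoric_share is an illustrative assumption, and the two counts (7 anaphoric instances out of 15 uses of the definite article) are simply those given in the text.

```python
from fractions import Fraction


def anaphoric_share(anaphoric: int, total: int) -> Fraction:
    """Return the proportion of definite-article references that are anaphoric."""
    if total <= 0 or not 0 <= anaphoric <= total:
        raise ValueError("counts must satisfy 0 <= anaphoric <= total and total > 0")
    return Fraction(anaphoric, total)


# Counts reported in the text for "My Favorite Bar or Restaurant".
share = anaphoric_share(7, 15)
non_anaphoric = 15 - 7

print(f"anaphoric share: {float(share):.2f}")
print(f"anaphoric : non-anaphoric = 7 : {non_anaphoric}")
```

On these figures the anaphoric share is 7/15, a little under one half, which is the value the text rounds to 50:50.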
As has already been suggested, this interlanguage text makes only limited use of the demonstratives
themselves. There are, for example, no textual citations for that or these. There is one example of a
structurally cataphoric (and therefore non-cohesive) instance of those in the nominal phrase: for
those who are far from main screen or want to appreciate video clips in detail. The dominance of the
definite article is a central indication that the writer lacks a developed capacity for nominal
demonstrative textual pointing. Moreover, where there are nominal demonstratives present,
the text has previously established a "right to point" in the form of a series of collocationally
significant semantic references (cf. Halliday & Hasan, 1976, pp. 284-288). In the case of the text
under consideration, this series of references includes the steep stairs, filthy door, and the air in the
inner part of this place. In this way, the interlanguage text indicates that the information necessary
for the identification of this place is textually available.
From this brief analysis, it is possible to draw out a number of working hypotheses about the
emergent texture of low-level interlanguage. First, there is the tendency to rely almost exclusively on
the definite article to carry the burden of cohesion, even if this means confining properly cohesive
relations to anaphoric reference alone. In other words, low-level interlanguage is characterized by an
absence of meaningful cataphoric reference. Second, it seems that whatever exophoric
reference there is in low-level interlanguage takes the form of well-established collocational items
such as the Third World and an eye for the unknown. Third, low-level interlanguage texts such as
this one appear to be characterized by the absence of even the simple kind of forward reference in
which the definite article refers to a modifying element within the same nominal group (Halliday &
Hasan, 1976, p. 72). This is in contrast to "most other varieties of spoken and written English
[where the] predominant function [of the] is cataphoric" (Halliday & Hasan, p. 73). Finally, low-level
interlanguage texts like "My Favorite Bar or Restaurant" lack a developed capacity for signaling
what Michael McCarthy has termed the "topical entity in current focus." The exclusive use of it as
the unmarked demonstrative demonstrates an inability to highlight its noun phrase antecedent for
signaling shifts in textual content (McCarthy, 1994, p. 273). This is because the use of it simply
allows for the continuation of what the text is focusing on; "it does not itself perform the act of
focusing" (McCarthy, p. 271). In contrast, the function of this and that is to "operate to signal that
focus is either shifting or has shifted" (McCarthy, p. 272). The relative absence of the marked
demonstratives therefore indicates an underdeveloped capacity for switching the focus of attention
for purposes of textual interest and complexity. Nevertheless, in spite of these obvious limitations,
"My Favorite Bar or Restaurant" does demonstrate the meaningfulness and usefulness of the concept
of emergent texture. It is from such elementary beginnings that the capacity for establishing
extensive cohesive relations will develop.
THE FUNCTION OF THE NOMINAL GROUP
Relations within the nominal group play a central role in the emergence of texture. An analysis of
the nominal demonstratives consists therefore in a detailed study of the emergence of complex
relations among elements of the nominal group. The relations among these functions in the nominal
group are outlined in the table below:
Table 2. The Structural Analysis of a Nominal Group (from Halliday & Hasan, 1976, p. 40)
                The          two          high        stone        walls    along the roadside
Structures:
  logical       Premodifier                                        Head     Postmodifier
  experiential  Deictic      Numerative   Epithet     Classifier   Thing    Qualifier
Classes         Determiner   Numeral      Adjective   Noun         Noun     [Prepositional Group]
The nominal demonstratives form part of the function of Deictics, which are used for specifying by
identity, both non-specific and specific, including forms of identity based on reference (Halliday &
Hasan). In turn,
the Numerative specifies by quantity or ordination (two trains, next train); the
Epithet by reference to a property (long trains); the Classifier by reference to a
subclass (express trains, passenger trains); and the Qualifier by reference to some
characterizing relation or process (trains for London, train I'm on). (p. 42)
The general limits on interlanguage flexibility evident in the use of lexical reiteration were also apparent in
the behavior of the nominal group. The strongest evidence for the lack of flexibility was the absolute
predominance of the four demonstratives with an accompanying unadorned Head in the corpus. This
seems a reasonable and even predictable finding. There was a similar tendency for the definite article
to appear in the company of an unadorned Head in the paragraph chosen to illustrate the more
general limits of interlanguage lexis and cohesive relations, "My Favorite Bar or Restaurant." The
full list of examples is as follows:
This + Unadorned Head
committee, book, idea, interest, year, allegation, article, bar (3 times), blaze, book (2
times), case, century, era, expectation (2 times), field, house (2 times), information,
instance, investigation, job, kitchen, model, ordeal, person, photograph (2 times),
place (7 times), plan (2 times), pressure, project (2 times), question, reason (4 times),
report, restaurant (2 times), room, rule, self-development, semester, sense, shop,
situation, society, stage, summer, truth, university, way (4 times), year (2 times)
That + Unadorned Head
case, man, method, money, point, policy, position, problem, question, reason, slum,
time, way
A comparison of the use of both singular demonstratives with an accompanying unadorned Head
reveals three lexical items in common. Both demonstratives occur with problem, question, and way.
The marked singular demonstrative that also occurs with a number of other abstract nouns: method,
point, policy, position, reason, and time. There are three examples where that occurs with a concrete
noun: money, slum, and man. Is there a tendency for interlanguage texts to allow this to carry the
burden of lexical reiteration and that to carry the burden of the abstract construction of the unfolding
argument? The corpus does tend to show a pattern of use for that in relation to the past, from texts
of fairly limited sophistication to more obviously complex ones:
2. Pablo Picasso is one of the most creative artists in twentieth century. He was born
in 1881, and died in 1973. Though he was Spanish, he played an active part in
France. At first he studied art in Barcelona, and fixed in Paris since 1904. At that
time, he showed his great interest in mouldering as well as painting.
3. Last summer, on a heavy rainy day, I was sitting on the bench in front of Lotte
Department store, waiting for my friend. At that time, I could see faintly someone
coming towards me from a distance.
The obvious question then is under what circumstances that tends to get used with concrete nouns. In
each of the three citations in the corpus, the demonstrative that is used in establishing a reference
within the past:
4. The Reverend Choi is a leader of Dail Community whom I admire because of his
power of love toward neighbor. He was born in Seoul, 1957 and grew up as a
Christian. One day in 1988, He met a helpless and sick old man in front of railway
station and couldn't pass by him, so he served that man a meal.
5. She was born in the province of Scopeye in Yugoslavia in 1910. After she was
called as a nun she was dispatched to Calcutta in India, where she answered God's voice
to help the poorest among the poor by establishing "Missionaries of Charity". She
had served as she did God in that slum the hurt, poor and sick with whom nobody
wanted to contact until the death of heart disease in 1997.
6. In 1975, it was a time that everything looked safe and stable. The restaurant was
always crowded and she was six months pregnant her fifth baby. However, grandfather
was defrauded of his house and every lands. His friend allured him to invest his money
in a new business, but he ran away with that money.
The use of the demonstrative that with more complex nominal groups provides a small amount of
evidence for the idea that its major function is to establish past references. The following paragraph,
which is an extended comparison between the 1970s movie Jaws and the 1990s movie Deep Blue
Sea, contains three examples of that used in this way:
7. Jaws and Deep Blue Sea have many similarities despite of a long time gap. They
have screaming girls in bikinis, floods of bloody water and that ominous gliding fin.
In fact, the opening sequence of Deep Blue Sea is almost the same with that of Jaws;
It starts with a few young people attacked by an unseen object under water. Even the
posters, in which a woman is swimming in the sea and a shark is just behind her with a
wide open mouth, are similar. But, these two movies are very different in the way
they scare people. In Jaws we had just one, dumb shark with a heavy mechanical
equipments inside, but none of the scenes of the existing horror movies have ever
been as scary as that moment in Jaws when the shark first lifted its nose out of the
water.
However, there are problems with this argument. If there is a tendency to use the singular
demonstratives in this manner, with this used to establish concrete and present references and that
used to establish abstract and past references, it might be expected that the same tendency would be
discovered with the plural demonstratives. The corpus citations for these and those, however, are
inconclusive, even if the small number of citations for the use of the latter is taken into account.
These + Unadorned Head
accidents, cooks, days (2 times), dishes, examples, expectations (2 times), facts,
governments, instruments, international financial organizations, measures, methods
(2 times), pictures, places, products, reasons (3 times), sections, shops, statistics,
steps (2 times), stories, things (4 times), values, ways
Those + Unadorned Head
reasons, things
At first glance, there does appear to be the same tendency to use the plural demonstrative those to
establish abstract reference. However, the two nouns discovered in the environment of those are also
found more frequently with these: reasons and things. The appearance of things is particularly
important since this word in both its singular and plural form is used as an anaphoric reference in the
forming of texture. If the pronoun it is temporarily excluded, it is the word thing that is the lexical
item used in the establishment of anaphoric reference at the most general of textual levels. The word
thing "usually excludes people and animals, as well as qualities, states and relations, and … always
excludes facts and reports" (Halliday & Hasan, 1976, p. 279). The evidence for a consistent pattern
involving the marked demonstratives is therefore inconclusive. The possibility that interlanguage
texts use this to establish concrete and present references and that to establish abstract and past
references will require the analysis of a larger corpus of interlanguage texts. This analysis would be
useful in understanding whether these uses of this and that relate to interlanguage development.
Specifically, it is of relevance to the issue of the writer's growing ability to distinguish textually a
shift of attention to a new focus and one that "refer[s] across from the current focus to entities that
are non-current," the latter being the first of McCarthy's criteria for distinguishing between the uses
of this and that.
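Tallies of the kind reported in this section, and the larger-corpus analysis called for above, could be approached along the following lines. This is a rough sketch in Python using only the standard library; it is not the procedure used in the study. The function name following_words and the sample string (abridged from examples 3, 4, 6, and 15) are illustrative assumptions. The pattern simply records the word that immediately follows each demonstrative. It performs no part-of-speech tagging, so it cannot tell an unadorned Head from a premodifier, and it will also catch that used as a conjunction or relative; the output is a list of candidates for manual checking, not a finished count.

```python
import re
from collections import Counter, defaultdict

# A demonstrative followed by whitespace and the next word.
PATTERN = re.compile(
    r"\b(this|that|these|those)\s+([A-Za-z][A-Za-z-]*)", re.IGNORECASE
)


def following_words(text: str) -> dict:
    """Map each demonstrative to a Counter of the words that directly follow it."""
    counts = defaultdict(Counter)
    for match in PATTERN.finditer(text):
        demonstrative = match.group(1).lower()
        next_word = match.group(2).lower()
        counts[demonstrative][next_word] += 1
    return counts


# Illustrative input, abridged from the learner examples quoted in this paper.
SAMPLE = (
    "At that time, I could see faintly someone coming towards me. "
    "He served that man a meal. "
    "He ran away with that money. "
    "Follow these steps and you will completely forget about your hangover."
)

counts = following_words(SAMPLE)
for demonstrative, words in sorted(counts.items()):
    print(demonstrative, dict(words))
```

Comparing the resulting word lists across this, that, these, and those would then show which nouns are shared between the marked and unmarked forms.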
THE DEMONSTRATIVES AND COMPLEX NOMINAL GROUPS
The general restrictions that apply to the use of the nominal group with demonstrative reference can
be seen more clearly in the few samples of greater nominal group complexity that occur in the
corpus. It is particularly revealing to examine the manner in which these more complex nominal
groups emerge during the course of textual development. Normally, the text prepares for the
entrance of these nominal groups in significant ways. The general tendency seems to be that
complex nominal groups are "grown" from previously existing textual possibilities (cf. Halliday,
1992, p. 70). Such an explanation accounts for a passage such as this one:
8. In the middle of the coffee shop, you will see an interesting plastic art made of
glass. This seems like the glass boxes piled up from the floor to the ceiling. There are
seven sections in this glass pillar. These sections are empty excluding second and fifth,
which are filled with empty plastic bottles. This special plastic art is associated with
simple and modern mood of here.
The following example of a place description is unusual in its use of more complex nominal groups to
achieve textual reiteration. The description varies its use of accompanying nominal group epithets,
moving from the street to Insadong Street to this small street and ending up with this famous street:
9. Insadong is my favorite place in Seoul, located between the Korea Times Building
and Pagoda Park. The main part of Insadong runs along the street with the same
name. It is located between two east-west running avenues in the downtown area and
either avenue can be considered an entrance. The south end of Insadong starts at
Pagoda Park on Chongno Street. Chongno is itself a major thoroughfare passing an
important business section of Seoul. From Chongno, Insadong Street runs a northwest
diagonal until it reaches Yulgongno, another major avenue. This small street is loaded
with antique shops selling all sorts of Korean antiques and handcrafts. Many stores are
specialty shops featuring items such as chests, other furniture, stationery. But this
famous street is not limited to antiques.
More typically, interlanguage texts tend to employ the concluding sentence to sum up central aspects
of the previous discussion. The following example of the use of a demonstrative with a complex
nominal group occurs at the end of a description of a garden located on the campus of Yonsei
University:
10. Finally, in the winter, the snow covered trees make an awesome scenery. I have
not seen the latter scene yet, but the seniors say that it is wonderful. So I can not wait
until winter. The birds singing and cute squirrels and Korean magpies running around
also make "Chung-Song-Dae" rich in atmosphere. Those who have dreams have
many convincing reasons why they should definitely visit this magnificent garden.
The major examples of complex nominal group reference all occurred in extended comparison or
contrast paragraphs. Examples of these included these two films (2 times), these two movies (2 times),
and these two people. One mark of the lack of sophistication of the majority of texts in the corpus is
thus measured in the strict limits placed on the complexity of the nominal group introduced by the
demonstratives.
REITERATION AND THE LIMITS OF INTERLANGUAGE LEXIS
It is useful at this point to recall the basic idea that reiteration is a form of textual cohesion that
involves a variety of lexical possibilities. In its most basic form, reiteration simply means the
repetition of a lexical item. This form of textual cohesion dominates low-level interlanguage texts.
Halliday and Hasan note, however, that this type of cohesion may also involve the use of a
synonym or near-synonym, a superordinate or the use of a general noun (1976, p. 278).
The general principle behind this is simply that demonstratives, since (like other
reference items) they identify semantically and not grammatically, when they are
anaphoric require the explicit repetition of the noun, or some form of synonym, if
they are to signal exact identity of specific reference; that is, to refer unambiguously
to the presupposition at the identical level of particularization. A demonstrative
without a following noun may refer to some more general class that includes the
presupposed item…. (Halliday & Hasan, 1976, pp. 64-65)
The class of general nouns is defined by Halliday and Hasan as "a small set of nouns having
generalized reference within the major noun classes, those such as 'human noun,' 'place noun,' 'fact
noun' and the like" (p. 274). The general noun operates on a borderline "between a lexical item
(member of an open set) and a grammatical item (member of a closed set)" (p. 274). The list of the
class of general nouns follows:
Table 3. The Class of General Nouns
Class                      Examples
human                      people, person, man, woman, child, boy, girl
non-human animate          creature
inanimate concrete noun    thing, object
inanimate concrete mass    stuff
inanimate abstract         business, affair, matter
action                     move
place                      place
fact                       question, idea
As Halliday and Hasan (1976, p. 175) note, "a general noun in cohesive function is almost always
accompanied by the reference item the … The most usual alternative to the is a demonstrative…" It
is probably of significance that the only class category to be adequately represented in the corpus in
conjunction with the definite article is the class of human general nouns. There were nine citations
for the people, seven for the person, six for the man (but none for the boy), two for the girl (but
none for the woman), and none for the child. There seemed to be a small but real tendency to use the
girl as the unmarked general noun for women but the man as the unmarked general noun for men.
There was also a single reference to this person in an essay on the subject of admiration for the late
Korean nationalist, Kim Ku. What is striking from the point of view of lexical cohesion is that the
corpus contains virtually no citations from the other classes of general nouns listed by Halliday and
Hasan. The only category in which they appeared was that of general nouns of fact. The corpus
contained one reference to the idea and one to the question. The latter citation, however, occurred
within a portion of quoted text from an English language source. There was one citation for this
question and one for this idea, with no citations for the corresponding plural forms. There was also
one reference to that question but none to that idea, with no citations for the corresponding plural
forms. This general under-representation of the category of general nouns in the corpus has obvious
implications for the ability of these interlanguage texts to achieve the full range of lexical cohesion.
What it means is that these interlanguage texts restrict their use of the items that make up the
category of general nouns to lexical instances of specified anaphoric reference within the text. In
other words, with the partial exception of the category of human nouns, these interlanguage texts do
not appear to trade at the level of the abstract general noun. Moreover, a number of the citations
relating to the category of human nouns may be the result of the fact that a suggested paragraph
topic was "The Famous Person I Most Admire." Although the general nouns occur frequently as
lexical items, an entire cohesive level remains almost entirely unrealized. There are a number of
possible explanations for this; two will be considered.
The first is that the absence of the full range of abstract general nouns is the result of the lexical
constraints of the genres represented in this corpus. In other words, a different choice of genres,
regardless of interlanguage considerations, will result in a greater overall representation of the class of
abstract general nouns. The second explanation is that this absence represents a significant limitation
on the lexical range actualized in interlanguage itself. It is the second explanation that this paper
favors. The class of general nouns is absent because of the nature of interlanguage texts themselves.
Beyond their function in achieving textual cohesion by virtue of the reference back to a previous
nominal group, general nouns regularly signal the ability of the writer to refer to textual material in
an interpersonal manner (Halliday & Hasan, 1976, p. 276). It seems reasonable to suggest that the
capacity for referring to textual material in this way emerges only at highly sophisticated levels of
interlanguage. Naturally, more extensive corpus analysis will be necessary to support or refute this
working hypothesis. In this respect, one fruitful line of enquiry would be the investigation of a corpus
of five paragraph essays, weighted toward the genres of argument and persuasion. Other things being
equal, such a corpus might be expected to contain examples of inanimate concrete and inanimate
abstract general nouns. The absence of these lexical items would offer further evidence as to whether
the seeming inability to trade at the level of the general noun observed in this corpus represents a
genuine limitation on the achievement of a high level of interlanguage texture.
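The further corpus analysis proposed here would involve counting how often each class of general noun appears with a reference item. The following is a hedged sketch of such a count in Python, using only the standard library. The class labels and members are taken from Table 3 as printed; the function name count_general_nouns, the sample string, and the decision to match only the exact forms listed (so plural things or questions would be missed) are assumptions made for illustration. The count registers co-occurrence only. Whether a given instance is actually cohesive still has to be judged by reading the text.

```python
import re
from collections import Counter

# Classes and members as listed in Table 3.
GENERAL_NOUNS = {
    "human": ["people", "person", "man", "woman", "child", "boy", "girl"],
    "non-human animate": ["creature"],
    "inanimate concrete noun": ["thing", "object"],
    "inanimate concrete mass": ["stuff"],
    "inanimate abstract": ["business", "affair", "matter"],
    "action": ["move"],
    "place": ["place"],
    "fact": ["question", "idea"],
}

REFERENCE_ITEMS = ("the", "this", "that", "these", "those")


def count_general_nouns(text: str) -> Counter:
    """Count reference item + general noun sequences, grouped by noun class."""
    noun_to_class = {
        noun: cls for cls, nouns in GENERAL_NOUNS.items() for noun in nouns
    }
    pattern = re.compile(
        r"\b(%s)\s+(%s)\b" % ("|".join(REFERENCE_ITEMS), "|".join(noun_to_class)),
        re.IGNORECASE,
    )
    by_class = Counter()
    for match in pattern.finditer(text):
        by_class[noun_to_class[match.group(2).lower()]] += 1
    return by_class


# Illustrative input only; not drawn verbatim from the corpus.
SAMPLE = (
    "One day he met a helpless and sick old man and served that man a meal. "
    "The people in this place admire the person."
)

general_counts = count_general_nouns(SAMPLE)
print(dict(general_counts))
```

A corpus in which the classes other than human remain at or near zero would be consistent with the limitation described above; a corpus weighted toward argument and persuasion would show whether that result is a matter of genre.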
LABELS, SYNONYMS AND NEAR-SYNONYMS WITH THE DEMONSTRATIVES
In low-level interlanguage texts, there is very little use of synonyms or near-synonyms to achieve
lexical reiteration. At this stage in interlanguage development, the writer's lack of substantial lexical
depth means that the establishment of a basic overall textual meaning takes absolute precedence. The
upshot is that samples of interlanguage offer little evidence of the writer's contemplation of
synonymous or near-synonymous lexical items. The overriding importance of establishing overall
textual coherence explains the early use of anaphorically cohesive nominal groups as retrospective
labels. As Gill Francis explains, a retrospective label "is not a repetition or a 'synonym' of any
preceding element. Instead, it is presented as equivalent to the clause or clauses it replaces, while
naming them for the first time" (Francis, 1994, p. 85). Certain genres, among them process or
persuasion paragraphs, tend to encourage the use of retrospective labeling. Favored lexical items
found in the corpus to achieve this kind of limited textual coherence in conjunction with the use of
the demonstratives include allegation, case, examples, facts, measures, ordeal, plan, project, reason
(four times), situation, steps (two times), truth, report, and way (four times). The simplest form of
this type of lexical coherence occurs in the following description of a friend of the writer:
11. Moreover, Chang is not afraid of expressing himself, probably because since he
was young, he was encouraged to do what he feels is right. He is very straight forward
in stating his thoughts, therefore he often ends up hurting others' feeling, although
not done on purpose. For this reason, there are many people who love Chang, but
there are also many people who hate Chang
The following four paragraph conclusions use virtually identical techniques to achieve textual unity.
The paragraphs demonstrate the use of four different labels introduced by the proximate plural
demonstrative: examples, facts, measures, and steps respectively:
12. Se-lim is a feminine girl, and her garments shows her feminine personality well.
She usually puts on a skirt and a laced blouse and a pair of shoes which have high
heels. She likes cute accessories, too. Through these examples, we can see that
people's personalities affect the style of clothing.
13. There is also some belief that Korean students are spoiled and spend lots of
money on drinking and playing. This idea probably comes from the fact that many
commercial districts are being developed near the university areas, but the reality isn't
always like that. As a matter of fact, most of my university friends have a part time
job to make money for tuition. Also nowadays books have become really expensive,
so more than ever, huge amounts of money go into buying books for class. From
these facts, one could see that most Korean students can't afford to be spoiled.
14. You can speak to her on her face or on the telephone. You can also write her
when you are scared to tell her the truth directly. It is the most powerful and definite
way to make her know your resolution since she would perfectly know what you are
thinking and what you are going to do. When you are looking for a reliable way to
break up with your girlfriend you can change your attitude toward her, take symbolic
actions and tell her what you have in our mind. These measures help you to get
separated from your girlfriend without difficulty.
15. In sauna, 20 to 30 minutes of bath and 10 to 20 minutes of sauna will make your
body relaxed, and then you take a nap in sleeping room in the sauna. After two or
three hours of sleep, you take a shower and come home. All these process in the
sauna will finally make you sober. In the evening, you have a regular dinner. However,
it is extremely important to have a dinner because if the sulong-tang was the first
step to cure a stomach-ache, having a regular meal is the final step. When you finish
the dinner your stomach-ache will go away. Follow these steps and you will
completely forget about your hangover.
A related use of the plural noun reasons in conjunction with a fronted these occurred twice as an
alternative way of summarizing and unifying a connected series of arguments or propositions:
16. On the wall, there are all kinds of posters: posters of movie stars, old newspapers,
racing cars, scenes from movies, etc. Moreover, the lighting is suitably dim and the
music is not too loud to have a conversation. As you take a seat, you'll find a red-stripped
tablecloth on the table, with the names of the waiter and the cook on one
side. Just as you decide on what to eat, the waiter, in green shirt and black pants, will
be at your side in an instant, kneeling down on the floor as he takes the order. The
food is quite good and the service cannot be better. These are a few reasons why I
prefer Bennigan's over other restaurants
17. The last and most irritating thing for Kim was the comparison with Pak, who had
already won four times in her first year debut in the LPGA tournaments. The mass
media only emphasized their scores totally ignoring their situations. In conclusion,
her mental strength and continuous effort made it possible for her to surmount her
physical weakness, harsh environment and stress from comparison with Pak Seri and
these are the reasons why I admire Kim Mi-Hyun.
Genuine examples of the use of synonyms or near-synonyms are quite rare in the corpus. An
example of their use, however, is the following text on the recent government plan to reform
Korean universities:
18. But this plan has problems in four ways, so this should be reconsidered seriously
right now. Firstly, the project is drift in the wrong direction. The scheduled beginning
of the project was postponed the day after it was first announced due to a change of
the education minister. Many revisions have also been done after the original public
announcement, so these are revealing that the project was put together too hastily
("The BK21" The Yonsei Annals). Secondly, this plan meets with most regional
universities. They claim that a disproportionate amount of support would go to the
prestigious universities in the capital area by this project.
The use of retrospective labels as a means for achieving textual cohesion tends to confirm the
working hypothesis that anaphoric reference predominates in interlanguage texts exhibiting basic
emergent texture. In other words, advance labels, in which the label precedes the lexicalization, are
uncommon in the corpus. Moreover, it seems plausible to assume that examples of advance labeling
in interlanguage texts will tend to be resolved within a sentence or two. The one example of
advance labeling in the corpus, which occurs in a paragraph dealing with the subject of
how to break up with a boyfriend, is resolved within the confines of a single sentence, across the
space of a colon:
19. Everything would look perfect when you began to go out with your boy friend. As
time went by, however, you found many problems in the relationship with him.
Finally, you decide to break up with your boy friend. It will be a difficult experience.
However, you can break up with your boy friend if you follow these steps: think about
the reasons to break up, have a break time, get separate and put memorial things
away.
Since it is a working hypothesis that more sophisticated interlanguage texts show a gradual decrease
in the frequency imbalance between anaphoric as opposed to cataphoric reference, the gradual
emergence of advance labels is also a sign of interlanguage development. The ability to pursue lexical
cohesion across larger portions of text is a sign of a progression beyond emergent texture.
THE DEMONSTRATIVES AND CATAPHORIC REFERENCE
Examples of genuine cataphoric reference were rare in the corpus. This is not surprising, given that
the unmarked anaphoric reference is still a source of textual cohesive difficulty at this level of
interlanguage development. Judging by the example of "My Favorite Bar or Restaurant," the plural
far demonstrative those emerges at an apparently very early stage of interlanguage in the formation
of cataphoric non-cohesive reference. Other typical examples of this type of non-cohesive reference
included: those of gothic church; those of Shakespeare's; those who are enrolled in the science high
school; those who had different political orientations; those who have dreams; those who need help;
those within cultural circles; and those who agree with [sic]. The genre of interlanguage writing
containing the most examples of cataphoric reference was that of advice to the reader. One example
of cataphoric reference involving the demonstratives, for example, occurred in a paragraph on the
subject of how to cure a hangover:
20. At first, one preventive measure is this: Never drink enough to get really drunk.
The second example occurred in a paragraph dealing with the recent spate of deadly fires in public
spaces in Korea in which the writer employs the notion of moral hazard to explain the apparent
indifference to safety on the part of many public officials:
21. The word moral hazard is defined like this: "Moral hazard arises when individuals,
in possession of private information, take actions which adversely affect the
probability of bad outcomes."
The main point to make about these examples is that they tend to confirm the general idea
developed in the discussion of synonyms and near-synonyms: These interlanguage samples tend to
develop broad rhetorical patterns of textual coherence. It is then within these broadly defined
patterns that finer cohesive relations begin to emerge. In each of these examples, the particular
genre is important. The genre gives to the interlanguage text abstract rhetorical possibilities for
cataphoric reference. Depending on the sophistication of the writer's interlanguage, this abstract
rhetorical possibility may be activated. There are two main points to make about this. The first is
that cataphoric reference in the corpus is nonetheless rare, even in those paragraphs dealing with the
description of process in which it might be expected. The second point is that when it occurs, the
cataphoric reference is resolved quickly; indeed, in each of the three cases cited, it is resolved intra-sententially.
CONCLUSION
The concept of emergent texture would appear to have a promising future in the ongoing
investigation of interlanguage. In particular, it reveals its usefulness in its relative objectivity as a
means of analyzing lexical relations above that of the individual sentence. This is so long as the
concept of markedness is used in a consistent manner within a project committed openly to the
investigation of actual interlanguage corpora. The marriage of functional grammar and text
linguistics provides a rich store of useful concepts with which to continue this investigation of
interlanguage corpora. One obvious possibility for future work would be the extension of this study of
the nominal demonstratives to related examinations of the emergence and function of the systems of
personal and comparative reference within the single paragraph. Within the framework provided by
Halliday and Hasan, there is also the possibility of future studies extending this initial study to a
corpus of five paragraph essays, taking in the full range of reference including substitution, ellipsis,
and conjunction. The most interesting area for future interlanguage research, however, is undoubtedly
the range of lexical cohesion. This research would involve crucial issues of relevance to many aspects
of second language learning, focusing as it would on the shape and size of interlanguage semantic
fields. Concretely, the analysis would involve in-depth studies of the various kinds of interlanguage
reiteration, including the use of synonyms and near-synonyms, superordinates, and collocates. Much
of this research would have a useful contribution to make to the new field of interlanguage semantics
and second language mental lexicons (cf. Hatch & Brown, 1995). Naturally enough, the issue of
cohesion does not exhaust the complex issues surrounding the evaluation of the relative
sophistication of a given interlanguage text.
This study has argued for the importance of integrating an analysis of the degree of emergent texture
as a means for such evaluation. The analysis has attempted to demonstrate a rough and ready
distinction between low-level interlanguage texts that rarely or never employ the nominal
demonstratives and interlanguage texts of greater sophistication that do. Naturally, an emergent
texture analysis capable of distinguishing among the full range of interlanguage achievement will
require much more detailed corpus research. Moreover, there is an obvious reason why the analysis of
interlanguage cohesion ought to form part of a wider investigation of interlanguage textual
linguistics. Two samples of interlanguage with relatively similar kinds of cohesive relations may differ
widely in terms of the ease, efficiency and appropriateness of the information they convey to the
reader (Beaugrande & Dressler, 1981, p. 34). Happily enough, markedness theory, in the shape of
default settings for the presentation of argument and the establishment of coherence, also has its role
to play at the level of textual coherence (cf. Beaugrande & Dressler, pp. 143-161). In this regard, the
role of the corpus and of corpus software will be important as a means for equipping applied linguists
with a more refined set of tools for the analysis of texture and textuality.
NOTE
1. Michael McCarthy, in an otherwise excellent essay, states that Halliday and Hasan do
"nothing to resolve the difference between it on the one hand and this and that on the other"
(1994, p. 267). This is not entirely accurate. Halliday and Hasan resolve this difference
implicitly, stating that "both it and the … can be explained as being the 'neutral' or nonselective type of the nominal demonstratives" (1976, p. 58).
ABOUT THE AUTHOR
Dr. Terry Murphy's interest in second language writing is part of his overall interest in textual
linguistics and narrative discourse within the sociology of culture. This essay is a modified version of
a thesis he submitted to the University of Birmingham in partial fulfillment of the requirements for
an MA in TEFL.
E-mail: tmorpheme@hotmail.com
REFERENCES
Beaugrande, R. de. (1997). New foundations for a science of text and discourse. Norwood, NJ: Ablex
Publishing Company.
Beaugrande, R. de, & Dressler, W. (1981). Introduction to text linguistics. New York: Longman.
Biber, D. (1988). Variation across speech and writing. Cambridge, UK: Cambridge University Press.
Clark, H., & Clark, E. (1978). Universals, relativity and language processing. In J. Greenberg (Ed.),
Method and theory, Universals of Human Language Vol. 1. (pp. 235-277). Stanford, CA: Stanford
University Press.
Connor, U. (1996). Contrastive rhetoric. Cambridge, UK: Cambridge University Press.
Francis, G. (1994). Labelling discourse: An aspect of nominal-group lexical cohesion. In M.
Coulthard (Ed.), Advances in written discourse analysis (pp. 83-101). New York: Routledge.
Freedman, A., Pringle, I., & Yalden, J. (Eds.). (1979). Learning to write: first language/second
language. New York: Longman.
Goldmann, L. (1964). The hidden god: A study of tragic vision in the Pensées of Pascal and the
tragedies of Racine (P. Thody, Trans.). London: Routledge.
Greenberg, J. (1966). Language universals. The Hague: Mouton.
Halliday, M. (1992, August 4-8). Language as system and language as instance: The corpus as a
theoretical construct. In Jan Svartvik (Ed.), Directions in corpus linguistics, Proceedings of Nobel
Symposium 82 (pp. 61-77). New York: Mouton De Gruyter.
Halliday, M. (1994). An introduction to functional grammar (2nd ed.). London: Edward Arnold.
Halliday, M., & Hasan, R. (1976). Cohesion in English. New York: Longman.
Hatch, E., & Brown, C. (1995). Vocabulary, semantics, and language education. Cambridge, UK:
Cambridge University Press.
Jakobson, R. (1957). Shifters, verbal categories and the Russian verb. Russian Language Project.
Department of Slavic Languages and Literature. Cambridge, MA: Harvard University.
Jakobson, R., & Pomorska, K. (1983). Dialogues. Cambridge, MA: The MIT Press.
Kaplan, R. (1966). Cultural thought patterns in intercultural education. Language Learning, 16, 1-20.
Kroll, B. (Ed.). (1990). Second language writing: research insights for the classroom. Cambridge,
UK: Cambridge University Press.
Laufer, B., & Nation, P. (1998). Vocabulary size and use: lexical richness in L2 written production.
Applied linguistics, 19(2), 225-254.
McCarthy, M. (1994). It, this and that. In M. Coulthard (Ed.), Advances in written discourse analysis
(pp. 266-275). New York: Routledge.
Rutherford, W. (1982). Markedness in second language acquisition. Language learning, 32(1), 85-109.
Rutherford, W. (1987). Second language grammar: Learning and teaching. New York: Longman.
Shaw, P., & Liu, E. (1998). What develops in the development of second-language writing? Applied
linguistics, 19(2), 225-254.
Waugh, L. R. (1976). Roman Jakobson's science of language. Lisse, The Netherlands: The Peter De
Ridder Press.
Language Learning & Technology
http://llt.msu.edu/vol5num3/wang/
September 2001, Vol. 5, Num. 3
pp. 174-184
EXPLORING PARALLEL CONCORDANCING
IN ENGLISH AND CHINESE
Wang Lixun
The Open University of Hong Kong
ABSTRACT
This paper investigates the value of computer technology as a medium for the delivery of
parallel texts in English and Chinese for language learning. An English-Chinese parallel
corpus was created for use in parallel concordancing -- a technique which has been
developed to respond to the desire to study language in its natural contexts of use.
Specific problems of dealing with Chinese characters in concordancing are discussed. A
computer program called English-Chinese Parallel Concordancer was developed for this
research. The operation of the program is demonstrated through screen shots. The
pedagogical application of parallel concordancing in English and Chinese is illustrated
through examples from some teaching and learning experiments, and the Data-Driven
Learning approach is applied and explored. It is hoped that parallel concordancing in
English and Chinese will become a useful and popular tool for both English and Chinese
learners in their second language learning.
INTRODUCTION
Parallel concordancing is a tool which has been developed to respond to the desire (fuelled by linguists
such as Sinclair) to study language in its natural contexts of use. It allows us to place side by side for
comparison two contexts produced for a given item -- phrase, word, or morpheme -- one being a
translation of the other. It has many uses in translation studies and in translation pedagogy, such as in the
compilation of bilingual dictionaries. However, in the present paper it is the pedagogical value of parallel
concordancing which will receive attention.
The main research interest in this paper is in the use of parallel concordancing in the teaching of
languages, specifically in its use as a form of consciousness-raising, of making learners aware of the
differences between the target language and their own language (Rutherford, 1987). By comparing the
contexts obtained for an item in one language, with the translations of the contexts in the other language,
learners can see how the item is rendered according to varying contextual elements (Roussel, 1991). This
can be useful pedagogically as, for example, it can help to prevent the L2 of more advanced learners from
becoming fossilised and settling into the use of cognate but contextually inappropriate structures in the
target language. It can help one to look at the way a given structure is used in different styles or registers,
or by different age groups, or by native and foreign speakers (King, 1989).
Barlow, who developed the ParaConc (Barlow, 2001) program for parallel concordancing, claims that
parallel texts (texts that are translations of each other) are a promising resource for a range of research
projects related to language learning. Using parallel texts, as he puts it, "allows language learners to
directly investigate (perhaps in response to queries posed by the teacher) the main correspondences
between particular words and structures in two languages" (Barlow, 1996a). It helps beginning learners to
create an awareness for the feel of a second language and also to obtain some concrete knowledge of
correspondences. It also helps advanced learners to deepen their knowledge of words and phrases: to
understand not just the main meaning or most common meanings of a word, but to understand a range of
meanings and to perceive how context in terms of discourse and genre provides clues to the appropriate
meaning (Barlow, 1996a, 1996b). In this paper, some pedagogic applications of parallel concordancing
are explored, making use of Barlow's insights and also the Data-driven Learning (DDL) approach (Johns,
1991, 1993, 1994), which will be discussed in the section "Parallel Concordancing for Lexical Learning."
To carry out parallel concordancing in English and Chinese, I constructed an English-Chinese parallel
corpus and developed a software package, English-Chinese Parallel Concordancer (Wang, 2000). A
concordance example of the word xian4zai4 (now) is discussed in the paper, revealing an insight into
different uses of the word, and how the findings can be applied in language learning. (Xian4zai4 is
Chinese Pinyin, the Roman transliteration of Chinese characters, which is used throughout this paper for
the convenience of English readers. The numbers are tone markers.)
PROBLEMS OF DEALING WITH CHINESE CHARACTERS IN CONCORDANCING
Although parallel concordancing has been carried out between several European languages, it seems not
to have been previously extended to non-alphabetic languages such as Chinese. This is due to
fundamental differences in the language systems which create complex conceptual and computational
problems of alignment. The most immediate differences between Chinese and the European languages are
that Chinese is written in ideograms rather than alphabetic characters, and that it lacks the properties of
most European grammatical systems. For example, it has no articles, no tenses, no participles or gerunds,
no moods, and virtually no inflections. It did not even have punctuation until this was introduced from the West
at the end of the 19th century.
Even in a language such as English, the definition of "word" can be problematic. For example, is "crabmeat"
one word, or two? Defining a word is even more problematic, however, in a language such as Chinese.
Written Chinese gives no indication of which characters are to be considered as words and which
combine with others to form compound words. For example, according to standard Chinese grammar
rules, ban4 (half), tu2 (way), er2 (but), and fei4 (give up) are four words, which should be separated by
spaces. But most Chinese people consider this four-character combination a single word (give-up-halfway). This type of combination is very common in Chinese, having a similar function to that of an idiom
in English, although the characters in it normally keep their original meanings rather than combine with
others to form compound words. Also, unlike English idioms, the compound can function as an adjective,
adverb, or verb, which might explain why people usually regard it as a single word.
If we want to take account of the non-correspondence between character and word, we must first develop
some way of establishing when a string of characters can be considered a word. Then, in entering the
Chinese text on computer, spaces can be inserted between these conceptual words to correspond to the
standard graphical indication of a word in English. Thus wo3 (I), qi2 (ride) zi4 xing2 che1 (bicycle) would
be entered as wo3 qi2 zi4xing2che1.
However, there are a number of technical problems associated with this form of alignment. It seems
impractical to design a computer program to insert spaces automatically, since two successive characters
may be either one or two words according to the context. This means that the spaces have to be added
manually, which is costly in terms of time and money. Furthermore, the end-user searching for a word
with the retrieval software may conceive of words differently from the original corpus compiler and may
have to make several attempts to match the compiler's input.
Given the technical and conceptual problems associated with non-correspondence alignment, it appeared
that the only practical solution was to make an assumption of character-word correspondence and thus
treat each Chinese character as a word. Having made this assumption, the inputting task was made easier
by the Chinese word processor NJStar, which not only inserts spaces between Chinese characters
automatically, but can also convert Chinese characters into Pinyin, which is very important for English-speaking people wanting to learn or pronounce Chinese.
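The character-as-word assumption can be sketched in a few lines of Python. This is only an illustration of the idea, not NJStar's actual procedure: the function names are hypothetical, and the test for a Chinese character is restricted, as an assumption, to the main CJK Unified Ideographs block.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block (assumed range)."""
    return "\u4e00" <= ch <= "\u9fff"


def space_characters(text: str) -> str:
    """Treat every Chinese character as a word by surrounding it with spaces.

    Runs of non-Chinese characters are left intact. All whitespace,
    including line breaks, is normalised to single spaces.
    """
    padded = "".join(f" {ch} " if is_cjk(ch) else ch for ch in text)
    return " ".join(padded.split())


# wo3 qi2 zi4 xing2 che1 ("I ride a bicycle")
print(space_characters("我骑自行车"))
```

Under this scheme zi4 xing2 che1 is stored as three separate tokens rather than as the single conceptual word zi4xing2che1, which is exactly the simplification the assumption accepts.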
CREATING AN ENGLISH-CHINESE PARALLEL CORPUS
Unlike other concordancing programs such as Microconcord (Johns, 1986) or Wordsmith (Scott, 2000),
which can be used on any collection of texts, a parallel concordancer must be used on a corpus consisting
of parallel texts in two or more languages. Before developing the concordancing program, then, it was
necessary to select texts in order to set up an English-Chinese parallel corpus.
The corpus aims at helping intermediate English or Chinese language learners, such as university
students, further improve their second language. Thus, the texts chosen were English or Chinese texts
which are fairly easy to understand from the point of view of vocabulary, syntax, and discourse.
University students are usually interested in genres such as novels, fables, essays, autobiographies,
magazines, and general scientific articles, so these genres were taken into first consideration. To keep a
balance, about half the source texts were in English and half in Chinese. Only written materials were
collected, as it was too difficult for the present research to cover transcribed spoken materials. To ensure
that the quality of translation was good, only published translations were selected. The corpus now
contains about 1 million words in English and 2 million characters in Chinese. Table 1 shows the
percentage distribution of genres in the corpus.
Table 1. Percentage Distribution of Genres in the Corpus
Genre               %
novel               50
essay               15
fable               10
autobiography       5
scientific article  5
political address   5
magazine            5
other               5
Initially, the method of inputting the texts was to scan in English texts and type in Chinese texts.
Subsequently, Chinese texts were scanned with SunmiPage ScanInsert OCR software (Liang, 1997) and
then edited. The texts used are either copyright-free or used with permission obtained from the authors.
After editing, the texts needed to be marked up. The purpose of marking up texts is to define sentence and
paragraph boundaries so that a sentence in one text can be matched with its translation in the other by the
parallel concordancing program. In order to keep the size of the text files as small as possible, minimal
marking up was used: The only necessary element is <S> to identify sentence boundaries, as the program
was developed in such a way as to recognise paragraph boundaries without special markers.
Electronically, each Chinese punctuation mark occupies two bytes, while each English mark occupies
only one byte. A program was developed to automatically mark up Chinese text according to Chinese
punctuation and English text according to English punctuation.
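A minimal sketch of such a mark-up step is given below. It is not the program described above: the function name, the choice to prefix each sentence with <S>, and the sets of sentence-final marks are assumptions, and the second sample sentence is invented for illustration. The byte counts assume a GB2312 encoding, which the text does not name. Abbreviations such as "Mr." would be split wrongly by so simple a rule.

```python
import re

# Sentence-final punctuation; both sets are illustrative assumptions.
CHINESE_END = "。！？"
ENGLISH_END = ".!?"


def mark_up(text: str, enders: str) -> str:
    """Prefix each sentence with <S>, splitting after sentence-final marks."""
    cls = re.escape(enders)
    # One sentence = a run of non-final characters plus any final marks.
    pattern = f"[^{cls}]+[{cls}]*"
    sentences = [s.strip() for s in re.findall(pattern, text)]
    return "".join(f"<S>{s}" for s in sentences if s)


print(mark_up("She was now ten inches high. Her face brightened.", ENGLISH_END))
print(mark_up("她现在只有十英寸高了。她很高兴！", CHINESE_END))

# Under GB2312 (assumed), a Chinese full stop takes two bytes, an English one takes one.
print(len("。".encode("gb2312")), len(".".encode("gb2312")))
```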
THE DEVELOPMENT OF THE ENGLISH-CHINESE PARALLEL CONCORDANCER
Since 1997, I have been developing the English-Chinese Parallel Concordancer (E-C Concord), and the
first version was successfully completed in 2000. It works in a Windows 95/98 environment, and can carry
out sentence-by-sentence parallel concordancing in English, Chinese, and Pinyin. The main technical
problem in developing a program for parallel concordancing related to the alignment method used for
identifying equivalent sentences between texts. A major problem in aligning texts arises when the number
of sentences in the source language differs from that in the target language. The situation could also arise
where the number of sentences in a paragraph is the same, but the divisions between them do not
coincide. A program called Multiconcord (Woolls, 1997) had previously been developed at the University
of Birmingham, using an algorithm which automatically looks for disturbance between the two texts and
re-establishes the matches by joining several short sentences together in one language to match a long one
in the other. The algorithm gives satisfactory accuracy in aligning parallel texts in European languages
(Woolls, 1998). However, an adaptation of this program to align texts in English and Chinese only
achieved an accuracy of about 60%, based on an accuracy test carried out by Woolls and the author. The
decision was then taken that for the present research the texts would be pre-aligned -- which of course
gives an accuracy of 100%. That accuracy is achieved at the cost of time-consuming manual pre-editing
of the texts.
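With pre-aligned texts, matching reduces to pairing the n-th sentence of one paragraph with the n-th sentence of its translation. The sketch below illustrates that idea only; it is neither the Multiconcord algorithm nor the E-C Concord code, and the function names and the use of <S> as a sentence prefix are assumptions. A mismatch in sentence counts is reported rather than repaired, since repairing it is the manual pre-editing step.

```python
def split_sentences(marked: str) -> list[str]:
    """Split a marked-up paragraph on <S>, dropping empty pieces."""
    return [s.strip() for s in marked.split("<S>") if s.strip()]


def pair_paragraph(english: str, chinese: str) -> list[tuple[str, str]]:
    """Pair the n-th English sentence with the n-th Chinese sentence.

    Raises ValueError when the counts differ, because the paragraph
    then still needs manual pre-editing before it counts as aligned.
    """
    en, zh = split_sentences(english), split_sentences(chinese)
    if len(en) != len(zh):
        raise ValueError(f"{len(en)} English vs {len(zh)} Chinese sentences")
    return list(zip(en, zh))


# Illustrative paragraph assembled from sentences quoted in this paper (Pinyin shown).
english = "<S>And now for the garden!<S>Quick, now!"
pinyin = "<S>xian4zai4 gai1 dao4 hua1yuan2 li3qu4 la5!<S>kuai4dian3, xian4zai4 jiu4 qu4!"
for en, zh in pair_paragraph(english, pinyin):
    print(en, "|", zh)
```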
Figure 1. Screen shot of the search window of E-C Concord
The program allows the user to type in a search item in the "search box," and choose a Search Language
and a Target Language. When entering an English or Pinyin search item, wild cards (*) are acceptable, so
that "book*" can be "book," "books," "booking," "booked," and so forth, and "wang*" can be "wang1,"
"wang2," "wang3," or "wang4." Wild cards cannot be used with Chinese characters. The user needs to
select one or more text files from the file list: These files contain the corpus data. The program provides
three ways of concordancing: (a) Monolingual Concordance, Key-Word-In-Context; (b) Monolingual
Concordance, Sentence-by-Sentence; and (c) Parallel Concordance, Sentence-by-Sentence. The user can
also control the maximum search hits. After making all the necessary choices and pressing the "Search"
button, the user will get a result such as shown in Figures 2 and 3.
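The wild card behaviour described above can be approximated with regular expressions. The following Python sketch is an assumption about how such a search might work, not the program's own code; the function names and the max_hits parameter are hypothetical. Note that "book*" interpreted this way would also match longer words such as "bookshelf."

```python
import re


def wildcard_to_regex(item: str) -> re.Pattern:
    """Turn a search item containing * wild cards into a whole-word pattern."""
    parts = [re.escape(part) for part in item.split("*")]
    body = r"\w*".join(parts)  # * stands for any run of letters or digits
    return re.compile(rf"\b{body}\b", re.IGNORECASE)


def search(pairs, item, max_hits=10):
    """Return up to max_hits (source, target) pairs whose source matches item."""
    pattern = wildcard_to_regex(item)
    hits = [(src, tgt) for src, tgt in pairs if pattern.search(src)]
    return hits[:max_hits]


# Two sentence pairs quoted in this paper, with Pinyin as the target language.
pairs = [
    ('"And now for the garden!"', '"xian4zai4 gai1 dao4 hua1yuan2 li3qu4 la5!"'),
    ("I suspected that he said this to let me know my changed status.", "..."),
]
for src, tgt in search(pairs, "now"):
    print(src, "|", tgt)
```

Because the pattern is anchored at word boundaries, a search for "now" retrieves the first pair but not the second, where the letters occur only inside "know."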
Figure 2. Parallel concordance of "now": Chinese character output
Figure 3. Parallel concordance of "now": Chinese Pinyin output
The concordance output is in sentence-by-sentence format, which consists of pairs of English and Chinese
sentences, one being the translation of the other in the pair. The text can be edited on screen and saved as
text files for further studies.
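By contrast with the sentence-by-sentence format, the Key-Word-In-Context option centres the search item in a fixed window of text. A rough Python sketch of that layout follows; the function name and the width parameter are assumptions, and the actual screen layout of E-C Concord may differ.

```python
import re


def kwic(sentence: str, keyword: str, width: int = 30) -> list[str]:
    """Return one line per whole-word occurrence, with the keyword centred."""
    lines = []
    pattern = rf"\b{re.escape(keyword)}\b"
    for m in re.finditer(pattern, sentence, re.IGNORECASE):
        left = sentence[max(0, m.start() - width):m.start()]
        right = sentence[m.end():m.end() + width]
        # Pad both contexts so that keywords line up in a column.
        lines.append(f"{left:>{width}} {m.group()} {right:<{width}}")
    return lines


for line in kwic("She found that she was now about two feet high", "now", width=20):
    print(line)
```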
PARALLEL CONCORDANCING FOR LEXICAL LEARNING
More than one and a half centuries ago, von Humboldt (1836/1988) pointed out that "we cannot, properly
speaking, teach a foreign language: all we can do is create the conditions under which it can be awakened
in the soul" (p. 236). Using Humboldt's insights, and based on the data generated by the concordancer,
Johns (1991) proposed a new language-learning approach, which he called Data-Driven Learning (DDL).
The DDL approach puts emphasis on the inductive acquisition on the part of students of grammatical
rules or regularities through the process of analysing the patterns of language use of specially selected
items as revealed through corpora (Johns, 1991; Tribble & Jones, 1990). Johns's remark "Every student a
Sherlock Holmes" implies that the role of the learner has changed in DDL: A learner is a researcher,
testing hypotheses and revising them in the light of data; a learner is a detective, finding and interpreting
linguistic clues. DDL can focus on different aspects of language. This paper focuses on lexical learning
using DDL. The following is an example of what a learner can detect by analysing parallel concordance
data.
The lexical item studied here is the adverb xian4zai4 (now), as it is a very common and important word,
but one not satisfactorily covered by bilingual dictionaries. Some differences in the use of xian4zai4 and
"now" in the two languages are discussed below.
One hundred and twenty-eight examples were found in four different texts (novels). Forty examples were
randomly selected from them, and were classified into several groups. The idea was to ask Chinese
students at an intermediate English level to identify the linguistic bases of the grouping. In order to
compare Chinese characters with English words more clearly, the Pinyin transcription identifies its
separate "words."
The following abbreviations, as used by Li & Thompson (1981), were used in the examples:
Abbreviation   Term
T              translation
O              original
CRS            currently relevant state (le)
PFV            perfective aspect (le)
ASSOC          associative (de)
GEN            genitive (de)
CL             classifier
3sg            third person singular
Some of the above abbreviations were used because certain Chinese characters, such as those for de and
le, cannot be translated directly into English words. Furthermore, each of these two has two distinct
meanings which depend on the context. Many Chinese classifiers cannot be translated into English, as
they simply do not exist in English, where, for example, one speaks of "a herd of cows," but there is no
classifier for a single cow. The third person singular pronoun ta1 in Pinyin does not show the gender, so it
cannot be automatically translated into "he" or "she."
Eight Chinese students at the University of Birmingham were asked to accomplish the following tasks
concerning the adverb xian4zai4 (now).
Task 1
Look at the following data:
1. T: di2que4 shi4 zhe4 yang4: ta1 xian4zai4 zhi1 you3 shi2 ying1cun4 gao1 le5, ......
      truly be like this 3sg now only have ten inch high CRS
   O: And so it was indeed: she was now only ten inches high, …
2. T: shi4shi2shang4, ta1 xian4zai4 yi3 yuan3 bu4zhi3 jiu3 ying1chi3 gao1, ...
      in fact 3sg now already much not less than nine feet high
   O: in fact she was now rather more than nine feet high, …
3. T: ta1 wan2quan2 wang4ji4 le5 ta1 xian4zai4 bi3 tu4zi3 da4 shang4 yi1qian1 bei4,
      3sg completely forget PFV 3sg now compare rabbit big up a thousand times
   O: …quite forgetting that she was now about a thousand times as large as the Rabbit, …
4. O: wo3 xian4zai4 yi3jing1 cheng2 le5 ming2 fu4 qi2 shi2 de5 gong1ren2 ...
      I now already become PFV name agree that fact GEN worker
   T: I was now a bona fide worker …
Question: What underlying pattern can be detected in the above parallel texts?
What the students found was that, in the Chinese examples, xian4zai4 immediately follows the subject,
while in the English ones, now follows "subject + be." They were then asked whether this was always the
case. They carried out more concordancing and found that there was no such structure as "subject + verb
(be) + xian4zai4" in Chinese in the corpus. The conclusion they drew was that Chinese speakers should
pay special attention to the structure "subject + verb (be) + now" in English, as this structure does not
exist in Chinese. They also suggested that English speakers learning Chinese should avoid adding an
unwanted verb (be) to a Chinese sentence.
Task 2
5. T: "xian4zai4 gai1 dao4 hua1yuan2 li3qu4 la5!"
      now should go to garden into !
   O: "And now for the garden!"
6. T: "kuai4dian3, xian4zai4 jiu4 qu4!"
      quick now immediately go
   O: "Quick, now!"
7. T: ba3 ta1de5 tou2 tai2 gao1 -- xian4zai4 na2 bai2lan2di4 lai2 --
      make his head raise high now fetch brandy come
   O: Hold up his head -- Brandy now --
Question: Why are the English versions of the above sentences so much shorter than the Chinese ones?
The students found that in the English sentences various subjects and verbs around now were not present.
For example, "And now (I should head) for the garden," "Quick, now (you go there immediately)," and
"(You go and fetch some) Brandy now." In the Chinese translation, however, the words omitted from the English (shown here in parentheses)
were present, such as "should go to ... into" in Example 5, "immediately go" in Example 6, and "fetch ...
come" in Example 7. The students concluded that in Chinese the adverb xian4zai4 could not be used
independently, and some words not present in the English sentences were required in the Chinese
translation. They realised that certain structures which are acceptable in English are not acceptable in
Chinese, and vice versa. It seems that in the above Chinese sentences, 'the law of least effort' was not
followed.
Task 3
8. O: wo3 xian4zai4 bu4 chi1 zhi1shi4 wo3 bu4 xiang3 chi1 ta1 ba4 le5.
      I now not eat only I not want eat 3sg CRS
   T: But I didn't choose to just yet.
9. O: wo3 xian4 zai4 shi4 "zu3 zhang3" le5, geng4 zhu3yao4 de shi4 ......
      I now be group leader PFV even mainly GEN be
   T: Because I was "group leader" and, even more, …
10. O: wo3 xiang3 ta1 bu4 shi4 sui2 kou3 zhe4yang4 shuo1 de, ke3neng2 shi4 you3yi4shi4di4
       I think 3sg not be casually in this way speak ASSOC may be intentionally
       yao4 rang4 wo3 zhi1dao4 wo3 xian4zai4 bu4 tong2 yu2 guo4qu4 de shen1fen1.
       want let I know I now not same past ASSOC status
    T: I suspected that he said this to let me know my changed status.
11. O: na3me5, wo3 xian4 zai4 sheng1huo2 yu2 qi2jian1 de zhe4ge4 xin1 de sheng1cun2
       then I now live in between ASSOC this new ASSOC living
       huan2jing4 shi4 zen3yang4 de5 ne5?
       surroundings be what GEN?
    T: So what about my life in these new surroundings?
Question: What is missing from the English versions of the above sentences? Why?
The students easily found that xian4zai4 occurred in the Chinese text but now did not appear in the English
translation.
The students observed that the English translation in Example 8 simplified the original Chinese sentence.
There were two sentence structures parallel to each other in the Chinese sentence, the first stating the fact
that "I now (do) not eat," the second telling the reason "I (do) not want (to) eat." Having further studied
the extended context of the sentence in the original text, the students realised that the narrator of the
sentence was in a state of starvation most of the time, so to be able to choose whether to eat or not was
very satisfying, and the feeling was expressed through the parallel sentence structure. The English
translation used prospective contrast, and it simplified the sentence. The students felt that it was not as
expressive as the original Chinese sentence.
In Example 9, the students argued that now was not used in the English translation because the past tense
"was" was clear enough and now was not necessary. In the Chinese version, the combination "now ... le
(PFV)" served the same purpose as "was."
In Example 10, the students found that the Chinese version used contrastive structures twice: "casually in
this way speak" versus "intentionally want let I know" and "now" versus "past," but neither appeared in
the English translation. They argued that contrastive structures were frequently used in Chinese to make
the meaning of sentences absolutely clear, but in English quite often such structures were not used so as
to make sentences simpler.
In Example 11, the students found it logically reasonable that the word now did not appear in the English
translation: One could not live in the past in "new surroundings." Although it sounded redundant, the
word xian4zai4 should not be omitted from the Chinese sentence.
Having studied examples where xian4zai4 occurred in the Chinese original but now did not appear in the
English translation, the students carried out more parallel concordancing looking for examples where now
occurred in an English original but xian4zai4 did not appear in the Chinese translation. The following are
some examples they found:
12. O: "Now, Dinah, tell me the truth: did you ever eat bat?"
    T: "wei4, dai4na4, gen1 wo3 shuo1 shi2 hua4, ni3 chi1 guo4 bian1fu2 mei2you3?"
       wei (draw attention) Dinah to me say real words you eat PFV bat not
13. O: ...her face brightened up to think that she was now the right size for going through the little door
       into that lovely garden.
    T: xiang3 dao4 ta1 mu4qian2 de shen1cai2 zheng4hao3 neng2 tong1 guo4 na3 shan4 xiao3
       think 3sg in front of eyes ASSOC size right can go through that CL little
       men2, ke3yi3 jin4ru4 na3 ke3ai4 de hua1yuan2, ta1 xi3 xing2 yu2 se4.
       door can enter that love -ly garden 3sg joy reflect through (face) colour
14. O: She found that she was now about two feet high, ...
    T: ta1 fa1xian4 ci3ke4 zi4ji3 shen1 gao1 da4yue1 liang3 ying1chi3...
       3sg find this moment self body height about two feet
15. O: "Now tell me, Pat, what's that in the window?"
    T: "hao3le, gao4 su4 wo3, pa4 te4, chuang1zi3 li3 na3 dong1xi1 shi4 shen2me?"
       all right tell me Pat window in that thing be what
Having studied the examples, the students realised that xian4zai4 is not the only translation of now: it can
also be translated as mu4qian2 ("in front of eyes"), ci3ke4 ("this moment"), and possibly other words.
Sometimes now is used to draw attention rather than to refer to time, in which case it is rendered as wei4
("well" or "listen") or hao3le ("all right"). Discoveries like this help learners to become more aware of
the different uses of words in different contexts. Their L2 is less likely to become fossilised, and they
will be able to see more of the subtle differences between meanings and will try to avoid using cognate but
contextually inappropriate structures in the target language.
The above discussion shows that parallel concordance data can be used as teaching material for
Data-driven Learning. The teacher can either put data into groups for students to study, or ask
them to carry out concordancing on a particular lexical item, analyse the data, and submit
what they have found through the analysis.
CONCLUSION
Technically, parallel concordancing between English and Chinese has been established successfully, and
further tasks can be developed and tried out with students at different levels to increase their, and
their teachers', familiarity with the methodology. It is highly possible that the English-Chinese
Concordancer (Wang, 2000) can be extended to Japanese and Korean since, like Chinese, they use
ideograms rather than alphabetic letters. Experience suggests that the parallel concordancer is one of the
most powerful tools that computer science can offer to language researchers. The distinctive feature of the
Data-driven Learning approach to inductive language teaching is that the language data are primary, and
the teacher does not know in advance exactly what rules or patterns the learner will discover. DDL with
the support of parallel concordancing will help the learner to develop in-depth knowledge of lexical
meaning and use based on evidence from authentic language.
ABOUT THE AUTHOR
Wang Lixun was born in China. He was awarded a PhD in Computational Linguistics at the University of
Birmingham, UK, in 2000. His research interests include computer-assisted language learning; corpus
linguistics; Web-based language learning. He has developed the software English-Chinese Parallel
Concordancer, Bilingual Sentence Shuffler, and MatchUp. He has also developed his homepage and the
ECLEPT Web site. He currently works in the School of Arts and Social Sciences at The Open University
of Hong Kong.
E-mail: lxwang@ouhk.edu.hk
REFERENCES
Barlow, M. (1996a). Parallel texts in language teaching. In S. Botley, J. Glass, A. M. McEnery, & A.
Wilson (Eds.), Proceedings of teaching and language corpora 1996 (UCREL Technical Papers Volume
9; pp. 45-56). Lancaster, UK: University Centre for Computer Corpus Research on Language.
Barlow, M. (1996b). Corpora for theory and practice. International Journal of Corpus Linguistics, 1(1),
1-37.
Barlow, M. (2001). ParaConc [Computer software]. Houston, TX: Athelstan.
Humboldt, W. von. (1836/1988). On language: The diversity of human language-structure and its
influence on the mental development of mankind (P. Heath, Trans.). Originally published as the
introduction to Über die Kavi-Sprache auf der Insel Java (1836-1840). Cambridge, UK: Cambridge
University Press.
Johns, T. F. (1986). Microconcord: A language-learner's research tool. System, 14(2), 151-162.
Johns, T. F. (1991). Should you be persuaded -- two samples of data-driven learning materials. In T. F.
Johns & P. King (Eds.), Classroom concordancing (English Language Research Journal 4; pp. 1-13).
Birmingham, UK: Birmingham University.
Johns, T. F. (1993). Data-driven learning: An update. TELL & CALL, 1993(2), 4-10.
Johns, T. F. (1994). From printout to handout: Grammar and vocabulary teaching in the context of
data-driven learning. In T. Odlin (Ed.), Approaches to pedagogic grammar (pp. 293-313). Cambridge, UK:
Cambridge University Press.
King, P. (1989). The uncommon core: Some discourse features of student writing. System, 17(1), 13-20.
Li, C., & Thompson, S. (1981). Mandarin Chinese. Berkeley, CA: University of California Press.
Liang, X. M. (1997). SunmiPage ScanInsert OCR [Computer software]. Singapore: Computek
Enterprises Pte Ltd.
Roussel, F. (1991). Parallel concordances and tonic auxiliaries. In T.F. Johns & P. King (Eds.),
Classroom concordancing (English Language Research Journal 4; pp. 71-103). Birmingham, UK:
Birmingham University.
Rutherford, W. E. (1987). Second language grammar: Learning and teaching. London: Longman.
Scott, M. (2000). WordSmith Tools Version 3.0 [Computer software]. Oxford, UK: Oxford University
Press.
Tribble, C., & Jones, G. (1990). Concordances in the classroom: A resource book for teachers. London:
Longman.
Wang, L. X. (2000). English-Chinese Parallel Concordancer [Computer software]. Birmingham, UK:
University of Birmingham.
Woolls, D. (1998, July 24-27). Multilingual parallel concordancing for pedagogical use. Teaching and
Language Corpora 98 (pp. 222-227). Oxford, UK: Keble College.
Woolls, D. (1997). Multiconcord [Computer software]. Birmingham, UK: CFL Software Development.
Language Learning & Technology
http://llt.msu.edu/vol5num3/stjohn/
September 2001, Vol. 5, Num. 3
pp. 185-203
A CASE FOR USING A PARALLEL CORPUS AND CONCORDANCER
FOR BEGINNERS OF A FOREIGN LANGUAGE
Elke St.John
University of Sheffield, UK
ABSTRACT
This pilot study set out to determine whether a parallel corpus and a concordancer would
be appropriate tools to supplement a teaching programme of German at the beginners'
level in an unsupervised environment. In this instance, a beginner student of German was
asked to find satisfactory answers to unknown vocabulary and formulate appropriate
grammar rules for himself using the parallel corpus and concordancer as the only tools. It
is shown that these tools can be of great benefit for beginners.
AIMS AND OBJECTIVES
I describe a pilot study involving a beginner student of German who undertook a supplementary
unsupervised programme of learning German using a concordancer and a parallel corpus. I investigate
how a beginner student of German fares using a concordancer, Multiconcord (see King & Woolls, 1996;
St.John & Chattle, 1998), and a parallel German/English corpus, INTERSECT (Salkie, 1995) consisting
of the original German source texts and their English translations. The aim of this study was to determine
how this student copes using the parallel corpus and what conclusions he comes to when comparing the
two languages, and in particular, when investigating lexical items. As students at the beginner and
intermediate levels are still very dependent on a dictionary, their lack of vocabulary in the new language
can often cause problems for them in class. As a consequence, most of the questions set were related to
investigating the meaning of words (see Student Tasks).
Additionally, using corpora and a concordancer can be motivating and rewarding not only for the learner
but also for the teacher. For the teacher, these tools can provide contextualised examples that answer
confounding lexical questions. Moreover, the learner can develop an ability to "learn how to learn"
(Johns, 1991a, p. 1) by being allowed to assume the role of an explorer. This study supports Barlow's
(1995a, 1996a, p. 2) claim that one of the roles the language learner plays when using corpora is that of
a language researcher and explains why "a suitable research environment" must be provided (Barlow,
1996b, p. 45; see also Johns, 1986, p. 151, 1991a, p. 2). Such an environment assists the student in
exploring the language in great detail and thereby gaining further insights into its grammar and vocabulary.
The use of concordancing in language teaching is not new. However, this pilot study demonstrates for the
first time the potential of concordancing in learning German at the beginner's level.
CONCORDANCER AND CORPORA IN LANGUAGE ENVIRONMENTS
Concordancing is a technique that has been used extensively by linguistic and literary researchers. A
concordance is a list of the occurrences of a particular word, part of a word, or combination of words,
each presented in its context and drawn from a text corpus. A corpus is a large body of text, often in
electronic format (see Baker, 1995, p. 226; Francis, 1993, p. 138; Johansson, 1995, p. 19; Leech, 1991,
p. 8 for more detailed definitions).
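For readers who have not seen a concordance being produced, the definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any concordancer named in this article: the function name kwic, the width parameter, and the bracketed display format are assumptions made for the example, and the sample text simply reuses two English sentences quoted earlier in this issue.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`.

    Hypothetical helper for illustration: each line shows up to `width`
    characters of left context, the match in brackets, and up to `width`
    characters of right context.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

sample = (
    "She found that she was now about two feet high. "
    "Now tell me, Pat, what is that in the window?"
)

for line in kwic(sample, "now", width=20):
    print(line)
```

Because the search is restricted to whole words, a query for now does not return words such as know; a search for part of a word would drop the word-boundary markers from the pattern.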
Linguistic and applied linguistic researchers are not the only group who can benefit from the use of
concordancing as a tool for language learning (i.e., as a means of exploring the meanings and uses of
words in their authentic contexts; see Aston, 1997a; Tribble, 1997). A concordance program enables
research into the lexical, syntactic, semantic, and stylistic patterns of a language.
Concordancers and monolingual text corpora (comprising only one language) have already been employed
by both the language teacher and learner in classroom exercises. Typical exercises using a monolingual
English corpus have included vocabulary building and the exploration of the grammatical and discourse
features of texts. For specific descriptions of classroom activities (mainly for EFL teaching, however)
using a monolingual English corpus, see, for example, Aston (1997a, pp. 51-64), Mindt (1997, pp. 40-50),
Minugh (1997, pp. 67-82), Murphy (1996), Flowerdew (1993, 1996), Stevens (1991a, 1991b), Tribble
(1990), and Johns (1986, 1991a, 1991b). In a well-known quote, Johns advocates the DDL (Data-Driven
Learning) approach. The advantage of this approach is that, in a classroom situation, it enables the
teacher to play a less active role while at the same time exposing the student to authentic texts like those
found in a monolingual corpus:
What distinguishes the DDL approach is the attempt to cut out the middleman as much as
possible and give direct access to the data so that the learner can take part in building his
or her own profiles of meanings and uses. The assumption that underlies this approach is
that effective language learning is itself a form of linguistic research, and that the
concordance printout offers a unique resource for the stimulation of inductive learning
strategies -- in particular, the strategies of perceiving similarities and differences and of
hypothesis formation and testing. (Johns, 1991b, p. 30)
Experiments in data driven learning and corpus-based methods (e.g., Baker, Francis, & Tognini-Bonelli,
1993; Barlow, 1995b, 1996a; Dickens & Salkie, 1996; Lewandowska-Tomaszczyk & Melia, 1997; Salkie,
1995, 1996; Tognini-Bonelli, 1996; Wichmann, Fligelstone, McEnery, & Knowles, 1997) are beginning
to bear fruit in a wide range of language environments although there is as yet only a limited amount of
experience on which to draw regarding learning German using a parallel corpus.
Monolingual corpora have already been used to teach German.
Dodd (1997) exploits a corpus of written German for advanced language learning. After browsing
through a raw corpus, his students compare corpus evidence with reference works. Dodd concludes that a
computer-supported investigation of language corpora provides a powerful and simple tool for language
learning. Fernández-Villanueva (1996) used a German monolingual corpus of oral language to research
the function of German particles. She describes it as a very positive experience because it allows students
to investigate the function of the particles, which do not have a direct equivalent in their mother tongue.
Wichmann (1995) used a monolingual English corpus for teaching German and sorting out problems of
lexical choice. She proposes the use of both corpora and a concordancer because dictionaries do not provide
enough information on meaning in context (see Barlow, 1996b, p. 54). However, Wichmann's study does
not explain what kind of exercises she set her students.
Parallel corpora (sometimes also called translation corpora) have already been successfully used by
linguistic researchers for their research into the nature of translation. Zanettin (1994) focuses on the use of
concordancing software on bilingual English/Italian parallel subcorpora to design language activities
aimed at developing translation skills. As in this pilot study, he emphasises that concordancing programs
"can be run by students at any time in a self-access environment, provided that instructional sheets
explaining the background for the activity are supplied" (p. 108). Salkie (1996) also employs a parallel
corpus to investigate grammar problems but concentrates on epistemic modality in English and French.
Dickens and Salkie (1996) compare French/English bilingual dictionaries with a parallel corpus and show,
as this study does, how many equivalents a single word can actually have. Barlow (1996a)
discusses research based on the analysis of parallel texts (English/Spanish) with particular regard to the
translation of reflexive pronouns. He also advocates some uses for parallel texts in the language
classroom, as is done in this study. The unifying theme in his article is the notion that the use of
corpora and a concordancer allows everyone, from the theoretical linguist to the student learning a second
language, to become a researcher (p. 2). This notion is actually combined in the present study because the
student observed is both a linguist (his major) and a language learner. In the analysis in Meaning of
Particles (tasks 2-6), the student discovers that there are many English equivalents for a certain German
particle. This reflects Barlow's (1996b, p. 53) observation that a basic search for concordances can make
students aware that the French translation of head is not always tête. Barlow (1996b, p. 54) concludes that
a parallel text provides an online contextualised dictionary, which language learners can exploit in a
similar way to that demonstrated in the student's tasks 2-6 under Meaning of Particles.
Danielsson and Ridings (1996) report on their tool for work in parallel corpora (Scandinavian
languages/English) and their efforts to integrate it into an academic programme for training translators.
However, parallel corpora have not only been used for research into translation and translator training
(see Baker, 1993, 1995; Buyse, 1997; Piotrowska, 1997; Schmied, 1994; Ulrych, 1997), they can also
prove very useful to non-advanced language learners, as this pilot study will endeavour to demonstrate.
Finally, McEnery, Wilson, and Baker (1997) examine how corpora can meet the needs of grammar teaching
at the pre-tertiary level in the UK. In general, they come to the conclusion that a corpus should at least
be integrated into teaching. They further conclude that "corpus data present a means by which grammar
teaching may be more effective -- and more importantly may be rated more positively by learners"
(p. 15).
It can be seen from the literature that parallel corpora have already been successfully employed in a
number of studies. However, German/English parallel corpora have not yet formed part of such a study.
In the present study, the student had to research a set of questions on his own. What is novel about this
form of classroom concordancing is that the student is at beginner's level, works on his own, and uses a
German/English parallel corpus as opposed to a monolingual corpus. The parallel corpus was used not only
for investigating patterns in the language he was learning, but also for comparing it with his mother
tongue and drawing conclusions from the comparison.
BACKGROUND
Corpus Used in This Study
The German-English INTERSECT corpus (Salkie, 1995) which was used for this study has about
800,000 words and comprises the following files:
Table 1. Composition of the INTERSECT Corpus (the file not used in the present study is marked with *)
file name    content                                                comment
Dbank        Annual bank reports                                    Hoechst, BASF, Siemens
newsapr      News reports                                           From the "German News" Web site
newsjan      News reports                                           From the "German News" Web site
Euro         EU texts
UN           United Nations documents
hertzgog     Transcripts of speeches by the President of Germany    Spoken (President Herzog)
Basiclaw*    Constitution texts                                     Germany, Switzerland, & Austria
The student worked with six files only. The constitution texts which are also part of the INTERSECT
Corpus were not used because of the complexity of German legal language structures.
The corpus includes a variety of text types including spoken language, and it is thus both appropriate and
sufficient for this pilot study because tendencies rather than rules are discovered. Corpus size is obviously
a matter of considerable discussion and is not the point of this particular paper but the subject of further
research. However, the problem with large corpora for language learners, especially beginners and
intermediate students, is that concordances of frequent words can easily become too long and
meaningless. This can be very demotivating for the beginner student. Aston comments in this respect that
"work with small specialised corpora can not only be a valuable activity in its own right, as a means of
discovering the characteristics of a particular area of language use, but also an instrument to help and train
learners to use larger ones appropriately" (1997b, p. 61). The use of a small corpus has both advantages
and disadvantages: Since the amount of data searched is relatively small, any observations on frequency
of occurrence may be ungeneralisable, while on the other hand it avoids a proliferation of examples,
particularly of common words which would prove too daunting to learners. When using a small corpus,
the obvious strategy to employ is to focus on common words. In comparing the corpus with dictionaries,
this is a logical approach in any case: if the corpus gives some clues about which words occur fairly often,
this is in itself useful information as will be shown in the analysis.
Student
As already mentioned, I decided to only use one student for this particular study for several reasons. The
literature review already shows that beginner language students had not previously been involved in
corpus-based studies. In my view, it would present too great a risk if several students were included in the
very first experiment of this kind. As with other new technologies before it, such as the language
laboratory, a step-by-step introduction is probably most effective. As Flowerdew puts it:
There is a danger of the enthusiasm for concordancing being inflated to such an extent
that concordancing is seen as a sort of language teaching panacea. (1996, p. 112)
Therefore, carefully conducted evaluative studies will ensure that such an inflated view will not prevail. A
study carried out on a small scale, such as this one, will be able to offer proper guidance to large-scale
studies using concordance tools.
Furthermore, in a beginners' class, where the students are generally less confident than in an intermediate
or advanced class, it is usually more difficult to encourage and motivate them to take part in a project. A
project involving new technologies would, in my judgement, present an even greater threat to the students.
Just as Stevens (1995, p. 2) divides language teachers into three groups, namely those who have never
heard of concordances, those who have not yet taken them seriously, and those who actively use them,
students could be divided into the same groups with beginner students most likely falling into the first
group. Therefore, caution needs to be exercised when starting a project involving relatively new
technology. I therefore decided to introduce only one new variable at a time, starting in this study with a
beginner with a background in linguistics. I then propose to introduce a second variable (a beginner with
no linguistic background, e.g., a student majoring in science) in a future experiment.
Out of all non-specialist language learners I teach, I considered a student with his main subject in
linguistics to be most appropriate in this instance, rather than a student majoring, for example, in science.
It is generally agreed that in a beginner class, one of the teacher's tasks is to maintain the students' interest
in the language concerned. A project of this kind could prove counter-productive and possibly discourage
non-linguists. The student observed in this study had just finished his first year at university studying
linguistics with German as a subsidiary subject. At the beginning of the project, he had already completed
one year of German at university (3 hours a week) and his level of German was approximately equivalent
to basic GCSE level. However, it has to be stressed that this level is achieved within 1 year of intensive
study at university in comparison to an average of 4 years at school. It is also important to mention that
the student, unlike many other so-called "false beginners," had no knowledge of German before studying
the language at university. The student was one of the best students in his year and fond of grammar.
However, there were still doubts about whether his level of German would be good enough to cope with
some of the questions set. In particular, the language was thought to be too difficult as it was at a level to
which only more advanced learners are exposed. Consultation with the student revealed that he actually
regarded the project as a challenge.
The parallel corpus and the parallel concordancer were the learner's only resource. In the process of
answering his set of questions, he was able to teach himself how to use the concordancer without using a
manual and went on to describe the program as very user-friendly.
Student Tasks
Since the reference works most often used by undergraduate students of foreign languages seem to be
dictionaries, one of the student's first tasks consisted of word or phrase searching. In this instance, he had
to enter the word/phrase he wanted to examine. The software would then browse through the corpus of
texts and look for the wanted expression in the search language while the correspondence would be
shown in the target language parallel to the search language. Unlike KWIC (Key Word In Context)
concordancers, which show the search word centralised in a single line of text, the format for the parallel
display is the sentence and paragraph, with the results of each search being given as parallel sentences or
paragraphs. This is mainly because, although the context word is known in the search language, there is
no way of knowing where in the target language paragraph the relevant correspondence word will appear
or, indeed, if it appears at all. There is even the possibility that the required word or words may appear in
a preceding or following sentence, rather than the equivalent single sentence of the search language. In
this pilot study, the emphasis is on the behaviour of words in context in both German and English.
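The sentence-level parallel display described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Multiconcord's actual implementation: the corpus is assumed to be already aligned sentence by sentence, the names ALIGNED and parallel_concordance are hypothetical, and the sentence pairs are taken from the examples reported in the Analysis section of this paper.

```python
import re

# Hypothetical, already sentence-aligned corpus: (file, German, English).
# The pairs are examples quoted in this paper; the data structure is assumed.
ALIGNED = [
    ("dbank", "Wie ist diese Differenz zu bewerten?",
     "How is such a spread to be assessed?"),
    ("dbank", "Was ist die EWU?", "What is EMU?"),
    ("dbank", "Was ist die Alternative zur EWU?",
     "What is the alternative to EMU?"),
    ("herzgog", "Wir stehen also nicht ohne Orientierung da.",
     "Thus we do not stand here devoid of orientation."),
]

def parallel_concordance(pairs, search_word):
    """Return every aligned pair whose German sentence contains `search_word`.

    The whole English sentence is returned because nothing tells us where
    (or whether) the corresponding word appears on the target side.
    """
    pattern = re.compile(r"\b" + re.escape(search_word) + r"\b", re.IGNORECASE)
    return [(name, de, en) for name, de, en in pairs if pattern.search(de)]

for name, de, en in parallel_concordance(ALIGNED, "also"):
    print(f"{name}.de  {de}")
    print(f"{name}.en  {en}")
```

The sketch returns whole sentences on both sides rather than a centred keyword line, which mirrors the reason given above: the position of the equivalent in the target language cannot be known in advance, so the learner has to read the aligned sentence to find it.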
The student had 17 tasks to choose from. If one question/search produced too many hits, he went on to the
next task, which again suggests that too large a corpus would not be appropriate for a non-advanced learner
(see Aston, 1997b, p. 61). From the hits of the other tasks, he selected only sentences he could easily
understand. Considering the learner's degree of proficiency, the level of the corpus as a whole was
probably too demanding for him, but he sensibly employed a strategy of finding his own level in the
corpus by searching for shorter sentences. The examples in this paper show this.
ANALYSIS
Introduction
The set of tasks consisted of common lexical and grammar problems usually encountered by beginner
students and was therefore considered as appropriate for this study. The following results show how the
student coped with the given resources and whether he managed to find appropriate answers without the
input and guidance of the teacher.
Task 1
The very first question the student was recommended to choose was based on two phrases that are often
introduced in the first lesson of a beginner class when students have to learn phrases of introduction such
as Wie ist Ihr Name? (What's your name?) and Wie ist Ihre Telefonnummer? (What's your telephone
number?). Both interrogatives in the two questions are translated into English as what and the student was
asked whether it is a pattern that wie always translates as what and not how as described in dictionaries.
After using just wie and was as the search words in the input field of the interface which produced too
many hits, the student decided to enter ist in the context. He subsequently came up with the following
data and comments:
dbank.de 1a   Wie ist diese Differenz zu bewerten?
dbank.en 1b   How is such a spread to be assessed?
dbank.de 2a   Wie ist die Option „runde Wechselkurse" zu bewerten?
dbank.en 2b   How is the option "round exchange rates" to be assessed?
dbank.de 3a   Was ist die EWU?
dbank.en 3b   What is EMU?
dbank.de 4a   Was ist die Alternative zur EWU?
dbank.en 4b   What is the alternative to EMU?
In general was translates into English as "what." However, anyone with a basic
knowledge of German knows that there are cases where wie equates to "what" in English.
The examples in the question show this. The system did provide examples where wie
translates as "how" and from this evidence a student of German would conclude that, in
general, wie equals "how" in English except in certain cases.
The above phrases were recommended to the student as the basis of his very first question because it
required a simple search for a particular phrase with which the student was very familiar and involved a
simple examination of the meaning. It is also worth pointing out that the student felt sufficiently
independent to go a step further when there were too many hits for was and wie: he inserted ist into the
context field of the interface in order to reduce the number of hits. Even though it was the very first
question, the student did not ask for the tutor's assistance but tried to find a solution for himself,
which is also very rewarding from a teacher's perspective.
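The student's strategy of adding a context word to cut down the number of hits can be illustrated with a short sketch. It rests on an assumption: the context word is required to occur within a few words of the search word. How Multiconcord itself defines its context field is not described here, so the window parameter and the function name hits_with_context are hypothetical. The second sentence pair is an invented filler so that the filter has something to exclude; the other two pairs are quoted from the data above.

```python
import re

PAIRS = [
    ("Wie ist diese Differenz zu bewerten?",
     "How is such a spread to be assessed?"),
    # Invented filler pair, not from the INTERSECT corpus.
    ("Wie viele Länder bestehen die Prüfung?",
     "How many countries pass the examination?"),
    ("Was ist die EWU?", "What is EMU?"),
]

def tokens(sentence):
    """Lower-cased word tokens of a sentence."""
    return re.findall(r"\w+", sentence.lower())

def hits_with_context(pairs, word, context=None, window=3):
    """Keep pairs whose German side contains `word`; if `context` is given,
    it must also occur within `window` tokens of some occurrence of `word`."""
    word = word.lower()
    context = context.lower() if context else None
    result = []
    for de, en in pairs:
        toks = tokens(de)
        positions = [i for i, tok in enumerate(toks) if tok == word]
        if not positions:
            continue
        if context is None or any(
            context in toks[max(0, i - window):i + window + 1]
            for i in positions
        ):
            result.append((de, en))
    return result

print(len(hits_with_context(PAIRS, "wie")))         # all hits for wie
print(len(hits_with_context(PAIRS, "wie", "ist")))  # narrowed by context
```

Narrowing a search in this way trades coverage for manageability, which matters for a beginner: fewer hits are returned, but every one of them contains the pattern under investigation.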
Meaning of Particles
In the next set of tasks, the learner was asked to find out how certain German modal particles and
conjunctions translate into English. In this case, all he had to search for was a particular particle and then
examine the correspondence. Doherty (1982, p. 95) stated that the English language has no equivalents
for these modal particles, so it was interesting to see what solutions the learner would actually provide.
Task 2
The first search term was wohl which produced 57 hits altogether (see Table 1 in Appendix A). The
particle wohl gives the sentence a sense of uncertainty that is required in these kinds of texts (Helbig,
1994, p. 238). What was striking was that 41 of the 57 hits occurred in the dbank file alone. One would
probably expect to find most hits in the dbank file considering that in financial reports many forecasts are
made for future years that are based on hypotheses. The student produced many concordances and also
categorised them (see Appendix B). He commented as follows:
"Wohl" produced an interesting batch of searches. The general trend was that "wohl"
introduced doubt into the sentence/paragraph. These were broken down into: "Wohl erst";
" wohl aber/aber wohl"; "wohl auch/wohl auch nicht"; "werden wohl"; "wohl nicht."
When the English translations were read in conjunction with the German, it was noticed
that most of the sentences tended to say: "probably"; "will probably"; "may well"; "is
likely" etc.
The general feeling when reading these sentences/paragraphs is one of doubt or caution
and the word "wohl" appears with one of the aforementioned words.
From a teacher's point of view, the student's investigations are more than satisfactory because he managed
to deduce the right meaning and quite rightly discovered the uncertainty of wohl. His attention was not,
however, drawn to the fact that the majority of the hits were in the dbank file. The comments show
nevertheless in what detail the student observed the concordance output. It becomes apparent that he no
longer writes about a translation as in the first search. He probably started to realise that there is not
always a one-to-one equivalent available. This can be very rewarding for the teacher who might find it
very frustrating that s/he is not always able to provide the student with one definite answer. The student's
comments also show how reading in the foreign language is practised whilst searching through the target
language to find patterns. It is moreover interesting to note how the student grouped the different
meanings of wohl according to its collocation and meaning.
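The kind of grouping the student carried out by hand, sorting hits by the words that accompany wohl, can be approximated by counting the immediate neighbours of the search word. The sketch below is illustrative only: the four German sentences are invented for the example and are not taken from the INTERSECT corpus or from Appendix B, and the function name neighbour_patterns is hypothetical.

```python
import re
from collections import Counter

# Invented example sentences (not corpus data), chosen so that some of the
# student's categories ("werden wohl", "wohl nicht", "wohl erst") occur.
INVENTED = [
    "Die Zinsen werden wohl steigen.",
    "Das wird wohl nicht reichen.",
    "Er kommt wohl erst morgen.",
    "Das ist wohl nicht möglich.",
]

def neighbour_patterns(sentences, node):
    """Count two-word patterns formed by `node` and its left or right neighbour."""
    node = node.lower()
    counts = Counter()
    for sentence in sentences:
        toks = re.findall(r"\w+", sentence.lower())
        for i, tok in enumerate(toks):
            if tok != node:
                continue
            if i > 0:
                counts[f"{toks[i - 1]} {node}"] += 1
            if i + 1 < len(toks):
                counts[f"{node} {toks[i + 1]}"] += 1
    return counts

for pattern, freq in neighbour_patterns(INVENTED, "wohl").most_common():
    print(freq, pattern)
```

A frequency list of this kind only proposes candidate groups. Deciding that wohl nicht and werden wohl both express doubt still requires the learner to read the aligned translations, as the student did.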
Task 3
The next search term was also, which can be used either as a particle or an adverb depending on syntax
and context. It gives a sentence a sense of conclusion and is also used as a connective particle between
two successive sentences (Helbig, 1994, pp. 86-87). Furthermore, also belongs to the category of false
friends (Pascoe & Pascoe, 1985, p. 12). Beginner students very often translate it into English as also,
whereas auch is in fact the correct German word for also.
The student's search produced 74 hits altogether, probably too many for a low-level student to work
through (see Table 2 in Appendix A). The student decided to work only on the following output and
followed it with his explanations:
dbank.de 5a   Es kann also kaum einen Zweifel daran geben, daß die EWU kommt - wenn
              der politische Wille stark genug ist und genügend Länder die
              Aufnahmeprüfung bestehen.
dbank.en 5b   There can, therefore, be little doubt that EMU will come - if the political will
              is strong enough and a sufficient number of countries pass the convergence
              examination.
dbank.de 6a   Da sich also der Umstellungskurs an Devisenmarktkursen orientieren wird, ist
              durch die Festlegung der Umrechnungskurse weder ein Gewinn noch ein
              Verlust zu erwarten.
dbank.en 6b   Since, therefore, the conversion rate will be geared to forex market rates,
              fixing the conversion rates should produce neither a profit nor a loss.
The examples, "also" revealed a pattern and the English translation was "therefore." The position in the
sentence in German corresponded with the position in English in almost all cases. It would appear from
the searches that, when "also" translates as "therefore," the relative position of the word in both languages
is the same or very near.
Another pattern appeared where "also" translated as "thus." This was deduced because there appeared to
be no other function for the word in the sentence. Unlike the translation "therefore," the relative position
in each language varied. However, the translation could be worked out by reading the German and then
the English. When the two were then compared, a deduction was made. The examples below demonstrate
this.
herzgog.de 7a
Auch in Zukunft muß das Motto also heißen: Freiheit ist das höchste Gut.
herzgog.en 7b
Thus, in the future as well, our motto must be: Freedom is our most precious
asset.
herzgog.de 8a
Wir stehen also nicht ohne Orientierung da.
herzgog.en 8b
Thus we do not stand here devoid of orientation.
The learner discovered its correct function as a modifier in at least four examples. Although the question
did not ask for a pattern in terms of word order, the learner mainly concentrated on this aspect. This
might be because the student had a linguistic background and a natural interest in exploring further,
but it also illustrates a valuable aspect of using concordances with students, namely the experience
they gain of how the languages operate. It also demonstrates that he was examining and comparing the
languages and developing some insight into both simultaneously. This example also shows that
the English translations can prove very useful to the learner.
Task 4
The next search word was eben, which produced only 15 hits, 11 of them in the herzgog file (see
Table 3 in Appendix A).
Eben is used as an adjective, adverb, or particle; in the latter case, its meaning is very difficult to
determine (Helbig, 1994, p. 124). The student discovered this for himself. The particle use is not
found much in written language, which is why most hits occurred in the herzgog file, that is, the
transcription of President Herzog's speech. König remarks in this respect that some scalar particles like
eben "have a wider use in English than their German 'counterparts,' in other words, some particles in
English will have several translational equivalents in German" (1982, p. 79). Thus the exact opposite can
apply when working from English to German. It is interesting to examine the student's findings:
herzgog.de 9a
Und nach allem, was ich eben über das europäische Erbe gesagt habe, wäre
eine undemokratische Lösung auch eine uneuropäische Lösung.
herzgog.en 9b
And after all that I have just said about the European inheritance, an
undemocratic solution would also be an "un–European" solution.
herzgog.de 10a
man wechselte eben zu anderen.
herzgog.en 10b
one simply changed to others.
"Eben" appeared only 15 times in the corpora. In the searches below, "eben" seems to
equate to "just" in English. When not translated exactly as "just," "just" seems to be
implied as in example 10 where "eben" equates to "simply": "simply" could easily be
replaced with "just" and carry the same meaning.
Looking through the other examples from the corpora, there were many interpretations,
which could have been made for the translation of "eben."
These data and his comments suggest that the student is becoming aware that a word may not be
lexicalised at all in the other language. This is a very important linguistic insight for a student to
gain when starting to study a foreign language. The fact that he was not taught this but found it out
for himself is one of the most valuable aspects of concordancing and, from the teacher's point of view,
very satisfying. It is not easy for the teacher to tell students that there is simply no translation
available. It is more rewarding for both sides if the student can find this out himself.
Task 5
The next search term was the particle doch, which produced 170 hits altogether (see Table 4 in Appendix
A). The particle doch has seven different uses as a modal particle (Helbig, 1994, pp. 111-119). Its main use
is adversative, in contradictions (Helbig, 1994, p. 119). The student chose to work on the following
output:
newsjan.de 11a
Schmuggelplutonium stammt angeblich doch aus Moskau.
newsjan.en 11b
Smuggled plutonium indeed from Moscow
newsjan.de 12a
Nichts fuer sensible Gemueter - aber leider doch passiert
newsjan.en 12b
Not for the faint-hearted - but it did happen
dbank.de 13a
Zwar ist es seiner Meinung nach zu früh, um einen Erfolg oder ein
mögliches Scheitern der EWU vorauszusagen, doch sieht er die Strukturen,
auf denen die EWU aufbaut, als durchaus vernünftig an.
dbank.en 13b
Although it is too early to tell, in his opinion, whether or not EMU will
succeed, its design does make sense.
dbank.de 14a
Dies hätte zweifellos negative Auswirkungen auf Spaniens
Haushaltsposition, doch wären diese sehr viel geringer als im Falle Italiens.
dbank.en 14b
While a collapse of EMU would undoubtedly have a negative impact on
Spain's budget position, the effect would be considerably smaller than in the
case of Italy.
He described the output as follows:
What seemed to be evident was that the word "doch" had a modifying effect on the
sentence. In the sentences, "doch" seems to refer to words like "indeed" and "did."
In other examples, "doch" has many uses: One of which is to add a positive nature to a
sentence. In trying to find a trend for its use in German, there was also evidence that it
had a positive modifying effect on a sentence. However, this was not the only use for the
word. It soon became clear that "doch" is used in a variety of subtle ways to shape a
phrase or sentence. Some good examples of the versatility of "doch" can be seen when it
is used at the beginning of a sentence. In some of these examples, "doch" translates into
"but" in English.
The above example again shows that not only detailed reading in the target language is practised when
using corpora but also that text analysis is employed merely by going through the data and trying to find
patterns when analysing the sentences carefully.
Task 6
Strictly speaking, this was not a set task but a search initiated by the learner himself. It shows
that the learner adopted a very interesting pattern of behaviour, which might be ascribed to the fact that he
has a linguistic background. After searching for several German particles, the student had spotted however
several times in the English translations and became curious about its German equivalents. As a result, he
carried out a search for however to investigate what translations the system would come up with:
The purpose of this exercise was to test the corpora when the search words found by the
system varied. Here, the English word "however" was entered and the search found
different German translations.
When this happened, it was decided to try and cross-reference the German word in each
case. The corpora produced many other examples for each word but the idea here was to
test whether the same reference could be found in the corresponding German search.
In this way the student can find what he/she wants to know by using a search in either
language. This is useful if the student is weak in either language and needs to find a
particular answer.
dbank.en 15a
Formal participation in the exchange rate mechanism is, however, a binding
condition of the Treaty.
dbank.de 15b
Die formale Teilnahme am Wechselkursverbund ist jedoch im Vertrag
zwingend vorgeschrieben.
dbank.en 16a
However, a lasting improvement will probably only occur when Switzerland's
economic outlook brightens.
dbank.de 16b
Allerdings sollte eine nachhaltige Stärkung wohl erst einsetzen, wenn auch
die konjunkturellen Perspektiven der Schweiz sich verbessern.
dbank.de 17a
Die formale Teilnahme am Wechselkursverbund ist jedoch im Vertrag
zwingend vorgeschrieben.
dbank.en 17b
Formal participation in the exchange rate mechanism is, however, a binding
condition of the Treaty.
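The cross-referencing the student describes (search in English, note the German equivalent, then search for that German word and check that the same sentence pair comes back) can be illustrated with a short Python sketch. It is an illustration only and does not describe how the concordancer used in the study works internally: the list PAIRS, the helper names, and the list of candidate German words are assumptions introduced here, and the two sentence pairs are examples 15 and 16 above.

```python
import re

# Hypothetical aligned corpus: (German sentence, English sentence).
# Both pairs are quoted from examples 15 and 16 in this article.
PAIRS = [
    ("Die formale Teilnahme am Wechselkursverbund ist jedoch im Vertrag "
     "zwingend vorgeschrieben.",
     "Formal participation in the exchange rate mechanism is, however, a "
     "binding condition of the Treaty."),
    ("Allerdings sollte eine nachhaltige Stärkung wohl erst einsetzen, wenn "
     "auch die konjunkturellen Perspektiven der Schweiz sich verbessern.",
     "However, a lasting improvement will probably only occur when "
     "Switzerland's economic outlook brightens."),
]


def has_word(word, sentence):
    """True if `word` occurs in `sentence` as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def search(word, side, pairs=PAIRS):
    """Indices of pairs whose German ('de') or English ('en') side contains `word`."""
    column = 0 if side == "de" else 1
    return [i for i, pair in enumerate(pairs) if has_word(word, pair[column])]


def cross_reference(en_word, de_candidates, pairs=PAIRS):
    """Map each German candidate to the pairs found by BOTH directions of search."""
    forward = set(search(en_word, "en", pairs))
    return {de: sorted(forward & set(search(de, "de", pairs)))
            for de in de_candidates}


print(cross_reference("however", ["jedoch", "allerdings"]))
```

A pair index listed under a German word means that the English search and the reverse German search both lead to that sentence pair, which is the check the student carried out by hand.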
The student came to a constructive conclusion and the fact that he carried out a cross-reference shows his
interest in research and exploring. Here it would be most interesting to see whether a more typical
language learner, that is, one without a linguistic background, would behave in the same way. It also
demonstrates the fact that concordances allow students to generate and collate the language data needed
to invent their own rules of grammar and to develop the most appropriate ways of learning for
themselves. This example clearly shows that the learner assumed control of his learning process. Once the
student had seen how to use the program, he could, to a certain extent, set his own agenda for its use, as
illustrated above with however and the cross-reference research.
Grammatical and Lexical Tasks
Task 7
Another trouble spot for English learners of German is the distinction between aber and sondern, both
of which translate into English as "but." For this reason, the student was asked to find a possible
semantic and/or syntactic distinction between the two. The concordance below helped him to grasp the
difference almost by himself.
There were 576 hits for aber, whereas sondern showed only 178. The latter has a more specific use (it
only occurs after a preceding negative clause), which may explain the smaller number of hits (see Tables 5
and 6 in Appendix A). Aber is also used as a co-ordinating conjunction and has two different uses as a modal
particle (Helbig, 1994, pp. 80-81). This may be another reason why there are more hits for aber overall.
However, the frequency of a particle obviously also depends on the nature of the text. The subject
came up with the following data and conclusion:
dbank.de 18a
Schuldenstand: Rückläufig, aber immer noch hoch
dbank.en 18b
Public sector debt: falling, but still high
dbank.de 19a
Aber kann man da sicher sein?
dbank.en 19b
But can we be sure here?
"Aber" translates into English as "but" when the sentence uses "but" as a straight-forward
conjunction linking two main clauses of the sentence. The examples show the use of
"aber" and the search produced many more examples. English also uses the word "but"
when the German uses the word "sondern." "Sondern" is used in a different way to "aber"
although it still translates as "but." "Sondern" is used when the sentence has a negative
preceding the word.
euro.de 20a
Deshalb strebten die Gründerväter der Europäischen Gemeinschaften nach einer
gemeinsamen Energiepolitik: nicht etwa als Selbstzweck, sondern als Motor
für die politische Integration.
euro.en 20b
Appreciating its importance, the founding fathers of the European Community
desired an energy policy not only for itself but also as a motor for political
integration.
euro.de 21a
Der Europäische Rat gibt seiner ernsten Besorgnis Ausdruck über die
anhaltende Gewalt im Gebiet der Großen Seen, von der nicht nur Ost-Zaire,
sondern auch Burundi betroffen ist.
euro.en 21b
The European Council expresses grave concern about the continuing violence in
the Great Lakes Region, not only in Eastern Zaire but also in Burundi.
As with "aber," there were many examples in the files which could also have been shown here. The
system did throw up what looks like an exception. However, this could be correct in the context of this
particular sentence and without the time to explore more of the files, it will have to remain an exception in
this project:
dbank.de 22a
Um die — nicht kurzfristig, möglicherweise aber auf längere Sicht —
bestehenden Risiken von Sanktionen im Rahmen des Stabilitätspakts zu
minimieren…
dbank.en 22b
In order to minimize the risks of sanctions in the framework of the stability
pact — not so much on a short-term horizon but possibly over the longer
term…
The student observed quite rightly that aber, and not sondern, occurred after nicht. In a classroom
situation, students typically react negatively to the introduction of an exception to a rule but, by taking
control of his own learning, the student even analysed the exception he found. From the
teacher's point of view, too, it is a better outcome to let students search for exceptions than merely
to present the exceptions to them. The students' reaction will be more positive, and learners should in turn
be motivated if they can find such things for themselves, though it could be argued that, for this particular
question, a monolingual context would be sufficient. The student, however, bearing in mind his level,
always found it helpful to have the translation available. He also mentioned in his feedback that he
learned new words by reading both the German and the English translation.
With regard to distinctions that are made in the target language but do not exist in the learner's language,
Barlow comments on a distinction in Spanish that has no counterpart in English:
By studying the context of instances of English for that correspond to Spanish por,
compared with those that correspond to para, it is possible to form hypotheses about
which of the meanings of for match up with por and which with para. (1996b, p. 54)
As can be seen above, the strategy Barlow describes is exactly the one practised by the learner, who
managed to work out the distinction for himself.
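The rule the student arrived at in Task 7 (sondern follows a negation, aber usually does not) can be checked mechanically against concordance lines. The Python sketch below is a rough illustration under stated assumptions: the short list of negation words and the function name are introduced here for the example, and the test sentences are fragments of examples 18a, 20a, 21a, and 22a above.

```python
import re

# Assumed, deliberately small list of German negation words; not exhaustive.
NEGATION = re.compile(r"\b(?:nicht|kein\w*|nie|niemals)\b", re.IGNORECASE)


def negation_before(sentence, conjunction):
    """Report whether a negation word precedes the first `conjunction`.

    Returns True or False, or None if the conjunction does not occur at all.
    """
    hit = re.search(r"\b" + re.escape(conjunction) + r"\b", sentence,
                    re.IGNORECASE)
    if hit is None:
        return None
    return NEGATION.search(sentence[:hit.start()]) is not None


# (example label, conjunction, fragment of the German sentence)
EXAMPLES = [
    ("18a", "aber", "Schuldenstand: Rückläufig, aber immer noch hoch"),
    ("20a", "sondern", "nicht etwa als Selbstzweck, sondern als Motor "
                       "für die politische Integration."),
    ("21a", "sondern", "von der nicht nur Ost-Zaire, sondern auch Burundi "
                       "betroffen ist."),
    ("22a", "aber", "nicht kurzfristig, möglicherweise aber auf längere Sicht"),
]

for label, word, text in EXAMPLES:
    print(label, word, negation_before(text, word))
```

Lines in which aber follows a negation, as in the fragment of 22a, are the apparent exceptions the student noticed; a check of this kind only flags them and leaves the interpretation to the learner.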
Task 8
The student had to find a meaning for denn, a word many beginner students tend to equate with then,
especially when it occurs at the beginning of a sentence. Denn occurred 75 times in all files together (see
Table 7 in Appendix A).
It has seven different uses as a modal particle and is also used as a causal conjunction and as an adverb
(Helbig, 1994, pp. 105-110). The learner's comment on his data was that denn at the beginning of the
sentence translates as for. However, he also discovered that it occurs between commas together with the
words es sei. He concluded quite rightly that it then always translates as unless (see Appendix C).
In the searches here "denn" at the beginning of the sentence translates as "for," however
"denn" occurs within commas with the words: "es sei." This translates on all occasions as
"unless."
It is very interesting that the student discovered that denn collocates with es sei, that is, es sei denn. This
demonstrates that concordancing makes hidden structures visible and stimulates the imagination.
Task 9
The last question the student chose was more challenging and complex in my judgement. He was asked to
find possible meanings for man. With the word man, German has a very useful all-purpose impersonal
pronoun that the 242 hits in all files reflect. The student produced the following data (see Table 8 in
Appendix A):
herzgog.de 23a
Aus der eigenen Geschichte lernt man immer noch am besten.
herzgog.en 23b
One's own history teaches one the best lesson.
herzgog.de 24a
Man sah weg, als jüdischen Ärzten und Rechtsanwälten die Zulassung
entzogen wurde;
herzgog.en 24b
One looked away when Jewish doctors and lawyers lost their licences;
newsjan.de 25a
Ausserdem koenne man den Menschen nicht anlasten, dass sie keine Arbeit
faenden.
newsjan.en 25b
She added that no one could be held liable for not being able to find a job.
The subject wrote afterwards:
"Man" generally translated into English as the pronoun "one." It appeared in different
places in the sentence including the beginning. The examples demonstrate this very
clearly. The examples also show that "man" does not always have an apparent translation
but when the sentence is read as a whole, it would appear that "man" is being used to
refer to the general idea, i.e. "it" or the situation in general. Furthermore "man" tends to
refer to "people," "nobody," "we" (the people). There were many examples in the corpora
like these when "man" was used to refer to someone or something.
His comments show once more how concerned he was about word order. They further indicate that he
again looked for a translation but in the end accepted that there is not always a translation for one
particular word.
This analysis provides an illustration of how the common content of parallel corpora can be exploited to
gain linguistic insights into the structure and function of languages. However, it must also be stressed
again that only one student used the corpora and concordancer on a self-access basis. Multiconcord was
installed on a computer in the Self-Access-Centre where the student could use it as and when he wanted
during open hours. Given that there was no tutor observation during the project period, even of the data
that the learner ultimately produced, it is remarkable to see that a beginner student of German can actually
discover and learn on his own. To return to my initial question, all his answers can be regarded as fully
satisfactory and appropriate with regard to the language learning process. For most questions, the student's
conclusions were the only correct answer. However, the student may have shown a natural interest in
exploring the data in more detail because his main subject of study was linguistics, so any generalisations
drawn from this study need confirmation. The next step would be to include students from other subjects,
such as engineering or science, and to see whether they come to the same or similar conclusions before
expanding the experiment to a whole beginners' class.
Student Observation and Feedback
The student was interviewed after the pilot study; there was no student/teacher interaction during the
project time. The learner found the concordancer very user-friendly and he did not use any tools other
than the corpora and the concordancer. He later said that he ignored sentences that were too difficult due
to a long and complicated word order, that is, he selected the sentences he wanted to use for the data,
which in itself is a very important "help yourself" learning strategy. Indeed, the data used consisted of
sentences only without any complex structures.
This obviously means that the student's analysis is incomplete because, in order to reach reliable
conclusions, all data should be considered and analysed. However, it was certainly a step forward in the
learning process of a beginner student as it enabled him to draw certain conclusions about the language
based on short and simple sentences. It was interesting to see that the student used the corpus in two
ways: to answer the set questions and to look up things that were not directly related to the questions, for
example his search for however.
He spent on average 2 hours on each question, but he noticed that he became more efficient with each
question. His explanation for this was that he became more confident over time, that he knew
what to do, and that he became more used to the system. He also knew what to look for because he
became more selective, choosing shorter sentences. The learner followed this procedure:
After selecting a question, he first tested how many hits the search term produced. If there were enough
hits but not too many to cope with, he assembled the concordanced evidence in both languages. He then
tried to find prominent features and classified them into up to four categories. The student then saved the
sentences and/or printed them out. He tried to discover a pattern in the language and, by generalising,
to find the rules that governed those patterns (see Johns, 1991a, p. 4). The student's work became more
exploratory and experimental, and thus more motivating. In addressing the theme of this study, that is,
whether corpora and a concordancer are appropriate tools at beginners' level, it can be said that the student
not only found the meaning of the search words (i.e., learned new vocabulary), but also had the
satisfying feeling of having achieved something.
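The procedure the learner followed (count the hits, keep the shorter sentences, collect the evidence in both languages, and sort it into a few categories) can be sketched in Python as follows. This is a minimal illustration, not a description of the software used in the study: the in-memory list CORPUS, the function names, the word-count limit, and the choice of English equivalents as categories are all assumptions made for the example, and the sentence pairs are taken from examples 5, 7, 8, and 10 above.

```python
import re
from collections import defaultdict

# Hypothetical aligned corpus: (file name, German sentence, English sentence).
CORPUS = [
    ("dbank",
     "Es kann also kaum einen Zweifel daran geben, daß die EWU kommt - wenn "
     "der politische Wille stark genug ist und genügend Länder die "
     "Aufnahmeprüfung bestehen.",
     "There can, therefore, be little doubt that EMU will come - if the "
     "political will is strong enough and a sufficient number of countries "
     "pass the convergence examination."),
    ("herzgog",
     "Wir stehen also nicht ohne Orientierung da.",
     "Thus we do not stand here devoid of orientation."),
    ("herzgog",
     "Auch in Zukunft muß das Motto also heißen: Freiheit ist das höchste Gut.",
     "Thus, in the future as well, our motto must be: Freedom is our most "
     "precious asset."),
    ("herzgog",
     "man wechselte eben zu anderen.",
     "one simply changed to others."),
]


def has_word(word, sentence):
    """True if `word` occurs in `sentence` as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def concordance(word, corpus, max_words=None):
    """Aligned rows whose German side contains `word`.

    If `max_words` is given, longer German sentences are skipped, mirroring
    the learner's preference for short sentences.
    """
    hits = [row for row in corpus if has_word(word, row[1])]
    if max_words is not None:
        hits = [row for row in hits if len(row[1].split()) <= max_words]
    return hits


def classify(hits, equivalents):
    """Group hits by the first listed English equivalent found in the translation."""
    groups = defaultdict(list)
    for row in hits:
        label = next((e for e in equivalents if has_word(e, row[2])), "other")
        groups[label].append(row)
    return dict(groups)


hits = concordance("also", CORPUS)
print(len(hits), "hits for 'also'")
for label, rows in classify(hits, ["therefore", "thus"]).items():
    print(label, len(rows))
```

Passing max_words to concordance reproduces the learner's filtering step; leaving it out returns every hit, which is what a complete analysis of the data would require.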
CONCLUSION
In this paper, I have shown the use of parallel corpora and concordance software, and in particular their
usefulness in the very early stages of language acquisition for teacher and learner alike. Learners
often pose questions, and arrive at answers, that teachers cannot predict. A corpus and a concordancer can
supplement the teaching. As Johns put it, "we simply provide the evidence needed to answer the learner's
questions, and rely on the learner's intelligence to find answers" (1991a, p. 2).
In view of this student's level of proficiency in German, it was the correct decision to concentrate
mainly on lexical questions. These were indeed neither easy nor straightforward. This pilot study suggests
that, when the translation is available, even beginner students can make use of concordancing. In most
cases (except in the search for however), German was the search language and English was used to help
understand the German.
In this pilot study, the selected student might be regarded as a rather untypical learner, and
further research must therefore involve more typical language learners to find out whether low-level
language students can generally cope with corpus work. When a study is carried out on a bigger scale,
the two groups of typical and untypical learners will have to be clearly distinguished. It was important,
however, to first carry out a pilot study of this kind with one student to avoid any possible failures, which
could have led to demotivation among the students. This experiment must be seen as a pilot study for
designing more carefully prepared, objective, large-scale experiments. For that reason, I would like to
address the following issues:
Firstly, the data and subsequently the answers obtained here are relevant and appropriate for this
particular pilot study. The data represents language that has been used in authentic and naturally occurring
communicative situations.
Secondly, the conclusions cannot be generalised because of the nature of the student and also because
the student did not consider all the data. The choice of student has also affected the outcome; a study
on a bigger scale is needed to provide an answer.
Finally, this study supports Zanettin's (1994, p. 108) claim that the interactive concordancer is a potential
learning resource, which can be used freely and on their own initiative by all students from beginner to
advanced in a self-access centre. The role of the teacher/language adviser is to suggest points at which the
interactive concordancer may help to solve learning difficulties or, with instructional sheets, to explain the
background for the activity and to give operational directions.
The use of a parallel corpus and concordancing in the early stages of a German learning programme can
add to grammar teaching and certainly make work with new vocabulary more interesting and
rewarding. As already stated, the study should preferably be repeated with a larger number of students and
with other types of students before conclusions are drawn as to whether a non-advanced learner of German
can actually benefit from using the concordancer and a parallel corpus. I nevertheless strongly believe that
corpora and concordancing are of great potential value in the very early stages of a language learning
programme, and I am confident that further studies will support my claim.
APPENDIX A
Table 1
WORD       FILE       HITS
WOHL       all        57
           dbank      41

Table 2
WORD       FILE       HITS
ALSO       All        74

Table 3
WORD       FILE       HITS
EBEN       All        15
           Herzgog    11

Table 4
WORD       FILE       HITS
DOCH       All        170

Table 5
WORD       FILE       HITS
ABER       All        576

Table 6
WORD       FILE       HITS
SONDERN    all        178

Table 7
WORD       FILE       HITS
DENN       all        75

Table 8
WORD       FILE       HITS
MAN        all        242
APPENDIX B
Wohl erst
dbank.de 1a
Der Großteil des privaten Bankgeschäfts wird aber wohl erst umgestellt, wenn auch die
Euro-Banknoten und Münzen eingeführt werden.
dbank.en 1b
The bulk of retail banking business will, however, probably not make the switch until
euro notes and coins are introduced.

Wohl aber/ aber wohl
dbank.de 2a
Daran wird die EWU aber wohl nicht scheitern.
dbank.en 2b
But EMU is unlikely to fail because of this.
dbank.de 3a
1996 nicht signifikant, wohl aber in Relation zum IEP.
dbank.en 3b
this does not apply to the IEP.

Wohl auch
dbank.de 4a
Beide Staaten werden wohl auch hohe D-Mark-Anteile in der Reservehaltung aufweisen.
Dies wird vor allem für Österreich vermutet.
dbank.en 4b
Both countries, particularly Austria, probably hold a large proportion of their reserves in
DEM.

Wohl auch nicht
dbank.de 5a
Die Stärke eines Finanzplatzes hängt allerdings nicht nur von der Marktgröße ab, also von
der Höhe der Staatsverschuldung eines Landes, sie sollte es wohl auch nicht.
dbank.en 5b
However, a financial centre's strength and attractiveness does not (and should not!) solely
depend on the amount of government paper available, i.e. on the size of the public debt.

Werden wohl
dbank.de 6a
Der Anteil an den offiziellen Devisenreserven der Welt wird wohl über das Niveau der
jetzigen Währungen des Wechselkursmechanismus, das bei etwa achtzehn Prozent liegt,
hinaus anwachsen.
dbank.en 6b
Its share in world foreign exchange reserves may well rise to a level above the combined
18 per cent of the major ERM currencies today.

Wohl nicht
dbank.de 7a
Zweifel an der Erfüllung des Maastrichter Zinskriteriums bestehen wohl nicht mehr; die
weit weniger als in den EWS-Kernländern vorangeschrittene Zinskonvergenz könnte
vielmehr in absehbarer Zukunft eine treibende Kraft der irischen
Kapitalmarktbewegungen bleiben.
dbank.en 7b
Doubts about Ireland meeting the Maastricht interest rate criterion appear to have
vanished: interest rate convergence, which has not progressed in Ireland nearly as far as
in the EMS core countries, could well remain a driving force in the Irish capital market
in the foreseeable future.
APPENDIX C
herzgog.de 8a
Denn die Zukunft gestaltete sich anders, als es die meisten am 8. Mai 1945 erwarteten,
auch anders, als es dem soeben zitierten Dichterwort eigentlich entsprochen hätte.
herzgog.en 8b
For the future turned out differently from most people's expectations on 8 May 1945
and from the image conveyed by the prayer I have just quoted.
un.de 9a
Denn es mag für unseren Planeten, der nunmehr aus anderen Gründen nach wie vor in
Gefahr schwebt, nicht noch eine dritte Chance geben.
un.en 9b
For there may not be a third opportunity for our planet which, now for different reasons,
remains endangered.
dbank.de 10a
ob das Verhältnis des geplanten oder tatsächlichen öffentlichen Defizits zum
Bruttoinlandsprodukt einen bestimmten Referenzwert überschreitet, es sei denn, daß
entweder das Verhältnis erheblich zurückgegangen ist.
dbank.en 10b
whether the ratio of the planned or actual government deficit to gross domestic product
exceeds a specified reference value, unless either the ratio has declined substantially.
dbank.de 11a
Die schwedische Regierung dürfte 1997 mit einem „sanften Nein" gegen die EWU
stimmen, es sei denn, es gelingt ihr, die schwedischen Wähler umzustimmen.
dbank.en 11b
The Swedish government is likely to opt for a "soft no" to EMU in 1997, unless it is
able to reverse public opposition to the single currency.
ABOUT THE AUTHOR
Elke St.John is German Co-ordinator at the Modern Languages Teaching Centre at the University of
Sheffield in the United Kingdom. Her research interests include corpus-based translation studies,
corpus-based learning, and legal translation.
E-mail: E.StJohn@sheffield.ac.uk
REFERENCES
Aston, G. (1997a). Enriching the learning environment: Corpora in ELT. In A. Wichmann, S. Fligelstone,
T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 51-64). New York: Longman.
Aston, G. (1997b). Small and large corpora in language learning. In B. Lewandowska-Tomaszczyk & P. J.
Melia (Eds.), Practical applications in language corpora (pp. 51-62). Lodz, Poland: University Press.
Baker, M. (1993). Corpus linguistics and translation studies -- Implications and applications. In M. Baker,
G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology (pp. 233-250). Philadelphia: John
Benjamins.
Baker, M. (1995). Corpora in translation studies: An overview and some suggestions for future research.
Target, 7(2), 223-243.
Baker, M., Francis, G., & Tognini-Bonelli, E. (Eds.). (1993). Text and technology. Philadelphia: John
Benjamins.
Barlow, M. (1995a). A guide to ParaConc. Houston, TX: Athelstan.
Barlow, M. (1995b). A concordancer for parallel texts. Computers and Texts, 10, 14-16.
Barlow, M. (1996a). Corpora for theory and practice. International Journal of Corpus Linguistics, 1(1),
1-37.
Barlow, M. (1996b). Parallel texts in language teaching. In S. Botley, J. Glass, T. McEnery, & A. Wilson
(Eds.), Proceedings of teaching and language corpora 1996 (pp. 45-56). Lancaster, UK: UCREL
Technical Papers Volume 9.
Buyse, K. (1997). The study of multi- and unilingual corpora as a tool for the development of translation
studies: A case study. Unpublished doctoral dissertation, Katholieke Universiteit Leuven, Belgium.
Danielsson, P., & Ridings, D. (1996). Corpus and terminology: Software for the translation program at
Göteborgs Universitet or getting students to do the work. In S. Botley, J. Glass, T. McEnery, & A. Wilson
(Eds.), Proceedings of teaching and language corpora 1996 (Technical Papers Volume 9; pp. 57-67).
Lancaster, UK: UCREL.
Dickens, A., & Salkie, R. (1996). Comparing bilingual dictionaries with a parallel corpus. In M.
Gellerstam, J. Järborg, S. G. Malmgren, K. Norén, L. Rogström, & C. Röjder Papmehl (Eds.), EURALEX
'96 proceedings I-II (pp. 551-559). Göteborg, Sweden: Göteborg University Department of Swedish.
Doherty, M. (1982). Epistemische Ausdrucksmittel im Deutschen und Englischen [Epistemic means of
expressions in German and English]. Fremdsprachen, 26, 92-97.
Dodd, B. (1997). Exploiting a corpus of written German for advanced language learning. In A.
Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 131-145). New York: Longman.
Fernández-Villanueva, M. (1996). Research into the functions of German modal particles in a corpus. In
S. Botley, J. Glass, T. McEnery, & A. Wilson (Eds.), Proceedings of teaching and language corpora 1996
(Technical Papers Volume 9; pp. 83-93). Lancaster, UK: UCREL.
Flowerdew, J. (1993). Concordancing as a tool in course design. System, 21(2), 231-244.
Flowerdew, J. (1996). Concordancing in language learning. In M. Pennington (Ed.), The power of CALL
(pp. 97-113). Houston, TX: Athelstan.
Francis, G. (1993). A corpus driven approach to grammar -- principles, methods and examples. In M.
Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and technology (pp. 137-156).
Amsterdam/Philadelphia: Benjamins.
Helbig, G. (1994). Lexikon deutscher Partikeln [Encyclopedia of German particles]. München, Germany:
Langenscheidt.
Johansson, S. (1995). Mens Sana in corpore sano: On the role of corpora in linguistic research. The
European English Messenger, 4(2), 19-25.
Johns, T. (1986). Micro-concord: A language learner's research tool. System, 4(2), 151-162.
Johns, T. (1991a). Should you be persuaded: Two examples of data driven. ELR Journal 4, 1-16,
University of Birmingham.
Johns, T. (1991b). From printout to handout: Grammar and vocabulary learning in the context of data-driven learning. ELR Journal 4, 27-45.
King, P., & Woolls, D. (1996). Creating and using a multilingual parallel concordancer. Translation and
Meaning, 4, 459-466.
König, E. (1982). Scalar particles in German and their English equivalents. In W. F. W. Lohnes & E. A.
Hopkins (Eds.), The contrastive grammar of English and German (pp. 76-101). Ann Arbor, MI: Karoma
Publishers.
Leech, G. (1991). The state of the art in corpus linguistics. In K. Aijmer & B. Altenberg (Eds.), English
corpus linguistics: Studies in hon
Lewandowska-Tomaszczyk, B., & Melia, P. J. (Eds.). (1997). Practical
applications in language corpora. Lodz, Poland: Lodz University Press.
McEnery, T., Wilson, A., & Baker, P. (1997). Teaching grammar again after twenty years: Corpus-based
help for teaching grammar. ReCALL, 9(2), 8-16.
Mindt, D. (1997). Corpora and the teaching of English in Germany. In A. Wichmann, S. Fligelstone, T.
McEnery, & G. Knowles (Eds.), Teaching and language corpora (pp. 40-50). New York: Longman.
Minugh, D. (1997). All the language that's fit to print: Using British and American newspaper CD-ROMs
as corpora. In A. Wichmann, S. Fligelstone, T. McEnery, & G. Knowles (Eds.), Teaching and language
corpora (pp. 67-82). New York: Longman.
Murphy, B. (1996). Computer, corpora and vocabulary study. Language Learning Journal, 14, 53-57.
Pascoe, G., & Pascoe, H. (1985). Sprachfallen im Englischen. Wörterbuch der falschen Freunde
[Difficulties in English. Dictionary of false friends.]. München, Germany: Hueber.
Piotrowska, M. (1997). Criteria for selecting parallel texts in teaching a translation course. In B.
Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in language corpora (pp. 411-420). Lodz, Poland: Lodz University Press.
Salkie, R. (1995, May). INTERSECT: A parallel corpus project at Brighton University. Computers &
Texts 9, 4-5.
Salkie, R. (1996). Modality in English and French: A corpus-based approach. Language Sciences, 18(1-2),
381-392.
Schmied, J. (1994). Translation and cognitive structures. Hermes, Journal of Linguistics, 13, 169-181.
Stevens, V. (1991a). Classroom concordancing: Vocabulary materials derived from relevant, authentic
text. English for Specific Purposes Journal 10, 35-46.
Stevens, V. (1991b). Concordance-based vocabulary exercises: A viable alternative to gap-filling. ELR
Journal, 4, 47-61.
Stevens, V. (1995). Concordancing with language learners: Why? When? What? CAELL Journal, 6(2), 2-10.
St.John, E., & Chattle, M. (1998). Multiconcord: The Lingua Multilingual Parallel Concordancer for
Windows. ReCALL Newsletter, 13, 7-9.
Tognini-Bonelli, E. (1996). Towards translation equivalence from a corpus linguistics perspective.
International Journal of Lexicography, 9(3), 197-217.
Tribble, C. (1990). Concordancing in an EAP writing program. CAELL Journal, 1(2), 10-15.
Tribble, C. (1997). Improvising corpora for ELT: Quick-and-dirty ways of developing corpora for
language teaching. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in
language corpora (pp. 106-117). Lodz, Poland: Lodz University Press.
Ulrych, M. (1997). The impact of multilingual parallel concordancing on translation. In B. Lewandowska-Tomaszczyk & P. J. Melia (Eds.), Practical applications in language corpora (pp. 421-435). Lodz,
Poland: Lodz University Press.
Wichmann, A. (1995). Using concordances for the teaching of modern languages in higher education.
Language Learning Journal, 11, 61-63.
Wichmann, A., Fligelstone, S., McEnery, T., & Knowles, G. (Eds.). (1997). Teaching and language
corpora. New York: Longman.
Zanettin, F. (1994). Parallel words: Designing a bilingual database for translation activities. In A. Wilson
& T. McEnery (Eds.), Corpora in language education and research: A selection of papers from Talc94
(Technical Papers, Volume 4; pp. 99-111). Lancaster, UK: UCREL.
Language Learning & Technology
http://llt.msu.edu/call_for_papers.html
September 2001, Vol. 5, Num. 3
p. 204
Call for Papers for Special Issue of LLT
Theme: Distance Learning
Guest Editor: Margo Glew
This special issue of Language Learning and Technology will focus on all aspects relating to distance
teaching and learning of languages and how both processes are best facilitated in distance education
courses. Articles must report on original empirical research in this area, or address issues in the theory
and practice of implementing distance education language courses.
Suggested topics include, but are not limited to
• the educational context for distance learning of languages
• pedagogically effective practices for distance education
• crucial elements of effective distance language courses
• issues of student assessment and program evaluation in distance education
• new technologies in distance language learning
Please note that all articles published in LLT, including in this special issue, should either report on
original research or present an original framework that links previous research, educational theory, and
teaching practices.
Please send an e-mail of intent with a 250-word abstract by January 31, 2001 to Margo Glew
(glewmarg@msu.edu).
Language Learning & Technology is published exclusively on the World Wide Web. You may see
current or back issues, and take out your free subscription, at http://llt.msu.edu.
Copyright 2001, ISSN 1094-3501