Corpus Linguistics
Developing a PolyU Language Bank
Sherman Lee
egslee@inet.polyu.edu.hk
PI: Grahame Bilbow
Thanks to: Chris Greaves, Raymond Cheung, Li Lan
Outline
Background
Goals of corpus linguistics
Types of corpora
Applications of corpus analysis
As an illustration
Exploring units of meaning
Case study
Developing a PolyU Language Bank
Aims and objectives of project
Similar existing projects
Procedures
The PolyU Language Bank
Current status
Sample corpora
Sample search
Goals of corpus linguistics
Chomskyan linguistics vs Corpus linguistics
Langue (competence) vs Parole (performance)
Ideal speaker/hearer vs Complexity/variation
Language = innate mental faculty vs Language = social phenomenon
Intuitive evidence vs Empirical evidence
Universals vs Differences
Grammar vs Meaning
Basic tools
Corpus: a systematic collection of speech or writing that is built according to explicit design criteria for a specific purpose
cf. EAGLES broad definition: a corpus can potentially contain any text type, incl. word lists, dictionaries, etc.
Concordancer: search engine (e.g. WordSmith; SARA)
Concordance: occurrences of search item, displayed in list with immediate context shown
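To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch in Python. It is not how WordSmith or SARA work internally; the function name `concordance`, the `width` parameter and the sample sentences are all invented for illustration.

```python
import re


def concordance(text, term, width=30):
    """Return keyword-in-context lines for every occurrence of `term`."""
    lines = []
    # Match the search item as a whole word or phrase, ignoring case.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Invented sample text, not taken from any real corpus.
sample = (
    "Two soldiers were killed by friendly fire. "
    "The report blamed friendly fire for the losses."
)
for line in concordance(sample, "friendly fire", width=20):
    print(line)
```

Each output line shows one occurrence of the search item with up to `width` characters of context on either side.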
Types of corpora
Written vs Spoken
General vs Specialised
e.g. ESP, Learner corpora
Monolingual vs Multilingual
e.g. Parallel, Comparable
Synchronic vs Diachronic; Monitor

Annotated vs Unannotated
Written corpora
Specialised corpora
Other examples of available corpora
Some applications of corpus analysis
Language teaching & learning
Empirical teaching data: authentic examples of language use
Reference source: answering learners' questions or explaining learner errors:
What's the difference between at last and in the end?
How is hardly used?
Preparation of teaching materials e.g. vocabulary lists, CLOZE tests
CALL; concordancing and data-driven learning
Translation
Using parallel texts to find suitable translation equivalents
Creation of translation databases or glossaries for domain-specific terminology, e.g. business, law, science
Exploring units of meaning in texts
Linguistics and language research
Lexicography & lexical studies e.g. relative word frequency
Language variation e.g. linguistic features across registers
Grammar: corpora used as data to test hypotheses, syntactic theory
Pragmatics & discourse e.g. CA of discourse features in spoken (conversational) data
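As one illustration of the lexical studies mentioned above, relative word frequency can be sketched in a few lines of Python. The tokenizer is deliberately crude, and the names `tokenize` and `relative_frequencies`, the `per` parameter and the sample text are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def relative_frequencies(text, per=1_000_000):
    """Map each word to its frequency per `per` tokens of running text."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count * per / total for word, count in counts.items()}


# Invented sample text; a real study would read a whole corpus.
sample = "At last the rain stopped. In the end the match was cancelled."
freqs = relative_frequencies(sample, per=100)
print(freqs["the"])
```

Reporting counts per fixed number of tokens, rather than raw counts, is what allows words to be compared across texts of different lengths.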
Exploring meaning, units of meaning
Focus on meaning because:
People interested in the meanings of texts, in how language is actually used in discourse
Meaning is a key problem for translation, language learning, information management
What are basic units of meaning?
Language teaching (TEFL): vocabulary often introduced in the form of new single words
Words considered to be basic units of meaning
Is the word an ideal unit of meaning?
If you dog a dog during the dog days of summer, you'll be a dog tired dog catcher
Can I sit down? My dogs are barking
Most lexical errors made by language learners result from failure to deal with ambiguities of single words
Unambiguous Units of Meaning
Notion of an Unambiguous Unit of Meaning necessary for understanding meaning
UUoM = keyword and all words in the context that contribute to making the word unambiguous
Compounds, idioms, multi-word units, collocations, set phrases
Often determined by a syntactic pattern
Adj + N: friendly fire, closing remarks
V + N: invite proposals, draw conclusions
Adv + Adj: politically correct, environmentally friendly
N + of + N: cause of death, proof of identity, code of practice, duty of care
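The N + of + N pattern above can be approximated without any linguistic annotation. The sketch below simply counts every "word of word" sequence; the function name `of_phrases` and the sample text are invented. Because it assumes no part-of-speech tags are available, it will also pick up non-noun sequences such as "out of the".

```python
import re
from collections import Counter


def of_phrases(text):
    """Count 'word of word' sequences; a rough stand-in for N + of + N."""
    # Assumption: no part-of-speech tags are available, so any word may
    # fill either slot. Tagged data would allow filtering to nouns only.
    pattern = re.compile(r"\b([a-z]+) of ([a-z]+)\b")
    return Counter(f"{a} of {b}" for a, b in pattern.findall(text.lower()))


# Invented sample text for illustration.
sample = (
    "The cause of death was recorded. Proof of identity is required. "
    "The cause of death remains unknown."
)
counts = of_phrases(sample)
for phrase, n in counts.most_common():
    print(n, phrase)
```

Sequences that recur often in a large corpus are candidates for units of meaning, to be checked by reading their concordance lines.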

Case study
Search for units of meaning in online dictionaries and corpora
friendly fire
environmentally friendly
Corpora from 1990s
British National Corpus (BNC)
100,000,000+ words
Written (90%)
Extracts from regional/national newspapers, specialist periodicals, academic books, popular fiction, un/published letters, memos, school/university essays
Spoken (10%)
Informal conversation, formal meetings (business, government), radio shows, phone-ins
The Times (1995, Jan-March)
10,220,367 words
Written: business, home news, readers' letters, reviews
Corpora from 1960s-1970s
Brown corpus / LOB corpus
Each 1 million words
Written, balanced corpora of 15 genres of text
Search results
What the results show
friendly fire, environmentally friendly
Represent fairly new concepts
Occur in the newer corpora (1990s) as units of meaning
Occur as entries in some of the online dictionaries only (not bilingual dictionaries)
New terminology and terms of common usage not always recorded in dictionaries and termbanks
One way of using corpora for learning and translation:
Use corpus evidence to help students recognise units of meaning; introduce notion of units of meaning into language learning
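Comparing a phrase across corpora of very different sizes (e.g. 1 million vs 100 million words) calls for normalised counts. The following sketch computes occurrences per million words. The function name `per_million` is hypothetical, and the two toy texts are invented stand-ins; they are not data from the BNC, The Times, Brown or LOB.

```python
import re


def per_million(text, phrase):
    """Occurrences of `phrase` per million word tokens in `text`."""
    lowered = text.lower()
    tokens = re.findall(r"\w+", lowered)
    if not tokens:
        return 0.0
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    hits = len(re.findall(pattern, lowered))
    return hits * 1_000_000 / len(tokens)


# Invented toy texts standing in for an older and a newer corpus.
corpora = {
    "older": "The friendly dog sat by the fire all evening.",
    "newer": "The troops came under friendly fire near the border.",
}
for name, text in corpora.items():
    print(name, per_million(text, "friendly fire"))
```

In the toy "older" text the two words occur only separately, so the phrase scores zero there and is found only in the "newer" text.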
Aims of PULB project
To design and build an archive of language corpora = language bank
To be used by staff and students in the department
For teaching, language learning and research purposes
To provide a user-friendly platform
A WWW interface via which users can freely access the language bank
With browse, search and concordance facilities
Ingredients of PULB
Sources: standard corpora, departmental collections
Medium: written texts, transcribed spoken data
Language types: native speaker, learner corpora
Languages: English, Chinese, Japanese, French, German
Genres: business, law, academia, media, social, literature
Target size: 30 million words (European) / characters (Asian)
Why a language bank? What's in it for us?
Free and simple shared access to a collection of language corpora
That you can utilise for your teaching
Authentic examples of language use at your fingertips
Empirical teaching data covering different specialisms (ESP, EAP)
That you can utilise for your research
A ready-made collection of data waiting for you to work on
Saving on time and resources
Way of incorporating new methods and information technology into the department's teaching and research activities
Increase students' awareness of this rapidly developing methodology / branch of language studies (corpus linguistics, corpora studies)
Way of integrating theory with technology in the classroom
Train students to be more computer-literate
All of the above can
Motivate students to become active learners
Help students to learn the target language more effectively (cf. goals of DDL)
Similar existing projects
W3 Corpora Project (Essex)
http://clwww.essex.ac.uk/w3c/
Access to corpora (Gutenberg texts, LOB, LOB-tagged)
Web interface for performing searches
Online tutorial and info on corpus linguistics
Web Concordancer (VLC, PolyU)
http://vlc.polyu.edu.hk/concordance/
Access to a variety of corpora and texts (bilingual/parallel corpora, news, Bible, works of fiction)
Web interface for performing searches
Directions for PULB
Build a language bank with features that parallel those of similar sites
~ VLC
Bring together corpora and texts of various types and genres, of different languages
~ Essex
Make available different facilities for different categories of users (cf. legal considerations)
Provide on-site tutorial, corpora-based info
Include extra features
Allow searches in multiple texts / corpora simultaneously
Some form of parallel concordancing
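The two extra features listed above could look roughly like this in Python. This is only a sketch under stated assumptions, not the PULB implementation: the function names `search_corpora` and `parallel_concordance`, the corpus names and all sentences are invented. The parallel texts are assumed to be already aligned sentence by sentence.

```python
def search_corpora(corpora, term):
    """Return {corpus name: matching lines} for every corpus with a hit."""
    term = term.lower()
    results = {}
    for name, lines in corpora.items():
        # Plain substring matching, so "profit" also matches "profits".
        hits = [line for line in lines if term in line.lower()]
        if hits:
            results[name] = hits
    return results


def parallel_concordance(aligned_pairs, term):
    """Return (source, target) pairs whose source sentence contains `term`."""
    term = term.lower()
    return [(src, tgt) for src, tgt in aligned_pairs if term in src.lower()]


# Invented toy corpora, each a list of lines.
corpora = {
    "business": ["Profits rose sharply.", "The contract was renewed."],
    "legal": ["The contract is void.", "The court adjourned."],
}
print(search_corpora(corpora, "contract"))

# Invented English-French sentence pairs, assumed to be pre-aligned.
pairs = [
    ("The company reported a profit.", "La société a annoncé un bénéfice."),
    ("The contract was signed yesterday.", "Le contrat a été signé hier."),
]
for src, tgt in parallel_concordance(pairs, "contract"):
    print(src, "|", tgt)
```

Keying the results by corpus name keeps it visible which collection each hit came from when several corpora are searched at once.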
Target composition of PULB

[Diagram: the PolyU Language Bank at the centre, linked to four groups of corpora. Labels recoverable from the figure, grouped as far as the layout allows:]
Specialised corpora: Business, Legal, Literature, Academic and Workplace texts, labelled by language (English, Chinese, Japanese, French, German), incl. Business English (PUBC)
Spoken corpora: HK spoken corpus, Conference speeches, Social interactions, Academic presentations
Learner corpora: Student work, Teaching reflections, Business writing
General corpora: BROWN, ICE, BNC
Procedures (i)
Collate, sort, categorise data from various sources
Commercially available data
Departmental collections, incl.
PolyU Business Corpus (Li and Bilbow)
Bilingual corpora (Xu)
ESP / EAP corpora (Forey)
Learner corpora (Sengupta)
Procedures (ii)
For the departmental collections:
Decide how to present each collection
E.g. Sub-categories, macro categories
Clean up texts
E.g. Duplications of text samples
E.g. Structural features (headings, typographic features)
E.g. Personal information found in data
To protect anonymity or privacy of authors and speakers
Annotate texts
Provide descriptive information about each corpus
Compiler, time of compilation, type of collection
Provide descriptive information about the texts
Number, size, genre of subtexts
Bibliographic info (written text)
Ethnographic info (spoken data)
Provide structural information for texts if necessary
Mark texts for paragraph boundaries etc.
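Two of the clean-up steps above, removing duplicated text samples and masking personal information, can be sketched as follows. This is an illustration under assumptions, not the project's actual procedure: the function names, the `<EMAIL>` placeholder and the sample texts are invented. Only e-mail addresses are masked; names and other personal details would need further handling.

```python
import re

# A simple e-mail pattern; real data may need a more careful expression.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalise(text):
    """Collapse whitespace and case so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def remove_duplicates(samples):
    """Keep the first copy of each text sample, preserving order."""
    seen = set()
    unique = []
    for sample in samples:
        key = normalise(sample)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def mask_emails(text, placeholder="<EMAIL>"):
    """Replace e-mail addresses with a placeholder to protect privacy."""
    return EMAIL.sub(placeholder, text)


# Invented text samples for illustration.
samples = [
    "Dear Sir,  thank you.",
    "dear sir, thank you.",
    "Please reply to jane.doe@example.com.",
]
cleaned = [mask_emails(s) for s in remove_duplicates(samples)]
print(cleaned)
```

Duplicates are detected on a normalised copy of each sample, while the first original version is the one kept in the corpus.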
Procedures (iii)
Put corpora together on platform; set up search and support facilities:
PULB map
Browse facility
Search and concordance facilities
Tutorial / general information
Transplant PULB onto dept website for use by staff and students
Promote PULB among corpora community
Data provider to data archives / distribution sites, e.g. OLAC; ICAME
Current status
Range of corpora totalling 12M+ words
Individual corpus descriptions
Index of corpora
Simple-to-use built-in concordancer
Available at http://langbank.engl.polyu.edu.hk/
Some of the currently available corpora
PolyU Business Corpus (Eng, Chi, Jap)
BNC Sampler Corpus (Spoken, Written)
Corpus of Multilingual Texts
Corpus of Nursing and Health Science Texts
Learner Corpus of Essays and Reports
HK Bilingual Corpus of Legal and Documentary Texts
...
How you can contribute
Talk to us about your ideas
What would you like to see incorporated into PULB?
In terms of corpora
In terms of search facilities and supplementary information
Can you think of other ways in which PULB can be organised and structured?
How likely are you to make use of PULB in your teaching and research?
Do you have any suggestions for corpus studies based on available or potentially available corpora from PULB?
Do you know of similar projects being undertaken elsewhere that we can learn from?
Talk to us about your collections / corpora
Do you have collections of language data from past research projects that are (or could be) presented as a corpus (corpora)?
Can we help you put your collections to good use?
Can we work together to incorporate your collections into PULB?
Concluding remarks
Corpora represent a valuable but under-exploited resource for teaching and research
PULB aims to bring together various corpora under a single departmental archive, accessible via WWW
You can help us by contributing your ideas and/or your language collections
Please visit and test the PULB website at http://langbank.engl.polyu.edu.hk/ and provide us with feedback using the online evaluation form
Thank you very much
Social grooming
CLOZE
PolyU Business Corpus
Compiled in 1999-2000 (Li & Bilbow)
Multilingual - comparable corpora:
English (c. 1.3 M words)
Chinese (c. 1.2 M words)
Japanese (c. 1.1 M words)
Business texts from: newspapers, government reports, company reports and brochures
Has been used for creating a bilingual English-Chinese business lexicon
PolyU Business Lexicon
Duplication
