Corpus Linguistics
Developing a PolyU Language Bank
Sherman Lee
egslee@inet.polyu.edu.hk
PI: Grahame Bilbow
Thanks to: Chris Greaves, Raymond Cheung, Li Lan
Outline
Background
As an illustration
Current status
Sample corpora
Sample search
Chomskyan linguistics
Langue (competence)
Ideal speaker/hearer
Language = innate mental faculty
Intuitive evidence
Universals
Grammar
Corpus linguistics
Parole (performance)
Complexity/variation
Language = social phenomenon
Empirical evidence
Differences
Meaning
Basic tools
Types of corpora
Written vs Spoken
General vs Specialised
Monolingual vs Multilingual
Written corpora
Specialised corpora
Translation
Exploring meaning, units of meaning
Unambiguous
Units of Meaning
Adj + N
V + N
Adv + A
N + of + N
Case study
friendly fire
environmentally friendly
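The case study above can be explored with a keyword-in-context (KWIC) listing. The following is a minimal sketch only, assuming plain Python and an invented sample text; the function names `kwic` and `neighbours` are hypothetical and do not describe the PULB concordancer itself.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keyword forms a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

def neighbours(text, keyword):
    """Return (previous word, next word) pairs around each occurrence of `keyword`."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return [
        (tokens[i - 1] if i > 0 else None,
         tokens[i + 1] if i + 1 < len(tokens) else None)
        for i, tok in enumerate(tokens)
        if tok == keyword
    ]

# Invented sample text, for illustration only (not taken from any corpus).
sample = (
    "The troops were hit by friendly fire. "
    "The firm sells environmentally friendly products. "
    "She gave a friendly smile."
)

for line in kwic(sample, "friendly"):
    print(line)
```

Reading down the aligned keyword column shows the patterns listed under Units of Meaning: "friendly fire" (Adj + N) and "environmentally friendly" (Adv + Adj).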
Spoken (10%)
10,220,367 words
Written: business, home news, readers' letters, reviews
Search results
Ingredients of PULB
http://clwww.essex.ac.uk/w3c/
Access to corpora (Gutenberg texts, LOB, LOB-tagged)
Web interface for performing searches
Online tutorial and info on corpus linguistics
http://vlc.polyu.edu.hk/concordance/
Access to a variety of corpora and texts (bilingual/parallel corpora, news, Bible, works of fiction)
Web interface for performing searches
~ VLC
~ Essex
[Diagram: corpora by type]
Categories: Specialised corpora, Spoken corpora, Learner corpora, General corpora
Domain and language labels: Business, Legal, Literature; Chinese, English, German, Japanese
Corpus labels: Academic English, English Literature, Business English (PUBC), Workplace English, HK spoken corpus, Conference speeches, Social interactions, Academic presentations, Student work, Teaching reflections, Business writing, BROWN, ICE, BNC
Procedures (i)
Procedures (ii)
Clean up texts
Annotate texts
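The slides do not say what clean-up and annotation involve, so the following is only a rough sketch under stated assumptions: clean-up is taken to mean stripping markup and normalising whitespace, and annotation is taken to mean numbering sentences with `<s>` tags. The function names `clean_text` and `annotate_sentences` and the sample letter are hypothetical.

```python
import html
import re

def clean_text(raw):
    """Strip markup-like tags, decode HTML entities, and normalise whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    decoded = html.unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()

def annotate_sentences(text):
    """Wrap each sentence in a numbered <s> element.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    This naive rule will mis-split abbreviations such as 'Dr. Lee'.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return [f'<s n="{i}">{s}</s>' for i, s in enumerate(sentences, start=1)]

# Invented raw input, for illustration only.
raw = "<p>Dear  Sir,</p>\n<p>Thank you for your letter.  We&#39;ll reply soon.</p>"

cleaned = clean_text(raw)
for line in annotate_sentences(cleaned):
    print(line)
```

A real project would more likely use an established annotation scheme and a part-of-speech tagger; the sentence tags here only stand in for that step.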
Procedures (iii)
PULB map
Browse facility
Search and concordance facilities
Tutorial / general information
Current status
Range of corpora totalling 12M+ words
Individual corpus descriptions
Index of corpora
Simple-to-use built-in concordancer
Available at http://langbank.engl.polyu.edu.hk/
Concluding remarks
Social grooming
CLOZE
Duplication