Discourse Linguistics: Discourse Structure, Text Coherence and Cohesion, Reference Resolution
Dr. Erwin L. Purcia
Post-Doctorate in Quality Management, CEU-Manila
Doctor of Arts in Language and Literature, UEP-Catarman
MA in Education major in English, CKC-Calbayog
BSEd-English, NwSSU-Calbayog
Synchronic Model of Language
Pragmatic
Discourse
Semantic
Syntactic
Lexical
Morphological
Phonetic
Discourse Linguistics
Discourse Segmentation
• Documents are automatically separated into passages,
sometimes called fragments, which are different discourse
segments
• Techniques to separate documents into passages include
– Rule-based systems based on clue words and phrases
– Probabilistic techniques to separate fragments and to identify
discourse segments (Oddy)
– TextTiling algorithm uses cohesion to identify segments, assuming
that each segment exhibits lexical cohesion within the segment, but
is not cohesive across different segments
• Lexical cohesion score – average similarity of words within a
segment
• Identify boundaries by the difference of cohesion scores
• NLTK provides a TextTiling implementation (TextTilingTokenizer)
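The cohesion-based idea above can be sketched in a few lines of Python. This is a simplified illustration, not the TextTiling algorithm itself and not NLTK's implementation: it works on whole sentences, uses cosine similarity of word counts as the cohesion score, measures valley depth against the highest score on each side, and uses an arbitrary cutoff. The stopword list, the function names and the example sentences are all assumptions made for the sketch.

```python
import math
import re
import statistics
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "on", "from", "as", "when", "may", "across"}


def bag_of_words(sentence):
    """Count the content words of one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_scores(sentences, window=2):
    """Cohesion between the `window` sentences before and after each gap."""
    bags = [bag_of_words(s) for s in sentences]
    scores = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        scores.append(cosine(left, right))
    return scores


def find_boundaries(sentences, window=2):
    """Return indices of sentences that start a new segment."""
    scores = gap_scores(sentences, window)
    if not scores:
        return []
    # Depth of each valley relative to the highest score on either side
    # (a simplification of the peak-to-valley difference).
    depths = [
        (max(scores[: i + 1]) - s) + (max(scores[i:]) - s)
        for i, s in enumerate(scores)
    ]
    # Arbitrary cutoff chosen for this small example.
    cutoff = statistics.mean(depths) + statistics.pstdev(depths) / 2
    # scores[i] describes the gap just before sentence i + 1.
    return [i + 1 for i, d in enumerate(depths) if d > cutoff]


sentences = [
    "The cat sat on the mat.",
    "The cat chased a mouse across the mat.",
    "The mouse hid from the cat.",
    "Stock markets fell sharply today.",
    "Investors sold stock as markets fell.",
    "Markets may recover when investors return.",
]
print(find_boundaries(sentences))
```

The first three sentences share vocabulary with each other, as do the last three, so the cohesion score drops to zero at the gap between them and that gap is reported as a boundary.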
Cohesion – Surface Level Ties
• “A piece of text is intended and is perceived as more than a
simple sequencing of independent sentences.”
• Therefore, a text will exhibit unity / texture
• on the surface level (cohesion)
• at the meaning level (coherence)
• Halliday & Hasan’s Cohesion in English (1976)
• Sets forth the linguistic devices that are available in the
English language for creating this unity / texture
• Identifies the features in a text that contribute to an
intelligent comprehension of the text
• Important for language generation: cohesive devices help produce
natural-sounding texts
Cohesive Relations
• Define dependencies between sentences in text.
“He said so.”
• “He” and “so” presuppose elements in the
preceding text for their understanding
• This presupposition and the presence of information
elsewhere in text to resolve this presupposition provide
COHESION
- Part of the discourse-forming component of the linguistic
system
- Provides the means whereby structurally unrelated
elements are linked together
Six Types of Cohesive Ties
• Grammatical
– Reference
– Substitution
– Ellipsis
– Conjunction
• Lexical
– Reiteration
– Collocation
• (In practice, there is overlap; some examples can
show more than one type of cohesion.)
1. Reference
- items in a language which, rather than being interpreted in
their own right, make reference to something else for
their interpretation.
“Doctor Foster went to Gloucester in a shower of rain. He stepped in a
puddle right up to his middle and never went there again.”
Types of Reference
• exophora [situational: refers to things outside of the text; not part of cohesion]
• endophora [textual] (coreference resolution)
– anaphora [refers to preceding text]
– cataphora [refers to following text]
2. Substitution:
- a substituted item that serves the same structural function as
the item for which it is substituted.
Nominal – one, ones, same
Verbal – do
Clausal – so, not
- These biscuits are stale. Get some fresh ones.
- Person 1 – I’ll have two poached eggs on toast, please.
Person 2 – I’ll have the same.
- The words did not come the same as they used to do. I don’t
know the meaning of half those long words, and what’s
more, don’t believe you do either, said Alice.
3. Ellipsis
- Very similar to substitution in principle; embodies the same relation
between parts of a text
- Something is left unsaid but understood nonetheless; ellipsis covers
only a limited subset of such instances
• Smith was the first person to leave. I was the second.
• Joan brought some carnations and Catherine some sweet peas.
• Who is responsible for sales in the Northeast? I believe Peter Martin is.
4. Conjunction
- A different kind of cohesive relation, in that it doesn’t require us
to understand some other part of the text to understand the
meaning
- Rather, it specifies the way the text that follows is
systematically connected to what has preceded
For the whole day he climbed up the steep mountainside,
almost without stopping.
And in all this time he met no one.
Yet he was hardly aware of being tired.
So by night the valley was far below him.
Then, as dusk fell, he sat down to rest.
Now, 2 types of Lexical Cohesion
- Lexical cohesion is concerned with cohesive effects
achieved by selection of vocabulary
5. Reiteration continuum –
I attempted an ascent of the peak. ___X___ was easy.
- same lexical item – the ascent
- synonym – the climb
- super-ordinate term – the task
- general noun – the act
- pronoun - it
6. Collocations
- Lexical cohesion achieved through the association of
semantically related lexical items
- Accounts for any pair of lexical items that exist in some
lexico-semantic relationship, e. g.
- complementaries
boy / girl
stand-up / sit-down
- antonyms
wet / dry
crowded / deserted
- converses
order / obey
give / take
Collocations (cont’d)
- pairs from ordered series
Tuesday / Thursday
sunrise / sunset
- part-whole
brake / car
lid / box
- co-hyponyms of the same superordinate
chair / table (furniture)
Uses of Cohesion Theory
1. Halliday & Hasan’s theory has been captured in
a coding scheme
• used to quantitatively measure the extent of cohesion
in a text.
• ETS has experimented with it as a metric in grading
standardized test essays.
2. When building a semantic representation of a text, the
theory suggests how the system can recognize relations
between entities.
- indicates what is related
- suggests how they are related
3. Provides guidance to a NL Generation system so that the
system can produce naturally cohesive text.
4. Delineates (for English) how the cohesive features of the
language can be recognized and utilized by a Machine
Translation system.
Lexical Chains
• Building lexical chains is one way to find the lexical
cohesion structure of a text, both reiteration and collocation.
• A lexical chain is a sequence of semantically related words
from the text
• Algorithm sketch:
– Select a set of candidate words
– For each candidate word, find an appropriate chain relying on a
“relatedness” measure among members of chains
– If it is found, insert the word into the chain.
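The algorithm sketch above can be written as a short greedy procedure. This is a minimal sketch under stated assumptions: the relatedness measure is a hand-made toy table (RELATED_GROUPS is hypothetical; a real system would more plausibly consult a thesaurus or a resource such as WordNet), and a word that fits no existing chain starts a new chain, a step the slide leaves implicit.

```python
# Hypothetical toy relatedness resource, invented for this example.
RELATED_GROUPS = [
    {"car", "brake", "door", "garage"},
    {"bank", "money", "loan", "investor"},
]


def related(word1, word2):
    """Toy relatedness measure covering reiteration and collocation."""
    if word1 == word2:
        return True  # reiteration: the same lexical item repeated
    # collocation: both words belong to the same hand-made group
    return any(word1 in group and word2 in group for group in RELATED_GROUPS)


def build_chains(candidates):
    """Greedily insert each candidate word into the first related chain."""
    chains = []
    for word in candidates:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)
                break
        else:
            # No appropriate chain was found (assumption: start a new one).
            chains.append([word])
    return chains


words = ["car", "bank", "brake", "money", "car", "peak"]
for chain in build_chains(words):
    print(chain)
```

Taking the first matching chain keeps the sketch short; choosing the best-scoring chain instead would need a graded relatedness measure rather than a yes/no test.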
Coherence Relations – Semantic Meaning Ties
• The set of possible relations between the meanings of
different utterances in the text
• Hobbs (1979) suggests relations such as
– Result: state in first sentence could cause the state in a second
sentence
– Explanation: the state in the second sentence could cause the
first
John hid Bill’s car keys. He was drunk.
– Parallel: The states asserted by two sentences are
similar
The Scarecrow wanted some brains. The Tin Woodsman wanted a
heart.
– Elaboration: Infer the same assertion from the two sentences.
• Textual Entailment
– NLP task that can be used to discover result and elaboration
relations between two sentences.
Anaphora / Reference Resolution
• One of the most important NLP tasks for cohesion at the
discourse level
• A linguistic phenomenon of abbreviated subsequent
reference
– A cohesive tie of the grammatical and lexical
types
• Includes reference, substitution and reiteration
• 2 levels of resolution:
– within a document (co-reference resolution)
• e.g. Bin Ladin = he
• his followers = they
• terrorist attacks = they
• the Federal Bureau of Investigation = FBI = F.B.I.
– across documents (or named entity resolution)
• e.g. maverick Saudi Arabian multimillionaire = Usama Bin
Ladin = Bin Ladin
• Event resolution is also possible, but not widely used
Examples from Contexts
1. The State Department renewed its appeal for Bin Laden on
Monday and warned of possible fresh attacks by his followers against U.S.
targets.
…
2. One early target of the F.B.I.’s Budapest office is expected to be
Semyon Y. Mogilevich, a Russian citizen who has operated out of
Budapest for a decade. Recently he has been linked to the growing
money-laundering investigation in the United States involving the Bank of
New York. Mr. Mogilevich is also the target of a separate money
laundering and financial fraud investigation by the F.B.I. in Philadelphia,
according to federal officials.
…
3. The F.B.I. will also have the final say over the hiring and firing of the
10 Hungarian agents who will work in the office, alongside five
American agents. The bureau has long had agents posted in American
embassies
Glossary of Terminology
– The German authorities[3](b) said a Colombian[4] who had lived for a long
time in the Ukraine[5](c) flew in from Kiev. He had 300 grams of
plutonium 239[6] in his baggage. The suspected smuggler[4](a) denied that
the materials[6](a) were his.
Pronominalization
• Pronouns refer to entities that were introduced fairly recently,
1-4-5-10(?) sentences back.
– Nominative (he, she, it, they, etc.)
• e.g. The German authorities said a Colombian[1] who had lived for a
long time in the Ukraine flew in from Kiev. He[1] had 300 grams of
plutonium 239 in his baggage.
– Oblique (him, her, them, etc.)
• e.g. Undercover investigators negotiated with three members of a
criminal group[2] and arrested them[2] after receiving the first
shipment.
– Possessive (his, her, their, etc. + hers, theirs, etc.)
• e.g. He[3] had 300 grams of plutonium 239 in his[3] baggage. The
suspected smuggler[3]* denied that the materials were his[3]. (*chain)
– Reflexive (himself, themselves, etc.)
• e.g. There appears to be a growing problem of disaffected loners[4]
who cut themselves[4] off from all groups.
Indefinite noun phrases – a X, or an X
• Typically, an indefinite noun phrase introduces a new entity
into the discourse and would not be used as a referring
phrase to something else
– The exception is in the case of cataphora:
A Soviet pop star was killed at a concert in Moscow last night.
Demonstratives – this and that
• Demonstrative pronouns can either appear alone or as
determiners
this ingredient, that spice
• Noun phrases with demonstrative determiners are ambiguous
– They can be indefinite
I saw this beautiful car today.
– Or they can be definite
I just bought a copy of Thoreau’s Walden. I had bought one five
years ago. That one had been very tattered; this one was in much
better condition.
Names
• Names can occur in many forms, sometimes called name
variants.
Victoria Chen, Chief Financial Officer of Megabucks Banking Corp.
since 2004, saw her pay jump 20% as the 37-year-old also became the
Denver-based financial-services company’s president. Megabucks
expanded recently . . . MBC . . .
– (Victoria Chen, Chief Financial Officer, her, the 37-year-old, the Denver-based
financial-services company’s president)
– (Megabucks Banking Corp. , the Denver-based financial-services company,
Megabucks, MBC )
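Some name variants can be linked with simple string heuristics. The sketch below is only an illustration of that idea under stated assumptions: the function names are invented for this example, and only two heuristics are covered (acronyms built from capitalized words, and shortened forms that drop leading or trailing words). Descriptive variants such as "the Denver-based financial-services company" need other evidence.

```python
import re


def initials(name):
    """First letters of the capitalized words in a name."""
    words = re.findall(r"[A-Za-z]+", name)
    return "".join(w[0] for w in words if w[0].isupper())


def is_acronym_of(short, full):
    """True if `short` (e.g. 'MBC' or 'F.B.I.') spells the initials of `full`."""
    letters = re.sub(r"[^A-Za-z]", "", short)
    return len(letters) > 1 and letters.isupper() and letters == initials(full)


def is_shortened_form(short, full):
    """True if `short` is a proper prefix or suffix of the words in `full`."""
    s, f = short.split(), full.split()
    if not 0 < len(s) < len(f):
        return False
    return f[:len(s)] == s or f[-len(s):] == s


def same_entity(a, b):
    """Heuristic pairwise test for two name variants of one entity."""
    return (
        a == b
        or is_acronym_of(a, b) or is_acronym_of(b, a)
        or is_shortened_form(a, b) or is_shortened_form(b, a)
    )


print(same_entity("MBC", "Megabucks Banking Corp."))
print(same_entity("Megabucks", "Megabucks Banking Corp."))
print(same_entity("F.B.I.", "the Federal Bureau of Investigation"))
print(same_entity("Megabucks", "MBC"))
```

The last pair is not matched directly, which shows why variants are usually collected into chains: "Megabucks" and "MBC" are linked only through the full name that both of them match.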
Unusual Cases
• Compound phrases
John and Mary got engaged. They make a cute couple.
John and Mary went home. She was tired.
• Singular nouns with a plural meaning
The focus group met for several hours. They were very
intent.
• Part/whole relationships
John bought a new car. A door was dented.
Approach to coreference resolution
• Naively identify all referring phrases for
resolution:
– all Pronouns
– all definite NPs
– all Proper Nouns
• Filter things that look referential but, in fact, are not
– e.g. geographic names, the United States
– pleonastic “it”, e.g. it’s 3:45 p.m., it was cold
– non-referential “it”, “they”, “there”
• e.g. it was essential, it is important, it is understood,
• they say,
• there seems to be a mistake
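A rule-based filter for these non-referential uses can be sketched with a few regular expressions. The pattern list below is a small hand-written set covering only the examples on this slide; it is an assumption made for illustration, not a complete or standard inventory.

```python
import re

# Illustrative patterns only; real filters use much larger rule sets.
NON_REFERENTIAL_PATTERNS = [
    # time expressions: "it's 3:45 p.m."
    r"\bit(?:'s|’s| is| was)\s+\d{1,2}:\d{2}\b",
    # weather: "it was cold"
    r"\bit(?:'s|’s| is| was)\s+(?:cold|hot|warm|raining|snowing)\b",
    # evaluative or impersonal constructions: "it was essential", "it is understood"
    r"\bit(?:'s|’s| is| was)\s+(?:essential|important|understood)\b",
    # generic "they": "they say"
    r"\bthey say\b",
    # existential "there": "there seems to be a mistake"
    r"\bthere\s+(?:seems?|appears?)\s+to\s+be\b",
]


def is_non_referential(clause):
    """True if the clause matches one of the non-referential patterns."""
    return any(
        re.search(pattern, clause, flags=re.IGNORECASE)
        for pattern in NON_REFERENTIAL_PATTERNS
    )


for clause in [
    "It's 3:45 p.m.",
    "It was cold",
    "There seems to be a mistake",
    "He parked it in the garage",
]:
    print(clause, "->", is_non_referential(clause))
```

Clauses flagged by the filter are dropped before resolution, so that "it", "they" and "there" in these constructions are never matched against referent candidates.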
Identify Referent Candidates
– All noun phrases (both indef. and def.) are considered potential
referent candidates.
– A referring phrase can also be a referent for a subsequent referring
phrase.
• Example: (omitted sentence with name of suspect)
He had 300 grams of plutonium 239 in his baggage. The
suspected smuggler denied that the materials were his.
(chain of 4 referring phrases)
– All potential candidates are collected in a table that records feature
info on each candidate.
– Problems:
• chunking
– e.g. the Chase Manhattan Bank of New York
• nesting of NPs
Features
• Define features between a referring phrase and each candidate
– Number agreement: plural, singular or neutral
• He, she, it, etc. are singular, while we, us, they, them, etc. are
plural and should match with singular or plural nouns, respectively
• Exceptions: some plural or group nouns can be referred to by
either it or they
IBM announced a new product. They have been working on
it …
– Gender agreement:
• Generally animate objects are referred to by either male pronouns
(he, his) or female pronouns (she, hers)
• Inanimate objects take neutral (it) gender
– Person agreement:
• First and second person pronouns are “I” and “you”
• Third person pronouns must be used with nouns
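The number and gender checks can be sketched as a compatibility test between a pronoun and a candidate. This is a minimal sketch under assumptions: the Mention class, the feature codes and the pronoun table are invented for illustration, only third-person pronouns are covered (so person agreement is implicit), and a group noun such as "IBM" is given the number value "either" so that both "it" and "they" can refer to it.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    number: str  # "sg", "pl", or "either" (group nouns)
    gender: str  # "m", "f", "n", or "unknown"


# Third-person pronouns only; (number, gender) codes are illustrative.
PRONOUN_FEATURES = {
    "he": ("sg", "m"), "him": ("sg", "m"), "his": ("sg", "m"),
    "she": ("sg", "f"), "her": ("sg", "f"), "hers": ("sg", "f"),
    "it": ("sg", "n"), "its": ("sg", "n"),
    "they": ("pl", "unknown"), "them": ("pl", "unknown"),
    "their": ("pl", "unknown"),
}


def compatible(value1, value2, wildcard):
    """Two feature values agree if equal or if either is the wildcard."""
    return value1 == value2 or wildcard in (value1, value2)


def agrees(pronoun, candidate):
    """Number and gender agreement between a pronoun and a candidate."""
    number, gender = PRONOUN_FEATURES[pronoun.lower()]
    return (
        compatible(number, candidate.number, "either")
        and compatible(gender, candidate.gender, "unknown")
    )


ibm = Mention("IBM", number="either", gender="n")
john = Mention("John", number="sg", gender="m")
print(agrees("They", ibm), agrees("it", ibm), agrees("he", ibm))
print(agrees("he", john), agrees("she", john), agrees("they", john))
```

Agreement works as a filter: candidates that fail it are removed, and the remaining candidates are ranked by other features.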
More Features
• Binding constraints
– Reflexive pronouns (himself, themselves) have constraints on which
nouns in the same sentence can be referred to:
John bought himself a new Ford. (John = himself)
John bought him a new Ford. (John cannot = him)
• Recency
– Entities situated closer to the referring phrase tend to be more salient
than those further away
• Pronouns usually cannot refer back more than a few sentences
• Grammatical role / Hobbs distance
– Entities in subject position are more likely referents than those in
object position
Even more features
• Repeated mention
– Entities that have been the focus of the discourse are more likely to
be salient for a referring phrase
• Parallelism
– There are strong preferences introduced by parallel constructs
Long John Silver went with Jim. Billy Bones went with him.
(him = Jim)
• Verb Semantics and selectional restrictions
– Certain verbs take certain types of arguments and may prejudice the
resolution of pronouns
John parked his car in the garage after driving it around for hours.
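Preference features such as recency, grammatical role and repeated mention can be combined into a single salience score used to rank candidates. The sketch below is one hypothetical way to do this: the Candidate class, the weights and the three-sentence window are arbitrary choices for illustration, not values from any published system. It assumes agreement and binding filters have already been applied.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    sentence_index: int  # sentence in which the candidate occurs
    role: str            # "subject", "object", or "other"
    mentions: int        # times the entity has been mentioned so far


# Arbitrary illustrative weights.
ROLE_WEIGHT = {"subject": 2.0, "object": 1.0, "other": 0.0}


def salience(candidate, pronoun_sentence, max_distance=3):
    """Score a candidate, or return None if it is outside the window."""
    distance = pronoun_sentence - candidate.sentence_index
    if distance < 0 or distance > max_distance:
        return None  # pronouns rarely reach back more than a few sentences
    recency = 3.0 / (1 + distance)                # closer is more salient
    role = ROLE_WEIGHT[candidate.role]            # subjects are preferred
    repetition = 0.5 * (candidate.mentions - 1)   # repeated mention
    return recency + role + repetition


def resolve(candidates, pronoun_sentence):
    """Return the highest-scoring candidate, or None if none qualifies."""
    scored = []
    for candidate in candidates:
        score = salience(candidate, pronoun_sentence)
        if score is not None:
            scored.append((score, candidate))
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


candidates = [
    Candidate("a Colombian", 0, "subject", 2),
    Candidate("the Ukraine", 0, "other", 1),
    Candidate("Kiev", 0, "other", 1),
]
best = resolve(candidates, pronoun_sentence=1)
print(best.text if best else None)
```

Parallelism and selectional restrictions are not modeled here; they would enter as additional terms in the score or as further filters on the candidate list.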
Example: rules to assign gender info
Summary of Discourse Level Tasks
• Most widely used task is coreference resolution
– Important in many other text analysis tasks in order to understand
meaning of sentences
• Dialogue structure is also part of discourse analysis and will
be considered separately (next time)
• Document structure
– Recognizing known structure, for example, abstracts
– Separating documents according to known structure
• Named entity resolution across documents
• Using cohesive elements in language generation and
machine translation