…relations between coreferent mentions of entities in a document. A popular approach involves graph-based methods, which have the advantage of naturally modelling inter-sentence relations (Peng et al., 2017; Song et al., 2018; Christopoulou et al., 2019; Nan et al., 2020; Minh Tran et al., 2020). However, like all pipeline-based approaches, these methods assume that the entities within the text are known. As previous work has demonstrated, and as we show in §5.2, jointly learning to extract entities and relations can improve performance (Miwa and Sasaki, 2014; Miwa and Bansal, 2016; Gupta et al., 2016; Li et al., 2016a, 2017; Nguyen and Verspoor, 2019a; Yu et al., 2020) and may be more efficient due to shared parameters and training steps. Existing end-to-end methods typically combine task-specific components for entity detection, coreference resolution, and relation extraction that are trained jointly. Most approaches are restricted to intra-sentence RE (Bekoulis et al., 2018; Luan et al., 2018; Nguyen and Verspoor, 2019b; Wadden et al., 2019; Giorgi et al., 2019) and have only recently been extended to DocRE (Eberts and Ulges, 2021). However, they still focus on binary relations. Ideally, DocRE methods would be capable of modelling the complexities mentioned above without strictly requiring entities to be known.

A less popular end-to-end approach is to frame RE as a generative task with sequence-to-sequence (seq2seq) learning (Sutskever et al., 2014). This framing simplifies RE by removing the need for task-specific components and explicit negative training examples, i.e. pairs of entities that do not express a relation. If the information to extract is appropriately linearized to a string, seq2seq methods are flexible enough to model all complexities discussed thus far. However, existing work stops short, focusing on intra-sentence binary relations (Zeng et al., 2018; Zhang et al., 2020; Nayak and Ng, 2020; Zeng et al., 2020). In this paper, we extend work on seq2seq methods for RE to the document level, with several important contributions:

• We propose a novel linearization schema that can handle complexities overlooked by previous seq2seq approaches, like coreferent mentions and n-ary relations (§3.1).

• Using this linearization schema, we demonstrate that a seq2seq approach is able to learn the subtasks of DocRE (entity extraction, coreference resolution and relation extraction) jointly, and report the first end-to-end results on several popular biomedical datasets (§5.1).

• We devise a simple strategy, referred to as "entity hinting" (§3.3), to compare our model to existing pipeline-based approaches, in some cases exceeding their performance (§5.1).

2 Task definition: document-level relation extraction

Given a source document of S tokens, a model must extract all tuples corresponding to a relation, R, expressed between the entities, E, in the document, (E1, ..., En, R), where n is the number of participating entities, or arity, of the relation. Each entity Ei is represented as the set of its coreferent mentions {e_i^j} in the document, which are often expressed as aliases, abbreviations or acronyms. All entities appearing in a tuple have at least one mention in the document. The mentions that express a given relation are not necessarily contained within the same sentence.
Figure 2: A sequence-to-sequence model for document-level relation extraction. Special tokens are generated by the decoder. Entity mentions are copied from the input via a copy mechanism (not shown). Decoding is initiated by a @START@ token and terminated when the model generates the @END@ token. Attention connections shown only for the second timestep to reduce clutter. CID: chemical-induced disease.
Commonly, E is assumed to be known and provided as input to a model. We will refer to these methods as "pipeline-based". In this paper, we are primarily concerned with the situation where E is not given and must be predicted by a model, which we will refer to as "end-to-end".

3 Our approach: seq2rel

3.1 Linearization

To use seq2seq learning for RE, the information to be extracted must be linearized to a string. This linearization should be expressive enough to model the complexities of entity and relation extraction without being overly verbose. We propose the following schema, illustrated with an example:

X: Variants in the estrogen receptor alpha (ESR1) gene and its mRNA contribute to risk for schizophrenia.
Y: estrogen receptor alpha ; ESR1 @GENE@ schizophrenia @DISEASE@ @GDA@

The input text X expresses a gene-disease association (GDA) between ESR1 and schizophrenia. In the corresponding target string Y, each relation begins with its constituent entities. A semicolon (;) separates coreferent mentions, and entities are terminated with a special token denoting their type (e.g. @GENE@). Similarly, relations are terminated with a special token denoting their type (e.g. @GDA@). This schema can be used to model various complexities, like coreferent entity mentions and n-ary relations.
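To make the schema concrete, the following is a minimal sketch of the linearization in code; the data structures and the linearize helper are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the linearization schema described above. The input
# format (entities as lists of coreferent mention strings, plus special type
# tokens) is an assumption for illustration.

def linearize(relations):
    """Linearize a list of (entities, relation_token) pairs to a target string.

    Each entity is a (mentions, entity_token) pair: coreferent mentions are
    joined with ";" and terminated by the entity's type token; the relation
    token terminates the full tuple.
    """
    parts = []
    for entities, relation_token in relations:
        for mentions, entity_token in entities:
            parts.append(" ; ".join(mentions))
            parts.append(entity_token)
        parts.append(relation_token)
    return " ".join(parts)

relations = [
    ([(["estrogen receptor alpha", "ESR1"], "@GENE@"),
      (["schizophrenia"], "@DISEASE@")], "@GDA@"),
]
print(linearize(relations))
# estrogen receptor alpha ; ESR1 @GENE@ schizophrenia @DISEASE@ @GDA@
```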
3.2 Model

The model follows a canonical seq2seq setup. An encoder maps each token in the input to a contextual embedding. An autoregressive decoder generates an output, token-by-token, attending to the outputs of the encoder at each timestep (Figure 2). Decoding proceeds until a special "end-of-sequence" token (@END@) is generated, or a maximum number of tokens have been generated. Formally, X is the source sequence of length S, which is some text we would like to extract relations from. Y is the corresponding target sequence of length T, a linearization of the relations contained in the source. We model the conditional probability

    p(Y \mid X) = \prod_{t=1}^{T} p(y_t \mid X, y_{<t})    (1)

During training, we optimize over the model parameters θ the sequence cross-entropy loss

    \ell(\theta) = -\sum_{t=1}^{T} \log p(y_t \mid X, y_{<t}; \theta)    (2)
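As a concrete illustration, the loss in Eq. (2) is a standard token-level cross-entropy over the decoder's outputs. A PyTorch sketch follows, where the tensor shapes and padding id are assumptions:

```python
import torch.nn.functional as F

def sequence_cross_entropy(logits, targets, pad_id=0):
    """logits: (batch, T, vocab) decoder scores; targets: (batch, T) gold ids."""
    # Sums -log p(y_t | X, y_<t) over every non-padding target position,
    # i.e. Eq. (2) summed across the batch.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
        reduction="sum",
    )
```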
Method                            |      CDR       |      GDA
                                  |  P    R    F1  |  P    R    F1
Christopoulou et al. (2019)       | 62.1 65.2 63.6 |  –    –   81.5
Nan et al. (2020)                 |  –    –   64.8 |  –    –   82.2
Minh Tran et al. (2020)           |  –    –   66.1 |  –    –   82.8
Lai and Lu (2021)                 | 64.9 67.1 66.0 |  –    –    –
Xu et al. (2021)                  |  –    –   68.7 |  –    –   83.7
Zhou et al. (2021)                |  –    –   69.4 |  –    –   83.9
seq2rel (entity hinting)          | 68.2 66.2 67.2 | 84.4 85.3 84.9
seq2rel (entity hinting, relaxed) | 68.2 66.2 67.2 | 84.5 85.4 85.0
seq2rel (end-to-end)              | 43.5 37.5 40.2 | 55.0 55.4 55.2
seq2rel (end-to-end, relaxed)     | 56.6 48.8 52.4 | 70.3 70.8 70.5

Figure 3: Effect of training set size on performance. Performance is reported as the median micro F1-score obtained over five runs with different random seeds on the CDR and GDA validation sets, with and without entity hinting. Error bands correspond to the standard deviation over the five runs. The absolute number of training examples is displayed for each corpus. Some labels are excluded to reduce clutter.
John Giorgi, Xindi Wang, Nicola Sahar, Won Young Shin, Gary D Bader, and Bo Wang. 2019. End-to-end named entity recognition and relation extraction using pre-trained language models. ArXiv preprint, abs/1912.13415.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. ArXiv preprint, abs/1211.3711.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016a. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

Robin Jia, Cliff Wong, and Hoifung Poon. 2019. Document-level n-ary relation extraction with multiscale representation learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3693–3704, Minneapolis, Minnesota. Association for Computational Linguistics.

Jin-Dong Kim, T. Ohta, Yuka Tateisi, and Jun'ichi Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19 Suppl 1:i180–2.

Po-Ting Lai and Zhiyong Lu. 2021. BERT-GT: Cross-sentence n-ary relation extraction with BERT and graph transformer. Bioinformatics.

Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3219–3232, Brussels, Belgium. Association for Computational Linguistics.

Hieu Minh Tran, Minh Trung Nguyen, and Thien Huu Nguyen. 2020. The dots have their values: Exploiting the node-edge connections in graph-based neural models for document-level relation extraction. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4561–4567, Online. Association for Computational Linguistics.

Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1105–1116, Berlin, Germany. Association for Computational Linguistics.

Makoto Miwa and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1858–1869, Doha, Qatar. Association for Computational Linguistics.

Guoshun Nan, Zhijiang Guo, Ivan Sekulic, and Wei Lu. 2020. Reasoning with latent structure refinement for document-level relation extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1546–1557, Online. Association for Computational Linguistics.

Tapas Nayak and Hwee Tou Ng. 2020. Effective modeling of encoder-decoder architecture for joint entity and relation extraction. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 8528–8535. AAAI Press.

Dat Quoc Nguyen and Karin Verspoor. 2019a. End-to-end neural relation extraction using deep biaffine attention. In Advances in Information Retrieval, pages 729–738, Cham. Springer International Publishing.

Dat Quoc Nguyen and Karin Verspoor. 2019b. End-to-end neural relation extraction using deep biaffine attention. In European Conference on Information Retrieval, pages 729–738. Springer.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.

Long N Phan, James T Anibal, Hieu Tran, Shaurya Chanana, Erol Bahadroglu, Alec Peltekian, and Grégoire Altan-Bonnet. 2021. SciFive: a text-to-text transformer model for biomedical literature. ArXiv preprint, abs/2106.03598.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. N-ary relation extraction using graph-state LSTM. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2226–2235, Brussels, Belgium. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Dianbo Sui, Yubo Chen, Kang Liu, Jun Zhao, Xiangrong Zeng, and Shengping Liu. 2020. Joint entity and relation extraction with set prediction networks. ArXiv preprint, abs/2011.01675.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2021. Long Range Arena: A benchmark for efficient transformers. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.

Patrick Verga, Emma Strubell, and Andrew McCallum. 2018. Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 872–884, New Orleans, Louisiana. Association for Computational Linguistics.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2016. Order matters: Sequence to sequence for sets. In 4th International Conference on Learning Representations, ICLR 2016.

David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. 2019. Entity, relation, and event extraction with contextualized span representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5784–5789, Hong Kong, China. Association for Computational Linguistics.

Li Wan, Matthew D. Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. 2013. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 1058–1066. JMLR.org.

Yucheng Wang, Bowen Yu, Hongsong Zhu, Tingwen Liu, Nan Yu, and Limin Sun. 2021. Discontinuous named entity recognition as maximal clique discovery. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 764–774, Online. Association for Computational Linguistics.

Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. 2013. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research, 41(W1):W518–W522.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Ye Wu, Ruibang Luo, Henry C. M. Leung, Hing-Fung Ting, and Tak-Wah Lam. 2019. RENET: A deep learning approach for extracting gene-disease associations from literature. In Research in Computational Molecular Biology, pages 272–284, Cham. Springer International Publishing.

Benfeng Xu, Quan Wang, Yajuan Lyu, Yong Zhu, and Zhendong Mao. 2021. Entity structure within and throughout: Modeling mention dependencies for document-level relation extraction. In AAAI.

Pengcheng Yang, Fuli Luo, Shuming Ma, Junyang Lin, and Xu Sun. 2019. A deep reinforced sequence-to-set model for multi-label classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5252–5258, Florence, Italy. Association for Computational Linguistics.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 764–777, Florence, Italy. Association for Computational Linguistics.

Bowen Yu, Zhenyu Zhang, Xiaobo Shu, Tingwen Liu, Yubin Wang, Bin Wang, and Sujian Li. 2020. Joint extraction of entities and relations based on a novel decomposition strategy. In ECAI 2020, pages 2282–2289. IOS Press.

Daojian Zeng, Haoran Zhang, and Qianying Liu. 2020. CopyMTL: Copy mechanism for joint extraction of entities and relations with multi-task learning. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 9507–9514. AAAI Press.

Xiangrong Zeng, Shizhu He, Daojian Zeng, Kang Liu, Shengping Liu, and Jun Zhao. 2019. Learning the extraction order of multiple relational facts in a sentence with reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 367–377, Hong Kong, China. Association for Computational Linguistics.

Xiangrong Zeng, Daojian Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 506–514, Melbourne, Australia. Association for Computational Linguistics.

Ranran Haoran Zhang, Qianying Liu, Aysa Xuemo Fan, Heng Ji, Daojian Zeng, Fei Cheng, Daisuke Kawahara, and Sadao Kurohashi. 2020. Minimize exposure bias of Seq2Seq models in joint entity and relation extraction. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 236–246, Online. Association for Computational Linguistics.

Sheng Zhang, Cliff Wong, Naoto Usuyama, Sarthak Jain, Tristan Naumann, and Hoifung Poon. 2021a. Modular self-supervision for document-level relation extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5291–5302, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. 2021b. Revisiting few-sample BERT fine-tuning. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net.

Deyu Zhou, Dayou Zhong, and Yulan He. 2014. Biomedical relation extraction: from binary to complex. Computational and Mathematical Methods in Medicine, 2014.

Wenxuan Zhou and Muhao Chen. 2021. An improved baseline for sentence-level relation extraction. ArXiv preprint, abs/2102.01373.

Wenxuan Zhou, Kevin Huang, Tengyu Ma, and Jing Huang. 2021. Document-level relation extraction with adaptive thresholding and localized context pooling. In AAAI.
A Constrained decoding

In Figure 4, we illustrate the rules used to constrain decoding. At each timestep t, given the prediction of the previous timestep t − 1, the predicted class probabilities of tokens that would generate a syntactically invalid target string are set to a tiny value. In practice, we found that a model rarely generates invalid target strings, so these constraints have little effect on final performance (see §3.2.3 and §5.3).
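A minimal sketch of this masking step, assuming a hypothetical valid_next_ids lookup that encodes the transition rules illustrated in Figure 4:

```python
import torch

def constrain_logits(logits, prev_token_id, valid_next_ids, tiny=-1e32):
    """logits: (vocab_size,) decoder scores for the current timestep t."""
    # Tokens that would produce a syntactically invalid target string keep a
    # tiny log probability, so they are effectively never selected.
    constrained = torch.full_like(logits, tiny)
    allowed = valid_next_ids(prev_token_id)  # rule table from Figure 4
    constrained[allowed] = logits[allowed]
    return constrained
```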
B Details about dataset annotations

In Table 6, we list which complexities (e.g. nested & discontinuous mentions, n-ary relations) are contained within each dataset used in our evaluations. We also report the fraction of relations in the test set that are inter-sentence. We consider a relation intra-sentence if any sentence in the document contains at least one mention of each entity in the relation, and inter-sentence otherwise. This produces an estimate that matches previously reported numbers for CDR (∼30%). In Yao et al. (2019), the fraction of inter-sentence relations in DocRED is reported as ∼40.7%. We can reproduce this value if we consider relations intra-sentence when all mentions of an entity exist within a single sentence, and inter-sentence otherwise.
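A sketch of this criterion, assuming entities are given as collections of mention strings and using naive substring matching for illustration:

```python
# A relation is intra-sentence if any single sentence contains at least one
# mention of every participating entity; otherwise it is inter-sentence.
# Data structures are illustrative assumptions, not the paper's code.

def is_intra_sentence(relation_entities, sentences):
    """relation_entities: list of entities, each a set of mention strings.
    sentences: list of sentence strings from the document."""
    for sentence in sentences:
        if all(any(m in sentence for m in entity) for entity in relation_entities):
            return True
    return False
```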
C Hypernym filtering

The CDR dataset is annotated for chemical-induced disease (CID) relationships between the most specific chemical and disease mentions in an abstract. Take the following example from the corpus:

    Carbamazepine-induced cardiac dysfunction [...] A patient with sinus bradycardia and atrioventricular block, induced by carbamazepine, prompted an extensive literature review of all previously reported cases.

In this example (PMID: 1728915), only (carbamazepine, bradycardia) and (carbamazepine, atrioventricular block) are labelled as true relations. The relation (carbamazepine, cardiac dysfunction), although true, is not labelled, as cardiac dysfunction is a hypernym of both bradycardia and atrioventricular block. This can harm evaluation performance, as the prediction (carbamazepine, cardiac dysfunction) will be considered a false positive. Therefore, we follow previous work (Gu et al., 2016b, 2017; Verga et al., 2018; Christopoulou et al., 2019; Zhou et al., 2021) by filtering negative relations like these, whose disease entities are hypernyms of a corresponding true relation's disease entity within the same abstract, according to the hierarchy in the MeSH vocabulary.10

10 https://meshb.nlm.nih.gov
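A sketch of this filtering step; mesh_ancestors, a lookup returning the MeSH ancestors (hypernyms) of a disease, is a hypothetical helper:

```python
# Drop a candidate negative (chemical, disease) pair if its disease is a MeSH
# hypernym of the disease in some gold relation with the same chemical.

def filter_hypernyms(negative_pairs, gold_pairs, mesh_ancestors):
    kept = []
    for chemical, disease in negative_pairs:
        is_hypernym = any(
            chem == chemical and disease in mesh_ancestors(gold_disease)
            for chem, gold_disease in gold_pairs
        )
        if not is_hypernym:
            kept.append((chemical, disease))
    return kept
```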
D Parsing the model's output

At test time, our model autoregressively generates an output, token-by-token, using beam search decoding (see §3.2). In order to extract the predicted relations from this output, we apply the following steps. First, predicted token ids are converted to a string. We use the decode()11 method of the HuggingFace Transformers tokenizer (Wolf et al., 2020) to do this. For example, after calling decode() on the predicted token ids, this string might look like:

    monoamine oxidase b ; maob @GENE@ parkinson's

11 https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizerBase.decode
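A sketch of how such a decoded string could be parsed back into relation tuples under the schema of §3.1; the special-token inventories are assumptions for illustration:

```python
# Parse a decoded target string back into relation tuples, following the
# linearization schema of §3.1. The token inventories below are illustrative.

ENTITY_TOKENS = {"@GENE@", "@DISEASE@", "@CHEMICAL@"}
RELATION_TOKENS = {"@GDA@", "@CID@"}

def parse(decoded):
    relations, entities, buffer = [], [], []
    for token in decoded.split():
        if token in ENTITY_TOKENS:
            # Everything since the last special token is one entity; its
            # coreferent mentions are separated by ";".
            mentions = tuple(m.strip() for m in " ".join(buffer).split(";"))
            entities.append((mentions, token))
            buffer = []
        elif token in RELATION_TOKENS:
            relations.append((tuple(entities), token))
            entities = []
        else:
            buffer.append(token)
    return relations

parse("estrogen receptor alpha ; ESR1 @GENE@ schizophrenia @DISEASE@ @GDA@")
# -> [(((("estrogen receptor alpha", "ESR1"), "@GENE@"),
#       (("schizophrenia",), "@DISEASE@")),
#      "@GDA@")]
```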
Figure 4: A diagram depicting syntactically valid predictions during decoding at each timestep t. The log probabilities of all other possible predictions are set to a tiny value to prevent the model from producing a syntactically invalid target string. BOS is the special beginning-of-sequence token, COPY denotes any token copied from the source text, and COREF is the special token used to separate coreferent mentions (i.e. ;). ENTITY is any special entity token (e.g. @GENE@) and RELATION any special relation token (e.g. @GDA@ for gene-disease association). n̂_ents denotes the number of entities predicted by the current timestep and n_ents the expected arity of the relation. The special end-of-sequence token (not shown) is always considered valid and its log probability is never modified.
Table 6: Evaluation datasets used in this paper with details about their annotations. Inter-sentence relations (%) are the fraction of relations in the test set that cross sentence boundaries. We consider a relation intra-sentence if any sentence in the document contains at least one mention of each entity in the relation, and inter-sentence otherwise. *This differs from the estimate in Yao et al. (2019); see Appendix B.

Corpus                    | Nested mentions? | Discontinuous mentions? | Coreferent mentions? | n-ary relations? | Inter-sentence relations (%)
CDR (Li et al., 2016b)    | Yes              | Yes                     | Yes                  | No               | 29.8
GDA (Wu et al., 2019)     | Yes              | No                      | Yes                  | No               | 15.6
DGM (Jia et al., 2019)    | No               | No                      | Yes                  | Yes              | 63.5
DocRED (Yao et al., 2019) | No               | No                      | Yes                  | No               | 12.5*
…using Optuna (Akiba et al., 2019). The tuning process selects the best hyperparameters according to the validation set micro F1-score using the TPE (Tree-structured Parzen Estimator) algorithm (Bergstra et al., 2011).12 During tuning, we use greedy decoding (i.e. a beam size of one). Once optimal hyperparameters are found, we tune the beam size (bs) and length penalty (α) using a grid search over the values bs = {2, ..., 10}, with a step size of 1, and α = {0.2, ..., 2.0}, with a step size of 0.2.

12 https://optuna.readthedocs.io/en/stable/
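A sketch of this tuning loop with Optuna's TPE sampler; the search space and the train_and_eval stub are illustrative assumptions, not the paper's actual configuration:

```python
import optuna

def train_and_eval(lr, dropout):
    # Placeholder: train seq2rel with these hyperparameters and return the
    # validation set micro F1-score. Stubbed here for illustration.
    return 0.0

def objective(trial):
    # Hypothetical search space; the paper's actual space may differ.
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    return train_and_eval(lr=lr, dropout=dropout)

# TPE sampler, maximizing validation micro F1, as described above.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)
```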
G Baselines

This section contains detailed descriptions of all methods we compare to in this paper.

G.1 Pipeline-based methods

These methods are pipeline-based, assuming the entities are provided as input. Many of them construct a document-level graph using dependency parsing, heuristics, or structured attention, and then update node and edge representations using propagation.

• Christopoulou et al. (2019) propose EoG, an edge-orientated graph neural model. The nodes of the graph are constructed from mentions, entities, and sentences. Edges between nodes are initially constructed using heuristics. An iterative algorithm is then used to generate edges between nodes in the graph. Finally, a classification layer takes the representation of entity-to-entity edges as input to determine whether those entities express a relation or not. We compare to EoG in the pipeline-based setting on the CDR and GDA corpora.

• Nan et al. (2020) propose LSR (Latent Structure Refinement). A "node constructor" encodes each sentence of an input document and outputs contextual representations. Representations that correspond to mentions and tokens on the shortest dependency path in a sentence are extracted as nodes. A "dynamic reasoner" is then applied to induce a document-level graph based on the extracted nodes. The classifier uses the final representations of nodes for relation classification. We compare to LSR in the pipeline-based setting on the CDR and GDA corpora.

• Lai and Lu (2021) propose BERT-GT, which combines BERT with a graph transformer. Both BERT and the graph transformer accept the document text as input, but the graph transformer requires the neighbouring positions for each token, and the self-attention mechanism is replaced with a neighbour-attention mechanism. The hidden states of the two transformers are aggregated before classification. We compare to BERT-GT in the pipeline-based setting on the CDR and GDA corpora.

• Minh Tran et al. (2020) propose EoGANE (EoG model Augmented with Node Representations), which extends the edge-orientated model proposed by Christopoulou et al. (2019) …