Enriching BERT with Knowledge Graph Embeddings for Document Classification

Malte Ostendorff1,2, Peter Bourgonje1, Maria Berger1,
Julián Moreno-Schneider1, Georg Rehm1, Bela Gipp2

1 Speech and Language Technology, DFKI GmbH, Germany
first.last@dfki.de
2 University of Konstanz, Germany
first.last@uni-konstanz.de
Abstract

In this paper, we focus on the classification of books using short descriptive texts (blurbs) and additional metadata. Building upon BERT, a deep neural language model, we demonstrate how to combine text representations with metadata and knowledge graph embeddings, which encode author information. Compared to the standard BERT approach, we achieve considerably better results for the classification task. For a more coarse-grained classification using eight labels, we achieve an F1-score of 87.20, while a detailed classification using 343 labels yields an F1-score of 64.70. We make the source code and trained models of our experiments publicly available.

1 Introduction

With ever-increasing amounts of data available, there is an increasing need for tooling that speeds up the processing of this data and, eventually, the task of making sense of it. Because fully-automated tools that extract meaning from any given input to any desired level of detail have yet to be developed, this task is still at least supervised, and often (partially) resolved by humans; we refer to these humans as knowledge workers. Knowledge workers are professionals who have to go through large amounts of data and consolidate, prepare and process it on a daily basis. This data can originate from highly diverse portals and resources and, depending on its type or category, needs to be channelled through specific down-stream processing pipelines. We aim to create a platform for curation technologies that can deal with such data from diverse sources and that provides natural language processing (NLP) pipelines tailored to particular content types and genres, rendering this initial classification an important sub-task.

In this paper, we work with the dataset of the 2019 GermEval shared task on hierarchical text classification¹. Deep neural language models have recently evolved into a successful method for representing text. In particular, Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019) outperformed previous state-of-the-art methods by a large margin on various NLP tasks. We adopt BERT for text-based classification and extend the model with additional metadata provided in the context of the shared task, such as author, publisher, publishing date, etc.

A key contribution of this paper is the inclusion of additional (meta) data using a state-of-the-art approach for text processing. Being a transfer learning approach, it facilitates the task solution with external knowledge for a setup in which relatively little training data is available. More precisely, we enrich BERT, as our pre-trained text representation model, with knowledge graph embeddings that are based on Wikidata (Vrandecic and Krötzsch, 2014), add metadata provided by the shared task organisers (title, author(s), publishing date, etc.) and collect additional information on authors for this particular document classification task. Because we do not rely on text-based features alone but also utilize document metadata, we consider this a document classification problem. The proposed approach is an attempt to solve this problem, by way of example, for the single dataset provided by the organisers of the shared task.

¹ https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2019-hmc.html

2 Related Work

A central challenge in work on genre classification is the definition of a mode of representation that is both rigid (for theoretical purposes) and flexible (for practical purposes), and that is able to model the various dimensions and characteristics of arbitrary text genres.
The size of the challenge can be illustrated by the observation that there is no clear agreement among researchers regarding actual genre labels or their scope and consistency. There is a substantial amount of previous work on the definition of genre taxonomies, genre ontologies, or sets of labels (Biber, 1988; Lee, 2002; Sharoff, 2018; Underwood, 2014; Rehm, 2005). Since we work with the dataset provided by the organisers of the 2019 GermEval shared task, we adopt their hierarchy of labels as our genre palette. In the following, we focus on related work more relevant to our contribution.

With regard to text and document classification, BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., 2019) is a pre-trained embedding model that yields state-of-the-art results on a wide range of NLP tasks, such as question answering, textual entailment and natural language inference learning (Artetxe and Schwenk, 2018). Adhikari et al. (2019) are among the first to apply BERT to document classification. Acknowledging challenges like incorporating syntactic information or predicting multiple labels, they describe how they adapt BERT for the document classification task. In general, they introduce a fully-connected layer over the final hidden state that contains one neuron for each input token, and further optimize the model by choosing soft-max classifier parameters to weight the hidden state layer. They report state-of-the-art results in experiments based on four popular datasets. An approach exploiting Hierarchical Attention Networks is presented by Yang et al. (2016). Their model introduces a hierarchical structure to represent the hierarchical nature of a document. Yang et al. (2016) derive attention on the word and sentence level, which makes the attention mechanisms react flexibly to long- and short-distance context information during the building of the document representations. They test their approach on six large-scale text classification problems and outperform previous methods substantially, increasing accuracy by about 3 to 4 percentage points. Aly et al. (2019) (the organisers of the GermEval 2019 shared task on hierarchical text classification) use shallow capsule networks, reporting that these work well on structured data, for example in the field of visual inference, and outperform CNNs, LSTMs and SVMs in this area. They use the Web of Science (WOS) dataset and introduce a new real-world scenario dataset called the Blurb Genre Collection (BGC)².

With regard to external resources to enrich the classification task, Zhang et al. (2019) experiment with external knowledge graphs to enrich embedding information in order to ultimately improve language understanding. They use structural knowledge represented by Wikidata entities and their relations to each other. A mix of large-scale textual corpora and knowledge graphs is used to further train language representations, as in ERNIE (Sun et al., 2019), considering lexical, syntactic, and structural information. Wang et al. (2009) propose and evaluate an approach to improve text classification with knowledge from Wikipedia. Based on a bag-of-words approach, they derive a thesaurus of concepts from Wikipedia and use it for document expansion. The resulting document representation improves the performance of an SVM classifier for predicting text categories.

² Note that this is not the dataset used in the shared task.

3 Dataset and Task

Our experiments are modelled on the GermEval 2019 shared task and deal with the classification of books. The dataset contains 20,784 German books. Each record has:

• A title.

• A list of authors. The average number of authors per book is 1.13, with most books (14,970) having a single author and one outlier with 28 authors.

• A short descriptive text (blurb) with an average length of 95 words.

• A URL pointing to a page on the publisher's website.

• An ISBN number.

• The date of publication.

The books are labeled according to the hierarchy used by the German publisher Random House. This taxonomy includes a mix of genre and topical categories. It has eight top-level genre categories, 93 on the second level and 242 on the most detailed, third level.
The eight top-level labels are 'Ganzheitliches Bewusstsein' (holistic awareness/consciousness), 'Künste' (arts), 'Sachbuch' (non-fiction), 'Kinderbuch & Jugendbuch' (children and young adults), 'Ratgeber' (counselor/advisor), 'Literatur & Unterhaltung' (literature and entertainment), 'Glaube & Ethik' (faith and ethics) and 'Architektur & Garten' (architecture and garden). We refer to the shared task description³ for details on the lower levels of the ontology.

³ https://competitions.codalab.org/competitions/20139

Note that we do not have access to any of the full texts. Hence, we use the blurbs as input for BERT. Given the relatively short average length of the blurbs, this considerably decreases the amount of data points available for a single book.

The shared task is divided into two sub-tasks. Sub-task A is to classify a book, using the information provided as explained above, according to the top level of the taxonomy, selecting one or more of the eight labels. Sub-task B is to classify a book according to the detailed taxonomy, specifying labels on the second and third level of the taxonomy as well (343 labels in total). This renders both sub-tasks multi-label classification tasks.
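To make the multi-label setup concrete, the following is a minimal sketch using scikit-learn's MultiLabelBinarizer; the toy records are ours for illustration, not entries from the dataset.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy records: each book can carry several labels at once (sub-task A:
# 8 top-level genres; sub-task B: 343 labels across all three levels).
books = [
    {"title": "...", "labels": ["Ratgeber", "Gesundheit & Ernährung"]},
    {"title": "...", "labels": ["Literatur & Unterhaltung"]},
]

mlb = MultiLabelBinarizer()
# Binary indicator matrix of shape (n_books, n_distinct_labels);
# a classifier then predicts each label independently.
y = mlb.fit_transform([b["labels"] for b in books])
print(mlb.classes_)  # the label vocabulary, sorted alphabetically
print(y)             # one 0/1 row per book, one column per label
```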
Sub-task A is to classify a book, using the in-
formation provided as explained above, according The statistics (length, average, etc.) regarding
to the top-level of the taxonomy, selecting one or blurbs and titles are added in an attempt to make
more of the eight labels. Sub-task B is to classify a certain characteristics explicit to the classifier. For
book according to the detailed taxonomy, specify- example, books labeled ‘Kinderbuch & Jugend-
ing labels on the second and third level of the tax- buch’ (children and young adults) have a title that
onomy as well (in total 343 labels). This renders is on average 5.47 words long, whereas books la-
both sub-tasks a multi-label classification task. beled ‘Künste’ (arts) on average have shorter titles
of 3.46 words. The binary feature for academic ti-
4 Experiments tle is based on the assumption that academics are
As indicated in Section 1, we base our experiments more likely to write non-fiction. The gender fea-
on BERT in order to explore if it can be success- ture is included to explore (and potentially exploit)
fully adopted to the task of book or document clas- whether or not there is a gender-bias for particular
sification. We use the pre-trained models and en- genres.
rich them with additional metadata and tune the 4.2 Author Embeddings
models for both classification sub-tasks.
Whereas one should not judge a book by its cover,
4.1 Metadata Features we argue that additional information on the au-
In addition to the metadata provided by the organ- thor can support the classification task. Authors
isers of the shared task (see Section 3), we add the often adhere to their specific style of writing and
following features. are likely to specialize in a specific genre.
To be precise, we want to include author iden-
• Number of authors. tity information, which can be retrieved by se-
lecting particular properties from, for example,
• Academic title (Dr. or Prof.), if found in au- the Wikidata knowledge graph (such as date of
thor names (0 or 1). birth, nationality, or other biographical features).
A drawback of this approach, however, is that one
• Number of words in title.
has to manually select and filter those properties
• Number of words in blurb. that improve classification performance. This is
why, instead, we follow a more generic approach
• Length of longest word in blurb. and utilize automatically generated graph embed-
dings as author representations.
• Mean word length in blurb.
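The sketch below is one plausible reading of the feature list above; the record schema (dict keys) and the gender lookup table are our assumptions, not the shared task code.

```python
from datetime import date

def metadata_features(book, gender_prob):
    """Derive the handcrafted metadata features for one book record.

    `book` is assumed to be a dict with keys 'title', 'blurb',
    'authors' (list of name strings) and 'published' (datetime.date);
    `gender_prob` maps a first name to P(male). Both are illustrative
    stand-ins for the actual data structures.
    """
    blurb_words = book["blurb"].split()
    first_name = book["authors"][0].split()[0] if book["authors"] else ""
    word_lens = sorted(len(w) for w in blurb_words)
    median_len = word_lens[len(word_lens) // 2] if word_lens else 0
    return [
        len(book["authors"]),                                  # number of authors
        int(any(t in a for a in book["authors"]                # academic title (0/1)
                for t in ("Dr.", "Prof."))),
        len(book["title"].split()),                            # words in title
        len(blurb_words),                                      # words in blurb
        max((len(w) for w in blurb_words), default=0),         # longest word in blurb
        sum(word_lens) / max(len(word_lens), 1),               # mean word length
        median_len,                                            # median word length
        date.today().year - book["published"].year,            # age in years
        gender_prob.get(first_name, 0.5),                      # P(first author male);
                                                               # 0.5 fallback is our choice
    ]
```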
4.2 Author Embeddings

Whereas one should not judge a book by its cover, we argue that additional information on the author can support the classification task. Authors often adhere to a specific style of writing and are likely to specialize in a specific genre.

To be precise, we want to include author identity information, which can be retrieved by selecting particular properties from, for example, the Wikidata knowledge graph (such as date of birth, nationality, or other biographical features). A drawback of this approach, however, is that one has to manually select and filter those properties that improve classification performance. This is why, instead, we follow a more generic approach and utilize automatically generated graph embeddings as author representations.
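A minimal sketch of such a lookup follows, under the assumption of a precomputed name-to-Wikidata-ID mapping and a plain-text file of pretrained entity embeddings; the file format, the 200-dimensional size and the zero-vector fallback are our illustrative choices, not the released pipeline.

```python
import numpy as np

EMB_DIM = 200  # assumed dimensionality of the pretrained graph embeddings

def load_entity_embeddings(path):
    """Load pretrained Wikidata entity embeddings, one entity per line:
    '<wikidata_id> <v_1> ... <v_n>' (an assumed plain-text format)."""
    emb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            emb[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return emb

def author_vector(authors, name_to_qid, emb):
    """Average the graph embeddings of all matched authors. Books whose
    authors cannot be linked to Wikidata get a zero vector (cf. the
    ~72% author embedding coverage reported in Table 1)."""
    vecs = [emb[name_to_qid[a]] for a in authors
            if a in name_to_qid and name_to_qid[a] in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(EMB_DIM, dtype=np.float32)
```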
[Figure: model architecture. The title/blurb text is encoded by BERT (12 layers); the resulting text representation is concatenated with the metadata features and the author embeddings and fed into a 2-layer MLP, followed by the output layer.]
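Read as code, the architecture in this figure could look roughly like the following PyTorch sketch. The checkpoint name, hidden size, dropout and feature dimensions are our illustrative assumptions, not the authors' released configuration; for the multi-label sub-tasks, the logits would be trained with a per-label binary loss (e.g. torch.nn.BCEWithLogitsLoss) and thresholded sigmoids at prediction time.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class EnrichedBertClassifier(nn.Module):
    """BERT text representation concatenated with metadata features and
    author graph embeddings, followed by a 2-layer MLP (as in the figure)."""

    def __init__(self, n_labels, n_meta=9, n_author=200, hidden=512):
        super().__init__()
        # German BERT checkpoint name is an assumption for illustration.
        self.bert = BertModel.from_pretrained("bert-base-german-cased")
        in_dim = self.bert.config.hidden_size + n_meta + n_author
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, n_labels),  # output layer: one logit per label
        )

    def forward(self, input_ids, attention_mask, meta, author_emb):
        # Pooled [CLS] representation of the title/blurb text.
        text = self.bert(input_ids=input_ids,
                         attention_mask=attention_mask).pooler_output
        x = torch.cat([text, meta, author_emb], dim=-1)  # "Concatenate"
        return self.mlp(x)  # raw logits; apply sigmoid at inference
```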
[Table 2 (values not preserved in this extraction): Evaluation scores (micro avg.) on the validation set with respect to the features used for classification. The model with BERT-German, metadata and author embeddings yields the highest F1-scores on both tasks and was accordingly submitted to the GermEval 2019 competition. The scores in the last row are the results on the test set as reported by Remus et al., 2019.]
Title:            Coenzym Q10
Author:           Dr. med. Gisela Rauch-Petz
Correct labels:   Ratgeber (I); Gesundheit & Ernährung (II)
Predicted labels: Gesundheit & Ernährung (II)

Table 3: Book examples and their correct and predicted labels. The hierarchical label level is in parenthesis.
[...]mance. The average number of words per blurb is 95, and only 0.25% of books exceed our cut-off point of 300 words per blurb. In addition, the distribution of labeled books is imbalanced, i.e., for many classes only a single-digit number of training instances exists (Fig. 3). Thus, this task can be considered a low-resource scenario, where including related data (such as author embeddings and author identity features such as gender and academic title) or making certain characteristics more explicit (title and blurb length statistics) helps. Furthermore, it should be noted that the blurbs do not provide summary-like abstracts of the books, but instead act as teasers, intended to persuade the reader to buy the book.

As reflected by the recent popularity of deep transformer models, they considerably outperform the Logistic Regression baseline that uses a TF-IDF representation of the blurbs. However, for the simpler sub-task A, the performance difference between the baseline model and the multilingual BERT model is only six points, while the baseline consumes only a fraction of BERT's computing resources. The BERT model trained for German (from scratch) outperforms the multilingual BERT model by under three points for sub-task A and over six points for sub-task B, confirming the findings reported by the creators of the BERT-German models for earlier GermEval shared tasks.

While the scores are generally on par for sub-task A¹¹, for sub-task B there is a relatively large discrepancy between precision and recall scores. In all setups, precision is considerably higher than recall. We expect this to be down to the fact that for some of the 343 labels in sub-task B, there are very few instances. This means that if the classifier predicts a certain label, it is likely to be correct (i.e., high precision), but for many instances carrying low-frequency labels, this low-frequency label is never predicted (i.e., low recall).

¹¹ Except for the Author-only setup.
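A small constructed example (our toy numbers, not shared task results) shows how never predicting rare labels produces exactly this pattern under micro-averaging:

```python
from sklearn.metrics import precision_score, recall_score

# Toy multi-label ground truth over 4 labels; the last two labels are
# rare and, mirroring our sub-task B observation, are never predicted.
y_true = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]

# Every emitted label is correct, but 3 of the 7 true label
# assignments are never found.
print(precision_score(y_true, y_pred, average="micro"))  # 1.0
print(recall_score(y_true, y_pred, average="micro"))     # 4/7 ~ 0.57
```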
As mentioned in Section 4.4, we neglect the hierarchical nature of the labels and flatten the hierarchy (with a depth of three levels) to a single set of 343 labels for sub-task B.
We expect this to have a negative impact on performance, because it allows a scenario in which, for a particular book, we predict a label from the first level and also a non-matching label from the second level of the hierarchy. The example Coenzym Q10 (Table 3) demonstrates this issue. While the model correctly predicts the second-level label Gesundheit & Ernährung (health & diet), it misses the corresponding first-level label Ratgeber (advisor). Given the model's tendency towards higher precision rather than recall in sub-task B, as a post-processing step we may want to take the most detailed label (on the third level of the hierarchy) to be correct and manually fix the higher-level labels accordingly. We leave this for future work, noting that we expect it to improve performance, but it is hard to say by how much.
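As a sketch of an automated version of this fix: trust the most specific predicted label and restore its missing ancestors. The parent map below is a hypothetical stand-in for the Random House taxonomy, of which we only show the one edge from the Coenzym Q10 example.

```python
# Map each label to its parent in the (assumed) three-level taxonomy.
PARENT = {
    "Gesundheit & Ernährung": "Ratgeber",  # level II -> level I
    # ... remaining taxonomy edges ...
}

def enforce_hierarchy(predicted):
    """Take the most specific predicted labels to be correct and add any
    missing ancestors, so level-II/III predictions imply their genre."""
    labels = set(predicted)
    for label in predicted:
        parent = PARENT.get(label)
        while parent is not None and parent not in labels:
            labels.add(parent)
            parent = PARENT.get(parent)
    return sorted(labels)

# The Coenzym Q10 example from Table 3: the missing first-level label
# Ratgeber is restored from its predicted second-level child.
print(enforce_hierarchy(["Gesundheit & Ernährung"]))
```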
We hypothesize that an MLP with more and bigger layers could improve the classification performance. However, this would increase the number of parameters to be trained, and thus require more training data (such as the book's text itself, or a summary of it).
[...] and publication metadata substantially improves the classification compared to a text-only approach. Especially when metadata feature engineering is non-trivial, adding task-specific information from an external knowledge source such as Wikidata can help significantly. The source code of our experiments and the trained models are publicly available¹².

Future work comprises the use of hierarchical information in a post-processing step to refine the classification. Another promising approach to tackle the low-resource problem of sub-task B would be to use label embeddings. Many labels are similar and semantically related; these relationships between labels can be utilized by modelling them in a joint embedding space (Augenstein et al., 2018). However, a severe challenge with regard to setting up label embeddings is the quite heterogeneous category system often found in use online. The Random House taxonomy (see above) includes category names, i.e., labels, that relate to several different dimensions including, among others, genre, topic and function.

This work is done in the context of a larger project that develops a platform for curation technologies. Under the umbrella of this project, the classification of pieces of incoming text content [...]

[Figure 3: distribution of labels across the dataset; y-axis: number of label classes.]
References

Mikel Artetxe and Holger Schwenk. 2018. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. CoRR, abs/1812.10464.

Isabelle Augenstein, Sebastian Ruder, and Anders Søgaard. 2018. Multi-task learning of pairwise sequence classification tasks over disparate label spaces. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1896-1906, New Orleans, Louisiana. Association for Computational Linguistics.

Douglas Biber. 1988. Variation across Speech and Writing. Cambridge University Press.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

David Y. W. Lee. 2002. Genres, Registers, Text Types, Domains, and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle. Language Learning and Technology, 5(3):37-72.

Ted Underwood. 2014. Understanding Genre in a Collection of a Million Volumes, Interim Report. figshare.

Denny Vrandecic and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78-85.

Pu Patrick Wang, Jingjie Hu, Hua-Jun Zeng, and Zhigang Chen. 2009. Using Wikipedia knowledge to improve text classification. Knowledge and Information Systems, 19(3):265-281.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480-1489, San Diego, California. Association for Computational Linguistics.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441-1451, Florence, Italy. Association for Computational Linguistics.