Enriching BERT With Knowledge Graph Embeddings For Document Classification

Enriching BERT with Knowledge Graph Embeddings for Document
Classification
Malte Ostendorff1,2 , Peter Bourgonje1 , Maria Berger1 ,
Julián Moreno-Schneider1 , Georg Rehm1 , Bela Gipp2
1
Speech and Language Technology, DFKI GmbH, Germany
first.last@dfki.de
2
University of Konstanz, Germany
first.last@uni-konstanz.de
Abstract classification (Remus et al., 2019) and use the pre-

defined set of labels to evaluate our approach to
In this paper we focus on the classification
this classification task1 .
of books using short descriptive texts (cover
arXiv:1909.08402v1 [cs.CL] 18 Sep 2019
blurbs) and additional metadata. Building Deep neural language models have recently
upon BERT, a deep neural language model, we evolved to a successful method for representing
demonstrate how to combine text representa- text. In particular, Bidirectional Encoder Rep-
tions with metadata and knowledge graph em- resentations from Transformers (BERT; Devlin
beddings, which encode author information. et al., 2019) outperformed previous state-of-the-
Compared to the standard BERT approach we art methods by a large margin on various NLP
achieve considerably better results for the clas-
tasks. We adopt BERT for text-based classifica-
sification task. For a more coarse-grained clas-
sification using eight labels we achieve an F1- tion and extend the model with additional meta-
score of 87.20, while a detailed classification data provided in the context of the shared task,
using 343 labels yields an F1-score of 64.70. such as author, publisher, publishing date, etc.
We make the source code and trained models A key contribution of this paper is the inclu-
of our experiments publicly available. sion of additional (meta) data using a state-of-the-
art approach for text processing. Being a transfer
1 Introduction
learning approach, it facilitates the task solution
With ever-increasing amounts of data available, with external knowledge for a setup in which rela-
there is an increase in the need to offer tooling tively little training data is available. More pre-
to speed up processing, and eventually making cisely, we enrich BERT, as our pre-trained text
sense of this data. Because fully-automated tools representation model, with knowledge graph em-
to extract meaning from any given input to any beddings that are based on Wikidata (Vrandecic
desired level of detail have yet to be developed, and Krötzsch, 2014), add metadata provided by
this task is still at least supervised, and often (par- the shared task organisers (title, author(s), publish-
tially) resolved by humans; we refer to these hu- ing date, etc.) and collect additional information
mans as knowledge workers. Knowledge work- on authors for this particular document classifica-
ers are professionals that have to go through large tion task. As we do not rely on text-based fea-
amounts of data and consolidate, prepare and pro- tures alone but also utilize document metadata, we
cess it on a daily basis. This data can originate consider this as a document classification problem.
from highly diverse portals and resources and de- The proposed approach is an attempt to solve this
pending on type or category, the data needs to be problem exemplary for single dataset provided by
channelled through specific down-stream process- the organisers of the shared task.
ing pipelines. We aim to create a platform for cu-
ration technologies that can deal with such data 2 Related Work
from diverse sources and that provides natural lan- A central challenge in work on genre classification
guage processing (NLP) pipelines tailored to par- is the definition of a both rigid (for theoretical pur-
ticular content types and genres, rendering this ini- poses) and flexible (for practical purposes) mode
tial classification an important sub-task. 1
https://www.inf.uni-hamburg.
In this paper, we work with the dataset of the de/en/inst/ab/lt/resources/data/
2019 GermEval shared task on hierarchical text germeval-2019-hmc.html
of representation that is able to model various di- outperform CNNs, LSTMs and SVMs in this area.
mensions and characteristics of arbitrary text gen- They use the Web of Science (WOS) dataset and
res. The size of the challenge can be illustrated introduce a new real-world scenario dataset called
by the observation that there is no clear agreement Blurb Genre Collection (BGC)2 .
among researchers regarding actual genre labels or With regard to external resources to enrich
their scope and consistency. There is a substan- the classification task, Zhang et al. (2019) ex-
tial amount of previous work on the definition of periment with external knowledge graphs to en-
genre taxonomies, genre ontologies, or sets of la- rich embedding information in order to ultimately
bels (Biber, 1988; Lee, 2002; Sharoff, 2018; Un- improve language understanding. They use struc-
derwood, 2014; Rehm, 2005). Since we work with tural knowledge represented by Wikidata enti-
the dataset provided by the organisers of the 2019 ties and their relation to each other. A mix of
GermEval shared task, we adopt their hierarchy of large-scale textual corpora and knowledge graphs
labels as our genre palette. In the following, we is used to further train language representation
focus on related work more relevant to our contri- exploiting ERNIE (Sun et al., 2019), consider-
bution. ing lexical, syntactic, and structural information.
Wang et al. (2009) propose and evaluate an ap-
With regard to text and document classifica- proach to improve text classification with knowl-
tion, BERT (Bidirectional Encoder Representa- edge from Wikipedia. Based on a bag of words
tions from Transformers) (Devlin et al., 2019) is approach, they derive a thesaurus of concepts from
a pre-trained embedding model that yields state Wikipedia and use it for document expansion. The
of the art results in a wide span of NLP tasks, resulting document representation improves the
such as question answering, textual entailment and performance of an SVM classifier for predicting
natural language inference learning (Artetxe and text categories.
Schwenk, 2018). Adhikari et al. (2019) are among
the first to apply BERT to document classifica- 3 Dataset and Task
tion. Acknowledging challenges like incorporat-
Our experiments are modelled on the GermEval
ing syntactic information, or predicting multiple
2019 shared task and deal with the classification
labels, they describe how they adapt BERT for
of books. The dataset contains 20,784 German
the document classification task. In general, they
books. Each record has:
introduce a fully-connected layer over the final
hidden state that contains one neuron each repre- • A title.
senting an input token, and further optimize the
model choosing soft-max classifier parameters to • A list of authors. The average number of
weight the hidden state layer. They report state authors per book is 1.13, with most books
of the art results in experiments based on four (14,970) having a single author and one out-
popular datasets. An approach exploiting Hierar- lier with 28 authors.
chical Attention Networks is presented by Yang
et al. (2016). Their model introduces a hierarchi- • A short descriptive text (blurb) with an aver-
cal structure to represent the hierarchical nature age length of 95 words.
of a document. Yang et al. (2016) derive atten- • A URL pointing to a page on the publisher’s
tion on the word and sentence level, which makes website.
the attention mechanisms react flexibly to long and
short distant context information during the build- • An ISBN number.
ing of the document representations. They test
their approach on six large scale text classification • The date of publication.
problems and outperform previous methods sub-
The books are labeled according to the hierar-
stantially by increasing accuracy by about 3 to 4
chy used by the German publisher Random House.
percentage points. Aly et al. (2019) (the organisers
This taxonomy includes a mix of genre and top-
of the GermEval 2019 shared task on hierarchical
ical categories. It has eight top-level genre cat-
text classification) use shallow capsule networks,
egories, 93 on the second level and 242 on the
reporting that these work well on structured data
2
for example in the field of visual inference, and Note that this is not the dataset used in the shared task.
most detailed third level. The eight top-level Train Validation Test
labels are ‘Ganzheitliches Bewusstsein’ (holistic 12,681 1,834 3,641
Gender
awareness/consciousness), ‘Künste’ (arts), ‘Sach- (87%) (88%) (88%)
buch’ (non-fiction), ‘Kinderbuch & Jugendbuch’ 10,407 1,549 3,010
Author emb.
(children and young adults), ‘Ratgeber’ (coun- (72%) (75%) (72%)
selor/advisor), ‘Literatur & Unterhaltung’ (liter- Total books 14,548 2,079 4,157
ature and entertainment), ‘Glaube & Ethik’ (faith
and ethics), ‘Architektur & Garten’ (architecture Table 1: Availability of additional data with respect to
the dataset (relative numbers in parenthesis).
and garden). We refer to the shared task descrip-
tion3 for details on the lower levels of the ontol-
ogy. • Median word length in blurb.
Note that we do not have access to any of the
full texts. Hence, we use the blurbs as input for • Age in years after publication date.
BERT. Given the relatively short average length of • Probability of first author being male or fe-
the blurbs, this considerably decreases the amount male based on the Gender-by-Name dataset4 .
of data points available for a single book. Available for 87% of books in training set
The shared task is divided into two sub-task. (see Table 1).
Sub-task A is to classify a book, using the in-
formation provided as explained above, according The statistics (length, average, etc.) regarding
to the top-level of the taxonomy, selecting one or blurbs and titles are added in an attempt to make
more of the eight labels. Sub-task B is to classify a certain characteristics explicit to the classifier. For
book according to the detailed taxonomy, specify- example, books labeled ‘Kinderbuch & Jugend-
ing labels on the second and third level of the tax- buch’ (children and young adults) have a title that
onomy as well (in total 343 labels). This renders is on average 5.47 words long, whereas books la-
both sub-tasks a multi-label classification task. beled ‘Künste’ (arts) on average have shorter titles
of 3.46 words. The binary feature for academic ti-
4 Experiments tle is based on the assumption that academics are
As indicated in Section 1, we base our experiments more likely to write non-fiction. The gender fea-
on BERT in order to explore if it can be success- ture is included to explore (and potentially exploit)
fully adopted to the task of book or document clas- whether or not there is a gender-bias for particular
sification. We use the pre-trained models and en- genres.
rich them with additional metadata and tune the 4.2 Author Embeddings
models for both classification sub-tasks.
Whereas one should not judge a book by its cover,
4.1 Metadata Features we argue that additional information on the au-
In addition to the metadata provided by the organ- thor can support the classification task. Authors
isers of the shared task (see Section 3), we add the often adhere to their specific style of writing and
following features. are likely to specialize in a specific genre.
To be precise, we want to include author iden-
• Number of authors. tity information, which can be retrieved by se-
lecting particular properties from, for example,
• Academic title (Dr. or Prof.), if found in au- the Wikidata knowledge graph (such as date of
thor names (0 or 1). birth, nationality, or other biographical features).
A drawback of this approach, however, is that one
• Number of words in title.
has to manually select and filter those properties
• Number of words in blurb. that improve classification performance. This is
why, instead, we follow a more generic approach
• Length of longest word in blurb. and utilize automatically generated graph embed-
dings as author representations.
• Mean word length in blurb.
4
Probability of given names being male/female based on
3
https://competitions.codalab.org/ US names from 1930-2015. https://data.world/
competitions/20139 howarder/gender-by-name
Title Text Metadata Author Embeddings
BERT
12 layers
Concatenate
2-layer MLP
Output Layer
Figure 2: Model architecture used in our experiments.

Text-features are fed through BERT, concatenated with
metadata and author embeddings and combined in a
multilayer perceptron (MLP).
Figure 1: Visualization of Wikidata embeddings
for Franz Kafka (3D-projection with PCA)5 . Nearest
neighbours in original 200D space: Arthur Schnitzler, pre-trained on German text, as published by the
E.T.A Hoffmann and Hans Christian Andersen. German company Deepset AI8 . This model was
trained from scratch on the German Wikipedia,
news articles and court decisions9 . Deepset AI
Graph embedding methods create dense vector
reports better performance for the German BERT
representations for each node such that distances
models compared to the multilingual models on
between these vectors predict the occurrence of
previous German shared tasks (GermEval2018-
edges in the graph. The node distance can be in-
Fine and GermEval 2014).
terpreted as topical similarity between the corre-
sponding authors. 4.4 Model Architecture
We rely on pre-trained embeddings based on
PyTorch BigGraph (Lerer et al., 2019). The graph Our neural network architecture, shown in Fig-
model is trained on the full Wikidata graph, us- ure 2, resembles the original BERT model (Devlin
ing a translation operator to represent relations6 . et al., 2019) and combines text- and non-text fea-
Figure 1 visualizes the locality of the author em- tures with a multilayer perceptron (MLP).
beddings. The BERT architecture uses 12 hidden layers,
To derive the author embeddings, we look up each layer consists of 768 units. To derive con-
Wikipedia articles that match with the author textualized representations from textual features,
names and map the articles to the corresponding the book title and blurb are concatenated and then
Wikidata items7 . If a book has multiple authors, fed through BERT. To minimize the GPU mem-
the embedding of the first author for which an ory consumption, we limit the input length to 300
embedding is available is used. Following this tokens (which is shorter than BERT’s hard-coded
method, we are able to retrieve embeddings for limit of 512 tokens). Only 0.25% of blurbs in the
72% of the books in the training and test set (see training set consist of more than 300 words, so this
Table 1). cut-off can be expected to have minor impact.
The non-text features are generated in a sep-
4.3 Pre-trained German Language Model arate preprocessing step. The metadata features
Although the pre-trained BERT language models are represented as a ten-dimensional vector (two
are multilingual and, therefore, support German, dimensions for gender, see Section 4.1). Author
we rely on a BERT model that was exclusively embedding vectors have a length of 200 (see Sec-
6 8
Pre-trained Knowledge Graph Embeddings. Details on BERT-German training procedure: https:
https://github.com/facebookresearch/ //deepset.ai/german-bert
PyTorch-BigGraph#pre-trained-embeddings 9
German legal documents used to train BERT-German:
7
Mapping Wikipedia pages to Wikidata IDs and vice http://openlegaldata.io/research/2019/
versa. https://github.com/jcklie/wikimapper 02/19/court-decision-dataset.html
tion 4.2). In the next step, all three representa- models that are trained on the training set and
tions are concatenated and passed into a MLP with evaluated on the validation set. For the final sub-
two layers, 1024 units each and ReLu activation mission to the shared task competition, the best-
function. During training, the MLP is supposed to scoring setup is used and trained on the training
learn a non-linear combination of its input repre- and validation sets combined.
sentations. Finally, the output layer does the actual We are able to demonstrate that incorporating
classification. In the SoftMax output layer each metadata features and author embeddings leads to
unit corresponds to a class label. For sub-task A better results for both sub-tasks. With an F1-score
the output dimension is eight. We treat sub-task of 87.20 for task A and 64.70 for task B, the setup
B as a standard multi-label classification problem, using BERT-German with metadata features and
i. e., we neglect any hierarchical information. Ac- author embeddings (1) outperforms all other se-
cordingly, the output layer for sub-task B has 343 tups. Looking at the precision score only, BERT-
units. When the value of an output unit is above German with metadata features (2) but without au-
a given threshold the corresponding label is pre- thor embeddings performs best.
dicted, whereby thresholds are defined separately In comparison to the baseline (7), our evaluation
for each class. The optimum was found by varying shows that deep transformer models like BERT
the threshold in steps of 0.1 in the interval from 0 considerably outperform the classical TF-IDF ap-
to 1. proach, also when the input is the same (using
the title10 and blurb only). BERT-German (4) and
4.5 Implementation BERT-Multilingual (5) are only using text-based
Training is performed with batch size b = 16, features (title and blurb), whereby the text repre-
dropout probability d = 0.1, learning rate η = sentations of the BERT-layers are directly fed into
2−5 (Adam optimizer) and 5 training epochs. the classification layer.
These hyperparameters are the ones proposed by To establish the information gain of author em-
Devlin et al. (2019) for BERT fine-tuning. We beddings, we train a linear classifier on author
did not experiment with hyperparameter tuning embeddings, using this as the only feature. The
ourselves except for optimizing the classification author-only model (6) is exclusively evaluated on
threshold for each class separately. All experi- books for which author embeddings are available,
ments are run on a GeForce GTX 1080 Ti (11 so the numbers are based on a slightly smaller val-
GB), whereby a single training epoch takes up to idation set. With an F1-score of 61.99 and 32.13
10min. If there is no single label for which predic- for sub-tasks A and B, respectively, the author
tion probability is above the classification thresh- model yields the worst result. However, the infor-
old, the most popular label (Literatur & Unterhal- mation contained in the author embeddings help
tung) is used as prediction. improve performance, as the results of the best-
performing setup show. When evaluating the best
4.6 Baseline model (1) only on books for that author embed-
To compare against a relatively simple baseline, dings are available, we find a further improvement
we implemented a Logistic Regression classifier with respect to F1 score (task A: from 87.20 to
chain from scikit-learn (Pedregosa et al., 2011). 87.81; task B: 64.70 to 65.74).
This baseline uses the text only and converts it
to TF-IDF vectors. As with the BERT model, it 6 Discussion
performs 8-class multi-label classification for sub-
The best performing setup uses BERT-German
task A and 343-class multi-label classification for
with metadata features and author embeddings.
sub-task B, ignoring the hierarchical aspect in the
In this setup the most data is made available to
labels.
the model, indicating that, perhaps not surpris-
ingly, more data leads to better classification per-
5 Results
formance. We expect that having access to the ac-
Table 2 shows the results of our experiments. As tual text of the book will further increase perfor-
prescribed by the shared task, the essential evalu- 10
The baseline model uses the blurbs only, without the ti-
ation metric is the micro-averaged F1-score. All tle, but we do not expect that including the title in the input
scores reported in this paper are obtained using would make up for the considerable gap between the two.
Sub-Task A – 8 labels Sub-Task B – 343 labels
Model / Features F1 Prec. Recall F1 Prec. Recall
(1) BERT-German + Metadata + Author 87.20 88.76 85.70 64.70 83.78 52.70
(2) BERT-German + Metadata 86.90 89.65 84.30 63.96 83.94 51.67
(3) BERT-German + Author 86.84 89.02 84.75 64.41 82.02 53.03
(4) BERT-German 86.65 89.65 83.86 60.51 83.44 47.47
(5) BERT-Base-Multilingual-Cased 83.94 86.31 81.70 54.08 82.63 40.19
(6) Author 61.99 75.59 52.54 32.13 72.39 20.65
(7) Baseline 77.00 79.00 74.00 45.00 67.00 34.00
Results of best model (1) on test set 88.00 85.00 86.00 78.00 52.00 62.00
Table 2: Evaluation scores (micro avg.) on validation set with respect to the features used for classification. The
model with BERT-German, metadata and author embeddings yields the highest F1-scores on both tasks and was
accordingly submitted to the GermEval 2019 competition. The scores in the last row are the result on the test set
as reported by Remus et al., 2019.
Title / Author Correct Labels Predicted Labels
Coenzym Q10
Ratgeber (I); Gesundheit & Ernährung (II) Gesundheit & Ernährung (II)
Dr. med. Gisela Rauch-Petz
Gelebte Wertschätzung Glaube & Ethik (I);

Sachbuch (I); Politik & Gesellschaft (II)
Barbara von Meibom Psychologie & Spiritualität (II)
Wie Romane entstehen Literatur & Unterhaltung (I); Sachbuch (I);
Hanns-Josef Ortheil, Romane & Erzählungen (II); Literatur & Unterhaltung (I)
Klaus Siblewski Briefe, Essays, Gespräche (II)
Das Grab ist erst der Anfang Literatur & Unterhaltung (I); Literatur & Unterhaltung (I);
Kathy Reichs Krimi & Thriller (II) Krimi & Thriller (II)
Table 3: Book examples and their correct and predicted labels. Hierarchical label level is in parenthesis.
mance. The average number of words per blurb ing only a fraction of BERT’s computing re-
is 95 and only 0.25% of books exceed our cut-off sources. The BERT model trained for German
point of 300 words per blurb. In addition, the dis- (from scratch) outperforms the multilingual BERT
tribution of labeled books is imbalanced, i.e. for model by under three points for sub-task A and
many classes only a single digit number of training over six points for sub-task B, confirming the find-
instances exist (Fig. 3). Thus, this task can be con- ings reported by the creators of the BERT-German
sidered a low resource scenario, where including models for earlier GermEval shared tasks.
related data (such as author embeddings and au- While generally on par for sub-task A11 , for
thor identity features such as gender and academic sub-task B there is a relatively large discrepancy
title) or making certain characteristics more ex- between precision and recall scores. In all setups,
plicit (title and blurb length statistics) helps. Fur- precision is considerably higher than recall. We
thermore, it should be noted that the blurbs do expect this to be down to the fact that for some
not provide summary-like abstracts of the book, of the 343 labels in sub-task B, there are very
but instead act as teasers, intended to persuade the few instances. This means that if the classifier
reader to buy the book. predicts a certain label, it is likely to be correct
(i. e., high precision), but for many instances hav-
As reflected by the recent popularity of deep
ing low-frequency labels, this low-frequency label
transformer models, they considerably outperform
is never predicted (i. e., low recall).
the Logistic Regression baseline using TF-IDF
As mentioned in Section 4.4, we neglect the hi-
representation of the blurbs. However, for the
erarchical nature of the labels and flatten the hi-
simpler sub-task A, the performance difference
erarchy (with a depth of three levels) to a sin-
between the baseline model and the multilingual
11
BERT model is only six points, while consum- Except for the Author-only setup.
gle set of 343 labels for sub-task B. We expect and publication metadata improves the classifi-
this to have negative impact on performance, be- cation task essentially compared a text-only ap-
cause it allows a scenario in which, for a par- proach. Especially, when metadata feature en-
ticular book, we predict a label from the first gineering is less trivial, adding additional task-
level and also a non-matching label from the sec- specific information from an external knowledge
ond level of the hierarchy. The example Coen- source such as Wikidata can help significantly.
zym Q10 (Table 3) demonstrates this issue. While The source code of our experiments and the
the model correctly predicts the second level label trained models are publicly available12 .
Gesundheit & Ernährung (health & diet), it misses
Future work comprises the use of hierarchi-
the corresponding first level label Ratgeber (advi-
cal information in a post-processing step to refine
sor). Given the model’s tendency to higher pre-
the classification. Another promising approach to
cision rather than recall in sub-task B, as a post-
tackle the low resource problem for task B would
processing step we may want to take the most de-
be to use label embeddings. Many labels are simi-
tailed label (on the third level of the hierarchy) to
lar and semantically related. The relationships be-
be correct and manually fix the higher level labels
tween labels can be utilized to model in a joint
accordingly. We leave this for future work and
embedding space (Augenstein et al., 2018). How-
note that we expect this to improve performance,
ever, a severe challenge with regard to setting up
but it is hard to say by how much. We hypothesize
label embeddings is the quite heterogeneous cat-
that an MLP with more and bigger layers could
egory system that can often be found in use on-
improve the classification performance. However,
line. The Random House taxonomy (see above)
this would increase the number of parameters to be
includes category names, i. e., labels, that relate
trained, and thus requires more training data (such
to several different dimensions including, among
as the book’s text itself, or a summary of it).
others, genre, topic and function.
This work is done in the context of a larger
project that develops a platform for curation tech-
120
nologies. Under the umbrella of this project, the
100 classification of pieces of incoming text content
Number of label classes
80 according to an ontology is an important step that

allows the routing of this content to particular, spe-
60
cialized processing workflows, including parame-
40 terising the included pipelines. Depending on con-
20 tent type and genre, it may make sense to apply
OCR post-processing (for digitized books from
0
1-9 10-19 20-29 30-39 40-49 50 centuries ago), machine translation (for content in
Available number of samples per label in training set
languages unknown to the user), information ex-
Figure 3: In sub-task B for many low-hierarchical la- traction, or other particular and specialized proce-
bels only a small number of training samples exist, dures. Constructing such a generic ontology for
making it more difficult to predict the correct label. digital content is a challenging task, and classifi-
cation performance is heavily dependent on input
data (both in shape and amount) and on the na-
7 Conclusions and Future Work ture of the ontology to be used (in the case of this
paper, the one predefined by the shared task organ-
In this paper we presented a way of enriching isers). In the context of our project, we continue to
BERT with knowledge graph embeddings and ad- work towards a maximally generic content ontol-
ditional metadata. Exploiting the linked knowl- ogy, and at the same time towards applied classi-
edge that underlies Wikidata improves perfor- fication architectures such as the one presented in
mance for our task of document classification. this paper.
With this approach we improve the standard BERT
models by up to four percentage points in accu-
racy. Furthermore, our results reveal that with
12
task-specific information such as author names https://ostendorff.org/r/germeval19
Acknowledgments Fabian Pedregosa, Gael Varoquaux, Alexandre Gram-
fort, Vincent Michel, Bertrand Thirion, Olivier
This research is funded by the German Fed- Grisel, Mathieu Blondel, Peter Prettenhofer, Ron
eral Ministry of Education and Research (BMBF) Weiss, Vincent Dubourg, Jake Vanderplas, Alexan-
through the “Unternehmen Region”, instru- dre Passos, David Cournapeau, Matthieu Brucher,
Matthieu Perrot, and Edouard Duchesnay. 2011.
ment “Wachstumskern” QURATOR (grant no. Scikit-learn: Machine Learning in Python. Journal
03WKDA1A). We would like to thank the anony- of Machine Learning Research, 12:2825–2830.
mous reviewers for comments on an earlier ver-
Georg Rehm. 2005. Hypertextsorten: Definition –
sion of this manuscript. Struktur – Klassifikation. Ph.D. thesis, Justus-
Liebig-Universität Gießen, Norderstedt.
References Steffen Remus, Rami Aly, and Chris Biemann. 2019.

GermEval 2019 Task 1 : Hierarchical Classification
Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and of Blurbs. In Proceedings of the GermEval 2019
Jimmy Lin. 2019. Docbert: BERT for document Workshop, pages 1–13, Erlangen, Germany.
classification. CoRR, abs/1904.08398.
Serge Sharoff. 2018. Functional text dimensions for
Rami Aly, Steffen Remus, and Chris Biemann. 2019. the annotation of web corpora. Corpora, 13(1):65–
Hierarchical multi-label classification of text with 95.
capsule networks. In Proceedings of the 57th An-
nual Meeting of the Association for Computational Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang,
Linguistics: Student Research Workshop, pages X. Tian, D. Zhu, H. Tian, and H. Wu. 2019. Ernie:
323–330, Florence, Italy. Association for Computa- Enhanced representation through knowledge inte-
tional Linguistics. gration. arXiv:1904.09223.
Mikel Artetxe and Holger Schwenk. 2018. Mas- Ted Underwood. 2014. Understanding Genre in a
sively multilingual sentence embeddings for zero- Collection of a Million Volumes, Interim Report.
shot cross-lingual transfer and beyond. CoRR, figshare.
abs/1812.10464.
Denny Vrandecic and Markus Krötzsch. 2014. Wiki-
Isabelle Augenstein, Sebastian Ruder, and Anders data: a free collaborative knowledgebase. Commun.
Søgaard. 2018. Multi-task learning of pairwise ACM, 57(10):78–85.
sequence classification tasks over disparate label
Pu Patrick Wang, Jingjie Hu, Hua-Jun Zeng, and Zhi-
spaces. In Proceedings of the 2018 Conference
gang Chen. 2009. Using wikipedia knowledge to
of the North American Chapter of the Associa-
improve text classification. Knowledge and Infor-
tion for Computational Linguistics: Human Lan-
mation Systems, 19(3):265–281.
guage Technologies, Volume 1 (Long Papers), pages
1896–1906, New Orleans, Louisiana. Association Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He,
for Computational Linguistics. Alex Smola, and Eduard Hovy. 2016. Hierarchi-
cal attention networks for document classification.
Douglas Biber. 1988. Variation across Speech and In Proceedings of the 2016 Conference of the North
Writing. Cambridge University Press. American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
pages 1480–1489, San Diego, California. Associa-
Kristina Toutanova. 2019. BERT: Pre-training of
tion for Computational Linguistics.
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang,
of the North American Chapter of the Association Maosong Sun, and Qun Liu. 2019. ERNIE: En-
for Computational Linguistics: Human Language hanced language representation with informative en-
Technologies, Volume 1 (Long and Short Papers), tities. In Proceedings of the 57th Annual Meet-
pages 4171–4186, Minneapolis, Minnesota. Associ- ing of the Association for Computational Linguis-
ation for Computational Linguistics. tics, pages 1441–1451, Florence, Italy. Association
for Computational Linguistics.
David Y. W. Lee. 2002. Genres, Registers, Text Types,
Domains, and Styles: Clarifying the Concepts and
Navigating a Path through the BNC Jungle. Lan-
guage Learning and Technology, 5(3):37–72.
Adam Lerer, Ledell Wu, Jiajun Shen, Timothee

Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex
Peysakhovich. 2019. PyTorch-BigGraph: A Large-
scale Graph Embedding System. In Proceedings of
the 2nd SysML Conference, Palo Alto, CA, USA.

Enriching BERT With Knowledge Graph Embeddings For Document Classification

Uploaded by

Copyright:

Available Formats

Enriching BERT With Knowledge Graph Embeddings For Document Classification

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Enriching BERT With Knowledge Graph Embeddings For Document Classification

Uploaded by

Copyright:

Available Formats

Enriching BERT with Knowledge Graph Embeddings for Document

Abstract classification (Remus et al., 2019) and use the pre-

Figure 2: Model architecture used in our experiments.

Title / Author Correct Labels Predicted Labels

Gelebte Wertschätzung Glaube & Ethik (I);

80 according to an ontology is an important step that

References Steffen Remus, Rami Aly, and Chris Biemann. 2019.

Adam Lerer, Ledell Wu, Jiajun Shen, Timothee

You might also like