Department of Mathematics and Computer Science
Automatically responding
to customers
Master Thesis
Rik Huijzer
Supervisor:
dr. N. Yakovets
Examination committee:
dr. N. Yakovets
dr. G.H.L. Fletcher
dr. J. Vanschoren
In the last few years artificial intelligence has considerably advanced the field of natural language processing (NLP). The graduation company is interested in whether these advances can be used to automate customer support. NLP has evolved to contain many tasks. Intent classification assigns an intent to a sentence such as ‘what is the weather in London tomorrow?’; the intent for this sentence could be ‘get weather’. Named-entity recognition (NER) extracts information from parts of the sentence: for example, ‘London’ is a location and ‘tomorrow’ is a date. Chatbots use intents and entities to understand the text written by users. This usage imposes two practical constraints on intent classification and NER: the text should be analysed in real time, and the training data typically consists of only a few dozen examples. The latter makes it an interesting problem from a machine learning perspective.
Multiple systems and services provide intent classification and NER. Classification accuracy differs per system; higher accuracy means responding correctly to customer utterances more often. Many systems claim to make the fewest mistakes when comparing themselves to others. To validate such claims a benchmarking tool is created. The tool aims to make comparisons in such a way that users can easily run new evaluations or re-run existing ones. The code can be extended to compare more datasets and systems.
To improve the accuracy of intent classification and NER, deep learning architectures for NLP have been investigated. New accuracy records are set every few months for various NLP tasks. One of the most promising systems at the time of writing, Google BERT, is considered. BERT uses context from both sides of a word to predict the meaning of that word. For example, the meaning of the word ‘bank’ differs between ‘river bank’ and ‘bank account’. BERT has shown state-of-the-art results on eleven NLP tasks. An attempt is made to apply the model to intent classification. Compared to baseline models used in industry, this yielded a significant increase in running time, but not in accuracy. A second attempt trained the system jointly on intent classification and NER. BERT is well-suited for joint training because it uses context in all hidden layers of the network: information from the intent classification task is used, in all layers, for making NER predictions and vice versa. It is shown that joint training with BERT is feasible and can lower training time compared to training BERT separately for each task. Future work is needed to determine whether the improvements in accuracy are significant.
This master thesis is written to complete the Computer Science and Engineering master at Eindhoven University of Technology. The project is carried out at Devhouse Spindle in Groningen.
The thesis concludes my education in Eindhoven. I consider choosing this study one of the best decisions of my life. Getting good grades for assignments and exams was tough, but grading was fair. Teachers were always busy, but also always willing to help. I would like to thank Nikolay Yakovets for guiding the project. I appreciate that Nikolay allowed me to choose a subject that does not directly aid his research. Furthermore, I would like to thank Devhouse Spindle for allowing me to graduate at their company. Little could be improved about the workplace, colleagues and office in general. I would like to thank my graduation tutor Jukka Koivunen for the guidance and for helping me fix bugs related to object-oriented code. Thanks to Jukka and Hylke for giving feedback on the thesis. Thanks to all colleagues for the relaxed work environment, the advice, and for helping me fix bugs I would otherwise have been stuck on for many hours.
In general I would like to thank Rodger, Erik, Mark, Leon, Stan, Astrid, Sjors, Nol, Tom, Suzanne, Reggie, Imke, the student ice hockey team and the student rugby team for helping me get through the study. Finally, special thanks go out to my parents and siblings for motivating me to aim high.
Rik Huijzer
1 February 2019
Contents

List of abbreviations

1 Introduction
  1.1 Thesis context
  1.2 Problem description
  1.3 Project goal and outline

2 Preliminaries
  2.1 Natural language processing
    2.1.1 Language model
    2.1.2 Machine translation
    2.1.3 Natural language understanding
    2.1.4 F score
  2.2 Deep learning
    2.2.1 Vanishing gradient problem
    2.2.2 Recurrent neural networks
    2.2.3 Gated recurrent units and LSTMs
    2.2.4 Bidirectional recurrent neural networks
    2.2.5 Convolutional neural networks
    2.2.6 ELMo
    2.2.7 Transformers

3 Benchmarking
  3.1 Datasets
    3.1.1 Format
    3.1.2 Available datasets
  3.2 Systems
    3.2.1 Rasa
    3.2.2 DeepPavlov
    3.2.3 Cloud services
  3.3 Tool and results
    3.3.1 Overview
    3.3.2 Benchmark results
  3.4 Observations
    3.4.1 Benchmarking system
    3.4.2 Methodology

4 Improving accuracy
  4.1 Search
    4.1.1 Duplicate finding
    4.1.2 Using data
    4.1.3 Kaggle
    4.1.4 Meta-learning
    4.1.5 Embeddings
  4.2 BERT
    4.2.1 Model description
    4.2.2 Training
    4.2.3 Joint training
    4.2.4 BERT joint training
  4.3 Results
  4.4 Implementation improvements

5 Conclusions

Bibliography

Appendix
  A bench usage
  E improv
    E.1 Usage
    E.2 TPU and the Estimator API
    E.3 Additional experiment
  F Runs
Introduction
This master thesis is the final report of the graduation project for the Computer Science and Engineering master at Eindhoven University of Technology (TU/e). The project is carried out at Devhouse Spindle in Groningen. In this chapter the research questions are presented. Section 1.1 explains the context for the research problem. The problem and the research questions are discussed in Section 1.2. The resulting goals are listed in Section 1.3, together with an outline for the rest of the thesis.
classification to understand the intention of a user when the user utters some sentence [8, 13, 108].
The IBM sales department claims that, by using chatbots, Autodesk cut its resolution time “from 1.5 days to 5.4 minutes for most inquiries” [46].
Various parties run benchmarks and use them to draw conclusions about which system performs best. Several issues call the validity of these conclusions into question. A methodology
and three datasets for benchmarking intent classification and named-entity recognition (NER) are
created and published by Braun et al. [10]. NER aims to detect information like dates, locations
and person names from text. The paper compares accuracy for Microsoft LUIS [69], IBM Wat-
son Conversation (now Watson Assistant [3]), Api.ai (now Google DialogFlow [39]), Wit.ai [101],
Amazon Lex [63] and Rasa [8]. Given that the field is rapidly advancing [105], it is clear that the scores from this paper are outdated. Snips [85] show that they outperform
the competition by a large margin [23]. The competition consists of Api.ai, Wit.ai, Luis.ai and
Amazon Alexa (now Amazon Lex [63]). Their small benchmark tests the systems against 70
queries per intent on their own dataset. Snips claim to score 79% accuracy, while the second
best scores 73%. Snips also use example sentences to show that some named entities are classified incorrectly by the other systems. Although the authors “guarantee transparency” about the benchmark [24], the dataset could still be cherry-picked. DeepPavlov [13] reports another
high score for intent classification. It is based on the Snips dataset [24] and compared against
Api.ai, Watson Conversation, Microsoft LUIS, Wit.ai, Snips, Recast.ai (now SAP Conversational
AI [1]) and Amazon Lex. Their model uses embeddings trained on the DSTC2 dataset [6, 5].
DSTC2 contains communications with users or ‘calls’ [42]. The dataset includes roughly 500 dia-
logs with 7.88 turns on average in each condition, for 6 conditions [42], hence about 20,000 turns
or utterances. Knowing that the focus for Snips also lies in interpretation of voice commands [85]
it is expected that the model created by DeepPavlov does not obtain state-of-the-art results for
other datasets. Botfuel [9] claims to be ‘on par’ with the competition [92]. This is based on
runs on the same datasets as Braun et al. [10]. Botfuel reports a score one percentage point lower than Watson, equal to LUIS, and higher than DialogFlow, Rasa, Snips and Recast.ai. The score for Rasa matches the score listed by Braun et al. [10], which suggests that Botfuel compared their system against an old version of Rasa. These observations give rise to the following research question.
RQ1. Can an open-source benchmarking tool for NLU systems and services be created?
An interesting problem from an academic point of view is increasing accuracy. The second
research question aims to do that.
RQ2. Can the classification accuracy for NLU systems and services be increased?
For the graduation company Dutch datasets would match the use case; however, the focus in NLP research is on English datasets [14, 105]. To be able to compare our results, this thesis also focuses on English datasets. It is expected that the knowledge gained from answering this question can be transferred to Dutch datasets, since modern multilingual models exist [88, 89, 32]. The first and second research questions are discussed in Chapters 3 and 4 respectively.
RG1. Develop an open-source reproducible tool for benchmarking of NLU systems and ser-
vices.
The first research question is discussed in Chapter 3. NLP and NLU are introduced in detail
in Chapter 2, specifically in Section 2.1. The second research question is discussed in Chapter 4.
To be able to compare the introduced deep learning model with existing models, one needs to know
about the existing models. Well-known models in NLP are explained in Chapter 2, specifically in
Section 2.2.
Preliminaries
Both research questions are related to natural language processing, which is introduced in Sec-
tion 2.1. Deep learning forms the basis for state-of-the-art systems in the field. Well-known deep
learning models and concepts for NLP are introduced in Section 2.2.
• Optical character recognition attempts to recognize characters in images and can use NLP knowledge to improve accuracy.
• Coreference resolution finds co-references such as ‘house’ and ‘it’ in the sentence “The house is white, and it is located on a hill.”
• NLP is not limited to text: it includes speech recognition, which transforms speech to text.
• Entailment classification contains examples such as “People formed a line at the end of Pennsylvania Avenue.”, which is entailed by (logically implied by) “At the other end of Pennsylvania Avenue, people began to line up for a White House tour.” [100]
• Determining whether two sentences have the same meaning is called semantic text similarity.
• Sentence classification is the broad task of classifying a sentence (for example, by the sentiment or intention of the user).
Intent classification, machine translation and named-entity recognition are discussed further in Sections 2.1.2 and 2.1.3.
This is called the bigram model. When using this to generate sentences it becomes clear that
bigrams do not have enough information. Take the generated sentence “I cannot betray a trust
of them.” [59]. Each pair of sequential words is correct, while the sentence as a whole is not. To
improve on this, systems can use n-grams, where a larger n gives a more accurate language model. Although n-grams offer good performance in certain cases, in practice they are not able to capture long-distance dependencies in texts [68].
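As a minimal illustration, a bigram model can be estimated by simply counting adjacent word pairs; the toy corpus below is made up and only serves to show the counting.

from collections import Counter, defaultdict

# Minimal bigram language model estimated by counting (illustrative toy corpus).
corpus = "i cannot betray a trust . i cannot do that .".split()

bigram_counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[w1][w2] += 1

def p(w2, w1):
    """Estimate P(w2 | w1) from the bigram counts."""
    total = sum(bigram_counts[w1].values())
    return bigram_counts[w1][w2] / total if total else 0.0

print(p("cannot", "i"))   # 1.0: 'cannot' always follows 'i' in this tiny corpus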
requires a linguistically trained staff” [90]. In an attempt to visualise the progress made in the
field we consider one example translation through time as recorded by Manning and Socher [68].
This ‘one sentence benchmark’ contains one Chinese to English example which is compared with
Google Translate output. The correct translation for the example is:
In 1519, six hundred Spaniards landed in Mexico to conquer the Aztec Empire with a popula-
tion of a few million. They lost two thirds of their soldiers in the first clash.
2009 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the
first two-thirds of soldiers against their loss.
2011 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the
initial loss of soldiers, two thirds of their encounters.
2013 1519 600 Spaniards landed in Mexico to conquer the Aztec empire, hundreds of millions of
people, the initial confrontation loss of soldiers two-thirds.
2015 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the
first two-thirds of the loss of soldiers they clash.
2017 In 1519, 600 Spaniards landed in Mexico, to conquer the millions of people of the Aztec
empire, the first confrontation they killed two-thirds.
One important concept in machine translation is alignment. Alignment refers to the fact that
words in different languages tend to be located at similar parts in the sentence. Consider the
sentences
and
“ja ich denke wenn wir das hinkriegen an beiden tagen acht uhr”.
The first five words of the sentences are perfectly aligned. The last five words of the sentences are
not. Alignment is visualised in Figure 2.1. Non-perfect alignment can also be observed from the
fact that the number of words in the English sentence is higher.
Figure 2.1: Word alignment for a German-English sentence pair [96, Figure 1].
In this sentence the intent of the user is to book a ticket. Often the chatbot needs to know more
than just the intent. For this book ticket example the system needs to know the destination of
the user and when the user wants to arrive. Named-entity recognition (NER) can be used to find
this information. A named-entity classifier can be trained to classify London as a destination and
tomorrow as a date. Most systems allow entities to be defined by examples and regular expressions.
The examples can be used for keyword matching by a simple system. More sophisticated systems
use machine learning and language knowledge to not only find exact (keyword) matches, but also
texts similar to the examples.
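For illustration, the result of analysing such a sentence could be represented roughly as follows; the field names and confidence value are made up, and each real system uses its own schema.

# Illustrative shape of an NLU parse result (hypothetical schema).
parsed = {
    "text": "I want to book a ticket to London tomorrow",
    "intent": {"name": "book_ticket", "confidence": 0.93},
    "entities": [
        {"entity": "destination", "value": "London"},
        {"entity": "date", "value": "tomorrow"},
    ],
}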
2.1.4 F score
A common way to calculate system performance for NLU is the F score, or specifically the F1
score. It is based on the confusion matrix, see Table 2.1.
Table 2.1: Confusion matrix for binary classification [86, Table 1].
The confusion matrix can be used to define precision $\pi$ and recall (or sensitivity) $\rho$ as [86]
$$\pi = \frac{tp}{tp + fp}, \qquad \rho = \frac{tp}{tp + fn}.$$
For two label sets $A$ and $B$, define
$$P(A, B) := \frac{|A \cap B|}{|A|}, \qquad R(A, B) := \frac{|A \cap B|}{|B|} \quad (\text{where } R(A, B) := 0 \text{ and } P(A, B) := 0 \text{ for } B = \emptyset), \text{ and}$$
$$F_\beta(A, B) := (1 + \beta^2)\, \frac{P(A, B) \times R(A, B)}{\beta^2\, P(A, B) + R(A, B)}.$$
Then the metrics are defined as [84]:
$$F_{\beta\text{-macro}} = \frac{1}{|L|} \sum_{l \in L} F_\beta(y_l, \hat{y}_l), \qquad F_{\beta\text{-weighted}} = \frac{1}{\sum_{l \in L} |\hat{y}_l|} \sum_{l \in L} |\hat{y}_l|\, F_\beta(y_l, \hat{y}_l),$$
where $L$ is the set of labels, $y_l$ the set of predicted samples with label $l$ and $\hat{y}_l$ the set of true samples with label $l$ [84].
In this thesis the implementation by Scikit-learn version 0.20.0 [82] is used where possible.
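For reference, the three averaging modes can be obtained directly from Scikit-learn; the labels below are made up.

from sklearn.metrics import f1_score

y_true = ["GetWeather", "BookFlight", "BookFlight", "None"]
y_pred = ["GetWeather", "BookFlight", "None", "None"]

print(f1_score(y_true, y_pred, average="micro"))     # pools all decisions
print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over labels
print(f1_score(y_true, y_pred, average="weighted"))  # mean weighted by label support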
Figure 2.2: Recurrent neural network unfolded in time [61, Figure 5].
texts. For each step the network is asked to predict the output word $o_t$ based only on the $k$ previous words $x_{t-1}, x_{t-2}, \ldots, x_{t-k}$. The prediction is compared to the correct word; if these are not equal, the loss is backpropagated. The backpropagation is then able to ‘change history’ to improve prediction accuracy.
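As a rough illustration of this recurrence, a single step of a vanilla RNN can be sketched in a few lines of Python; the dimensions and random weights are placeholders, and no output layer or training loop is shown.

import numpy as np

def rnn_step(x_t, s_prev, W_x, W_s, b):
    """Update the hidden state from the previous state and the current input word vector."""
    return np.tanh(W_x @ x_t + W_s @ s_prev + b)

d_in, d_hidden = 50, 100
W_x = np.random.randn(d_hidden, d_in) * 0.01
W_s = np.random.randn(d_hidden, d_hidden) * 0.01
b = np.zeros(d_hidden)

s = np.zeros(d_hidden)                      # initial state
for x_t in np.random.randn(7, d_in):        # seven word vectors
    s = rnn_step(x_t, s, W_x, W_s, b)       # all history is squeezed into s
# An output layer (not shown) would predict the next word from s.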
The benefit of this architecture over n-grams, as presented in Section 2.1.1, is that the inform-
ation is compressed inside the neural network. Also, there is a certain sense of importance since
the weights are not uniformly updated for all previous states (words). Take, for example, ‘the car,
which is green, broke’. For this sentence the word ‘broke’ can more easily be predicted based on
‘the car’ than on ‘which is green’. It is found that RNNs are able to capture this importance [68].
In practice RNNs do not use the complete history. The cause of this is vanishing gradients.
“In practise gradients vanish and simple RNNs become like 7-grams for most words” [68].
words $y_1, y_2, \ldots, y_{T'}$ based on the previous decoder states and $C$. This is visualised by the arrows.
The authors of the paper recognize that this approach has the same limitations as the basic
RNN. Typically words in translation sentence pairs are aligned, as described in Section 2.1.2. For
example, when generating the first word $y_1$ the algorithm mainly needs information from $x_1$, but has more recently seen the later words in the sequence $x_2, x_3, \ldots, x_{T'}$. Vanishing gradients will cause
the network to forget its far history, so based on the 7-grams claim RNNs are only effective for
translating sentences shorter than 7 words. To solve this, the authors of [17] introduced gates into RNNs, which led to gated recurrent architectures such as long short-term memory networks (LSTMs) [44] and gated recurrent units (GRUs) [17].
GRUs have gates which automatically learn to open and close for some hidden state. This can
be visualised by looking at the information which is passed through the states. The information
is captured in a matrix. In a RNN the information in the entire matrix is updated in each step.
GRUs learn to read and write more selectively, see Figure 2.5. At each time step the update consists of reading a specific subset and writing a specific subset of the matrix. In effect the
network learns to look only at the word it needs [68]. For example when translating a sentence
from German to English it will look at the verb in German to come up with the verb in English.
Writing, in effect, lets the model allocate specific parts in the matrix for specific parts of speech
(for example, nouns and verbs).
Figure 2.5: Simplistic visualisation for updating the hidden state in a GRU.
LSTMs [44] are similar to GRUs. An LSTM does not only contain update and reset gates, but also uses an internal memory state. In practice LSTMs take longer to converge than GRUs [19],
but remember around 100 words where GRUs remember around 40 words [68].
The meaning of the word ‘drinking’ changes after reading the next words in the sentence. To take
this into account bidirectional recurrent neural networks (BRNN) have been developed by Schuster
and Paliwal [80]. A BRNN contains two separate RNNs, as depicted in Figure 2.6. The paper only
considers RNNs, but the method can be applied to gated recurrent models as well [76]. One RNN
goes through elements of the sequence from left to right and the other in the reverse direction.
Training can be done by showing the network all words except for one. Both networks learn to
predict the next word given a sequence of words. Calculating the loss is done by taking the average
of the predictions of both RNNs. To reduce required computational power one simplification is
used. Suppose we want to learn from the word at location $k$, $x_k$, and have stepped through many states to reach $s_{k-1}$ and $s'_{k+1}$. Here we let the model make a prediction for $y_k$. Then we update the weights and, assuming we go forward, now want to learn from $x_{k+1}$. The RNN in state $s_{k-1}$ takes one step forward, but the RNN in state $s'_{k+1}$ has to restart from the last word in the sequence. To
solve this both RNNs make one prediction for each word and the answers of both for the entire
sequence are batched and used to update the weights.
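The gated and bidirectional variants discussed above are available as ready-made modules in common deep learning libraries. The small PyTorch sketch below (PyTorch is used here purely for illustration, the dimensions are made up) shows the difference in the states they carry.

import torch
from torch import nn

x = torch.randn(7, 1, 50)                  # a sequence of 7 word vectors (batch size 1)

gru = nn.GRU(input_size=50, hidden_size=100)
lstm = nn.LSTM(input_size=50, hidden_size=100)
bilstm = nn.LSTM(input_size=50, hidden_size=100, bidirectional=True)

out_gru, h_n = gru(x)                      # a GRU carries a single hidden state
out_lstm, (h_n, c_n) = lstm(x)             # an LSTM additionally carries a memory cell
out_bi, _ = bilstm(x)                      # one RNN per direction
print(out_bi.shape)                        # torch.Size([7, 1, 200]): both directions concatenated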
2.2.6 ELMo
Word embeddings generated by well-known models such as Word2vec [70] and GloVe [75] do
not take context into account when determining word representations. For ELMo “each token is
assigned a representation that is a function of the entire input sentence” [76]. This is done by using
a bidirectional LSTM. Word embeddings are used only to map words to vector representations.
To improve accuracy further, compared to traditional embeddings, the authors advise to use ‘the
deep internals of the network’ [76]. These internals can be used for downstream models, also
known as transfer learning. For example, the authors show that word sense disambiguation tasks
are captured by higher level LSTM states. Part-of-speech tagging or named entity recognition are
captured by lower level states. ELMo is not the first system to use context, but was obtaining
state-of-the-art empirical results on multiple non-trivial tasks at the time of publication. Another
reason for the good results is that the system is character based. Word based systems cannot
generate an embedding for a word they have not seen during training (out-of-vocabulary tokens).
In character based systems morphological clues can be used to guess the meaning of the out-of-
vocabulary words. The system has quickly become very popular. Reasons for this seem to be
the high accuracy, that the system generalizes well, and that it is integrated into the AllenNLP
open-source NLP library created by Gardner et al. [36].
2.2.7 Transformers
The main issue in the recurrent approaches is that distant information needs to pass through all
the intermediate states. In the basic RNN for each state all the information is updated, causing
less recent information to gradually disappear. Gated recurrent architectures (GRUs and LSTMs)
reduce this problem by being more selective when viewing or changing information. Transformer
networks allow the model to look at previous inputs instead of previous states [95]. For example,
suppose we are translating ‘impounded’ in the following sentence pair from the WMT’14 English-
German dataset.
The motorcycle was seized and impounded for three months.
Suppose the system has correctly predicted the German sentence up to and including ‘Monate’.
The next step is to predict ‘beschlagnahmt’. To do this the system needs mainly information
about the word ‘impounded’. Gated recurrent architectures learn to look at the previous state
in such a way that the attention is focused on ‘impounded’. This requires the information of the
word to not been overwritten during execution.
Transformers evade this overwriting problem by allowing the system to see all d previous words,
where d is 1024 for the biggest model. The only thing the transformer then needs to learn is where
to focus its attention. The information of all the previous words is stored in an array of word
vectors (a tensor). To apply focus to parts of this tensor the model learns to put a mask over the tensor: positions that should be hidden are set to minus infinity before the softmax, so they receive zero attention weight [95]. One drawback of this architecture is the required computational power. Suppose we only need one word from the d previous words. The mask will hide d − 1 words, yet attention scores still have to be computed for all d positions before they are masked. Google argues that this is not really an issue since matrix multiplication code is highly
optimized and graphic processing units (GPUs) and tensor processing units (TPUs) exist. So,
the model can relate any dependency in constant time when the range is smaller than d. This is in contrast to recurrent layers, which require linear time. When the sequence length is greater than d, computations will require more than linear time [95].
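A minimal numpy sketch of this masking idea for a single query vector (single head, made-up dimensions and visible range) may make the mechanism more concrete.

import numpy as np

def masked_attention(q, K, V, mask):
    """Scaled dot-product attention of one query over previous positions.
    mask[i] is False for positions the model is not allowed to see."""
    scores = K @ q / np.sqrt(q.shape[0])    # one score per previous position
    scores[~mask] = -np.inf                 # hide masked positions before the softmax
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                      # weighted sum of the value vectors

d_model, seq_len = 64, 10
q = np.random.randn(d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)
mask = np.arange(seq_len) < 6               # e.g. only the first six positions are visible
context = masked_attention(q, K, V, mask)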
Another benefit of transformers is that self-attention visualisations can be made more easily than in recurrent architectures. By self-attention the authors refer to attention that is used to generate context-aware word representations. An example of a transformer model correctly applying coreference resolution is shown in Figure 2.7.
Benchmarking
This chapter aims to answer the first research question. The goal is to create a reproducible
benchmark tool, simply called bench. Datasets used by the benchmark tool are described in
Section 3.1. The benchmarked systems are described in Section 3.2. bench itself is described in Section 3.3, as are the results obtained by running the benchmarks. It has been observed that bench needs further requirements to be more useful; these requirements are described in Section 3.4. Notes on using the tool are presented in Appendix A.
3.1 Datasets
Trained models are considered black boxes at the time of writing. To verify the performance of a model, data is required: it is fed to the model and the results are measured. This section describes the datasets used for benchmarking.
3.1.1 Format
The dataset format needs to be able to specify a sentence annotation and subsentence annotation.
One often used dataset for subsentence, or token, classification is CoNLL-2003 [91]. It uses the
NER task definition as described by Chinchor et al. [16]. The definition uses tags to specify
entities, for example:
Note that this constrains the text, since angle brackets (‘<’, ‘>’) cannot be used without escaping them. A less verbose annotation standard that does not constrain the text is the BIO2 annotation
standard. Its origin is unclear, but it has been adopted by at least Stanford, as seen in the GloVe paper [75]. Here sentences are annotated as follows.
I B-Person
and O
John B-Person
Doe I-Person
worked O
yesterday B-Date
. O
where B, I and O respectively mean begin, inside and outside (no annotation). Note that other
annotations, like part-of-speech tagging, are possible by adding another column of tokens. A
benefit of this annotation is that measuring performance can be done by looking at each token
annotation separately. In a tag-based format like the example above, it is unclear how to handle cases where the classifier is partly correct. Suppose only ‘susan’ is classified as a person and not her last
name ‘jones’. Metrics now have to decide how to handle this partially correct situation. In the
BIO2 annotation standard token classifications can only be correct or incorrect. One drawback is
that the standard is not easy to read for humans. A more readable format is the Rasa Markdown
format. It constrains the text by using square (‘[’, ‘]’) and round (‘(’, ‘)’) brackets to denote annotations. For example:
[I](person) and [John Doe](person) worked [yesterday](date).
Unlike BIO2, this standard does not easily allow specifying additional annotations such as part-of-speech tags. The readability of the Markdown format and the versatility of the BIO standard show that there is no single best approach.
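To make the relation between the two formats concrete, a rough converter from the Rasa Markdown annotation to BIO2 tokens could look as follows; it assumes whitespace tokenization and non-nested annotations, and is a sketch rather than the converter used in this thesis.

import re

def markdown_to_bio2(line):
    """Convert '[text](label)' Markdown annotations to (token, BIO2 tag) pairs."""
    tokens = []
    pattern = re.compile(r"\[([^\]]+)\]\(([^)]+)\)|(\S+)")
    for text, label, plain in pattern.findall(line):
        if plain:
            tokens.append((plain, "O"))
        else:
            words = text.split()
            tokens.append((words[0], "B-" + label))
            tokens.extend((w, "I-" + label) for w in words[1:])
    return tokens

for token, tag in markdown_to_bio2("[I](person) and [John Doe](person) worked [yesterday](date)."):
    print(token, tag)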
A combination of sentence annotations and token annotations is not supported by the standards
described above. For BIO one could track the sentence annotations in a separate file or put them
before or after the sentence. The former adds duplicate information while the latter makes the
file incompatible with the standard. For Rasa one could change it to a tab separated file and put
the Markdown and sentence annotation in separate columns. This is very readable and compact,
but means transforming the multi-token annotations to separate token annotations for easier
validation. Datasets which combine sentence annotations with token annotations seem to take
yet another approach: they use JSON to store all the information they have. These formats allow for easier parsing, but cannot easily be read by humans. For example, one dataset annotates entities as follows.
"entities": [{
"entity": "StationDest",
"start": 4,
"stop": 4,
"text": "marienplatz"
}]
This entity belongs to the sentence “i want to go marienplatz”. ‘start’ and ‘stop’ here assume the
sentence to be tokenized using a WordPunctTokenizer [7] having regexp \w+|[^\w\s]+. Drawbacks
are that the entity text is duplicated and that the datasets are hard to read for humans. The
sentences have to be manually tokenized for verification and the number of lines of the dataset is
an order of magnitude higher than for the Markdown format.
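As an illustration of how such token indices are meant to be interpreted, the annotated span can be recovered with the tokenizer mentioned above.

from nltk.tokenize import WordPunctTokenizer

# Recover the annotated span from the token indices in the json annotation.
tokens = WordPunctTokenizer().tokenize("i want to go marienplatz")
start, stop = 4, 4                      # values taken from the annotation above
print(tokens[start:stop + 1])           # ['marienplatz']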
Table 3.1: Number of labeled train and test sentences and unique intents and entities per dataset
incentive for sharing these datasets seems to be showing that their system performs better than
other systems. Two datasets have been published by Snips. The thesis has only used the 2017
version and not the 2016 version. The 2017 version will from now on be referred to as Snips2017.
Sentences in this dataset are typically short. They utter some command to the system, for example
for the intent ‘PlayMusic’: “i want to listen to [Say It Again](track) by [Blackstratblues](artist)”.
Full datasets can be inspected at Github2 . Summary statistics for these datasets are listed in
Table 3.1. In this table ‘None’ is not counted as an intent. The reason for specifying this is that falling back to null or to some intent during unsure predictions results in different scores for most metrics. F1 score calculations, for example, do not ignore nulls or ‘None’, but instead consider
them as a separate group. Information about the unique number of entities for Snips2017 is not
specified by the dataset authors.
3.2 Systems
Two open-source systems and some cloud services are considered for the benchmark. The open-
source systems are described in Section 3.2.1 and 3.2.2. Cloud services are described in Sec-
tion 3.2.3.
3.2.1 Rasa
Rasa [8] is an open-source system allowing users to build conversational agents. The system consists of two parts, namely rasa nlu and rasa core. The former classifies sentences and subsentences. To train the system users can specify (hierarchical) intents, synonyms and regular expressions. Hierarchical intents are a recent addition which allows the system to extract multiple
intents from a sentence. For example, it can extract ‘hi+book ticket’ from “Good morning. Can
I order a ticket to London please?”. Regular expressions can, for example, be used to detect
numbers and dates in fixed formatting. The system is actively used in production. As a result the
code is well documented and stable.
rasa core aims to handle dialogue management. This is an extension on the classifiers of
rasa nlu which aims to understand text in context. Also, it can be used to specify conversation
flow. This part remains one of the most difficult problems for conversational agents. Humans
tend to switch rapidly between topics in conversations. For example, suppose one ticket order
conversation flow contains six questions to be answered by the customer. Customers expect to
be able to switch topic during each one of these questions and then return to the flow. Enabling
this behaviour via state machines or flowcharts is cumbersome, because the number of transitions
grows quickly. One of the Rasa solutions is applying machine learning to let developers train
dialog flows interactively.
Rasa can be used via the HTTP API or via the Python API. The Python API is the most efficient: users install rasa nlu in their programming environment and call functions directly. Depending on the chosen configuration a selection of dependencies has to be installed. For the HTTP API the advised approach is to use a Docker container. This is less efficient, but more modular and does not require installing dependencies. Containers are published to Docker Hub by Rasa. Users can pull these for free and use the newest stable configuration of choice.
2 https://github.com/rikhuijzer/nlu_datasets
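A minimal sketch of the Python API route, roughly as documented for the rasa nlu versions current at the time of writing; the file names and the pipeline chosen in config.yml are assumptions.

from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

# Train an intent/entity model from a Markdown training file and query it directly.
training_data = load_data("data/askubuntu.md")          # hypothetical path
trainer = Trainer(config.load("config.yml"))            # pipeline chosen in config.yml
interpreter = trainer.train(training_data)
result = interpreter.parse("Good morning. Can I order a ticket to London please?")
print(result["intent"], result["entities"])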
Configurations are defined as a pipeline. Pipelines specify what components should be applied
to sentences and in what order. Typical pipelines contain at least a tokenizer followed by some
classifier. Pipelines are meant to be modified easily and are specified using configuration files.
In practice default pipelines often suffice for end-users. A back-end refers to the intent classifier used in some pipeline, for example ‘tensorflow’. Three back-ends are offered by Rasa via Docker Hub, namely rasa-mitie, rasa-spacy and rasa-tensorflow. rasa-mitie is the oldest and is deprecated. Training MITIE (https://github.com/mit-nlp/MITIE) takes at least a few
minutes for small datasets. This is caused by the fact that it is tuning hyperparameters during
training. On two different computers used for this thesis the MITIE Docker Hub image occasionally
hangs on various datasets. rasa-spacy is the successor of MITIE and based on spaCy [87]. In
2015 spaCy was in the top 1% for accuracy and the fastest syntactic parser [18]. spaCy (and by
that rasa-spacy) uses a language model to parse text. spaCy includes seven language models, of which the English, Spanish and French models include word vectors. The multilingual model supports only
named entities. Unlike the other two back-ends rasa-tensorflow is not based on a pre-trained
language model. This is like classifying sentences in an unfamiliar language (say Chinese) after
only seeing some examples. Rasa advises using this back-end when the training data contains more
than 1000 training examples. The benefit of rasa-tensorflow is that the back-end is language
independent and can handle domain specific data and hierarchical intents.
3.2.2 DeepPavlov
DeepPavlov [13] is similar to Rasa. Unlike Rasa, DeepPavlov aims to aid researchers in develop-
ment of new techniques for conversational agents. Being a newer system than Rasa and aimed
at researchers, the system is not yet production ready. The system only provides a Python API, which requires Python 3.6. One claimed benefit of the system is that it does not export machine learning components from other systems. A reason why the system is not well suited for production is that pipelines can download information. This means that a generic Docker image needs to download many megabytes of data each time the container is started. Manually defining new Docker images holding this information is possible, but requires some knowledge about Docker and some time to set up. For users who want to use only a few training examples, pre-trained models
are necessary. DeepPavlov by default includes DSTC 2, Wikipedia, Reddit, RuWiki+Lenta and
ELMo embeddings.
wit.ai (https://wit.ai) can be used for chatbots and has been acquired by Facebook. The system is free to use, but Wit is allowed to use the data sent to its servers.
Pat (https://pat.ai) aims to humanize AI and provides some conversational agent services.
kore.ai (https://kore.ai) focuses on intent classification and entity extraction, with the goal of replacing graphical user interfaces with chatbots.
3.3.1 Overview
Since the code does not contain classes, the high-level overview is simply a tree-like structure. This is analogous to a book, where subsections are contained in sections and sections are contained
in chapters. In the code small functions are called by larger functions and these larger functions
are called by even larger functions. For an overview this idea can be generalized to modules. An
overview for the modules of bench is roughly as follows. The ‘.py’ suffix is omitted for all elements
in the tree. Tests are also omitted.
bench
– src.utils
– src.typ
– src.dataset
  * src.datasets.corpora
  * src.datasets.snips
– src.system
  * src.systems.amazon_lex
  * src.systems.deeppavlov
  * src.systems.dialogflow
  * src.systems.mock
  * src.systems.rasa
  * src.systems.watson
– src.evaluate
  * src.results

3 https://github.com/rikhuijzer/bench
Some generic functions are listed in src.utils and used throughout the entire project. The project makes use of type hints, as introduced in Python 3.5. All NamedTuples or ‘types’ are defined in src.typ. An overview of the most important types is depicted in Figure 3.1. These types also contain enumerations or ‘Enums’. These are used in cases where function behaviour
depends on some parameter having a fixed set of options. Alternatively one could use strings
for these cases depending on user-preference. Notable is the usage of System, and by that the
usage of Corpus, in Query, SystemCorpus, Classification and F1Score. This is caused
by the fact that external systems (for example, DialogFlow) have a certain state which needs to
be passed through many functions. This state could be that the system has not yet seen the
dataset, resulting in Corpus.EMPTY, or the system has trained on AskUbuntu, which results
in Corpus.ASKUBUNTU. In Classification, for example, this is used to let some evaluation function know the context of an input sentence. This context includes from what dataset the sentence
came and what system has classified the sentence.
Figure 3.1: Overview of most important type classes (NamedTuples and Enums) and their relations
in bench. Here a line from A to B means that B is a type which includes A (B extends A).
The real work of the project is done by src.dataset, src.system and src.evaluate, as
shown in Figure 3.2. ‘Dataset’ takes input files and converts them to an internal representa-
tion as defined by src.typ. Input files here denote the original dataset files as created by the
dataset publishers. For the internal representation a Rasa Message4 is used. The benefit of
this is that it avoids defining the same structure and that it can be used in combination with
Rasa code. For example, src.dataset.convert message to annotated str uses Rasa code
to print the internal data representation as a sentence in Markdown format (Section 3.1.1). Next,
the data reaches src.system. Here it is passed to the system under consideration, either in
training or prediction mode. For the predictions this is done by finding out which function can
convert src.typ.Query to src.typ.Response. When, for example, Rasa is under consideration, the function src.systems.rasa.get_response is called. DeepPavlov would be handled by src.systems.deeppavlov.get_response. PyCharm is known to have the best type inference
for Python. The IDE is not yet able to infer function type for a function mapping, even when
all functions have the same input and output type. A workaround is to manually define the type
of the function returned by the mapping as func: Callable[[tp.Query], tp.Response] = · · · .
src.evaluate takes all responses (tp.Response) and evaluates the performance of the system under consideration. Printing the F1 score is a matter of three functions and about a dozen lines of code. At one point more advanced logging was included, which is responsible for the remaining 12 functions and 110 lines of code.
Figure 3.2: Flow of data in bench: a dataset is read by src.dataset, passed to the system under consideration via src.system and evaluated by src.evaluate, which produces an F1 score.
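A strongly simplified sketch of the dispatch pattern described above; the NamedTuples and functions below are reduced stand-ins for the definitions in src.typ and src.systems, not the actual bench code.

from typing import Callable, Dict, NamedTuple

class Query(NamedTuple):
    system: str
    text: str

class Response(NamedTuple):
    intent: str

def rasa_get_response(query: Query) -> Response:        # stand-in for src.systems.rasa.get_response
    return Response(intent="...")

def deeppavlov_get_response(query: Query) -> Response:   # stand-in for src.systems.deeppavlov.get_response
    return Response(intent="...")

GET_RESPONSE: Dict[str, Callable[[Query], Response]] = {
    "rasa": rasa_get_response,
    "deeppavlov": deeppavlov_get_response,
}

def get_response(query: Query) -> Response:
    # The type of the mapped function is declared explicitly to help the IDE.
    func: Callable[[Query], Response] = GET_RESPONSE[query.system]
    return func(query)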
Table 3.2: Micro F1 scores for intent classification. One score is missing due to a bug in bench.
The paper remarks that “For our two corpora, LUIS showed the best results, however, the
open source alternative RASA could achieve similar results” [10]. When considering only intents
this does not hold: Watson Conversation has very similar results, and in fact slightly higher scores on two out of three datasets. The MITIE back-end outperforms the spaCy back-end in terms of accuracy. This does not support Rasa's choice to deprecate MITIE; the deprecation is presumably motivated by the facts that training MITIE takes more time than spaCy and that MITIE tends to freeze during training. It is also interesting that the accuracy for Watson Conversation has dropped. The cause can only be guessed, since IBM does not provide information about the Watson back-end. It could be that the calculations for bench and the paper differ. Alternatively, it could be that the back-end for Watson has changed. The datasets under consideration are small, so it might be that Watson has chosen a back-end better suited for large datasets. Note that IBM is aimed at large companies, which have the resources to create many training examples.
3.4 Observations
bench requires some further improvements, as explained in Section 3.4.1. The tool design was guided by the methodology presented by Braun et al. [10]. Section 3.4.2 points out some observations which could improve the proposed methodology.
constantly switching systems they should also take other factors into account. These factors can
include pricing, memory usage, classification speed, privacy (whether open-source) and in-house
API preferences.
In summary, we define the following requirements. The tool should:
• be continuously maintained,
• offer an API key for each service, or let users add their own keys,
3.4.2 Methodology
Creating a benchmarking tool has resulted in more insight into intent classification. This helped
in identifying improvements for the methodology proposed by Braun et al. [10]. As discussed in
Section 3.1.2 falling back to ‘None’ or a random intent changes F1 score. In the paper Chatbot
does not have a ‘None’ intent, while WebApplications and AskUbuntu do. Furthermore, the conclusions drawn about some systems being more accurate than others seem insubstantial. The conclusion that the accuracy of a system depends on the domain seems convincing, but is poorly grounded. The reason is that both conclusions are based on the F1 score.
In the paper, the F1 score is calculated as the micro-averaged F1 score. Such a score does not take classes of different sizes, so-called class imbalances, into account. This is combined with a situation where intents and entities are given the same weight. For WebApplications there are in total 74 labeled intents and 151 labeled entities. AskUbuntu contains 128 labeled intents and 123 labeled entities. So, when using micro F1 on AskUbuntu the score is based roughly equally on intents and entities, while for WebApplications the score is based for about one third on intents and two thirds on entities. This could mean that some system has scored significantly better than others simply because it labels entities in WebApplications particularly well. Another observation is that users interested in either intent or entity classification are not well informed. It seems better to use macro or weighted F1 and to report separate intent and entity scores. Macro averaging (and by extension weighted averaging) is better suited to “get a sense of effectiveness on small classes” [81].
Rasa, for example, uses the weighted average in their evaluation according to their code on Github.
Another reason for caution with regard to the presented F1 scores is the probabilistic nature of
neural networks. Although inference (classification) is deterministic, training is not. During train-
ing, models often start with random weights. Different random initializations can end up in different local minima for the same training data, which can change the inference results. During benchmarking this effect has been observed for Rasa using the spaCy back-end. According to an email from the main author, Rasa 0.5 with the MITIE back-end was used for the results described in the paper. The MITIE back-end has not been observed to change accuracy after re-training the model. Microsoft LUIS
and Google Dialogflow also did not show a change in accuracy after re-training. So, it could be
that all the systems in the benchmark were deterministic. Still, the problem of not considering
the possibility of changing accuracy persists for the methodology.
Improving accuracy
The goal is to improve the classification accuracy for natural language understanding, specifically
intent classification. A search is conducted in Section 4.1 to find ways to improve the accuracy.
BERT is deemed to be the most promising and is discussed in Section 4.2. The section about
BERT describes the model and how it is implemented for this thesis. Accuracy scores for BERT
are listed in Section 4.3 and compared to a baseline.
4.1 Search
The field of NLP is rapidly evolving due to the introduction of deep learning [14]. Systems which obtain state-of-the-art (SOTA) accuracy are often surpassed within a few months. A recent example of this is ELMo [76], published in March 2018, which has been surpassed [105] by BERT [32]
in October 2018. A search is conducted to improve the accuracy of the existing systems. The
research for this part has not been systematic. The method for finding an improvement is based
on coming up with ‘novel’ approaches to improve accuracy. After having such a ‘novel’ approach,
the literature is consulted. This search method relies on the assumption that papers have done
their research and will provide proper related work. This section will explain the considered ideas
and related literature.
achieved by establishing an emotional connection with the user. This system, which was created by a research lab and is used by 660 million users, does not automatically use the data to learn.
The authors manually optimize the engagement of the system. From this it is concluded that
reinforcement learning and automatic data wrangling are not yet feasible approaches to increase
accuracy.
4.1.3 Kaggle
Kaggle1 is a well-known site in the machine learning domain. On this site a framework exists where datasets can be published. The site shows, among other things, statistics, a comments section and a scoreboard. It is famous for hosting ‘competitions’, where the person or team obtaining the highest accuracy for some task gets prize money from the dataset host. Kaggle provides a way for machine learning enthusiasts to communicate. People who obtain top-20 scores on difficult
tasks tend to explain their pipeline in a blogpost. Research papers tend to focus on designing the
best deep learning architectures. The Kaggle explanations are valuable sources for learning how
to get the most out of the architectures.
One such post [57] uses three embeddings, namely GloVe, FastText and Paragram. The author argues that “there is a good chance that they [the embeddings] capture different type of information from the data”. This is a form of ensembling: predictions from the models built on the different embeddings are combined by taking the average score. A threshold is set to remove answers where the model is unsure.
This method could be used to improve performance for natural language understanding. Running
three systems in parallel does increase the training time, but the difference is not too large. It
would be interesting to test whether averaging can be replaced by a more involved calculation.
Meta-algorithms such as boosting, bagging and stacking are not investigated further since the
improvement is expected to be insignificant.
4.1.4 Meta-learning
Meta-learning “is the science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much faster than otherwise possible” [94]. Few-shot learning aims to learn useful representations from a few examples. In practice most intent classification systems use few examples, so few-shot learning is relevant to the research question.
This was also concluded by the IBM T. J. Watson Research Center [106]. The authors show that
their system outperforms other few-shot learning approaches. They do not compare their system
against natural language understanding solutions and conclude that their research should be ap-
plied to other few-shot learning tasks. This implies that natural language understanding specific
systems obtain higher accuracies. Automatically tuning hyperparameters, as done in TensorFlow's AutoML, is based on the work by Andrychowicz et al. [2]. Industry claims that AutoML obtains 95% of the accuracy of hand-tuned hyperparameters. Another problem is that it does not scale well [49]. Transfer learning approaches like MAML [34] and Reptile [72] could be useful for intent classification as well. Different domains require different models, and Reptile seems interesting as a way to train a model on one domain and then easily switch the model to other domains. However, this would introduce a lot of complexity in the code; it would be more convenient to use a model which works on all domains.
4.1.5 Embeddings
Embeddings capture knowledge about language and use that for downstream tasks. There appears
to be a consensus about the timeline of embeddings evolution. GloVe [75] was superseded by
FastText [50]. The Facebook FastText embedding aims to be quick, allowing it to be used as a baseline. With 157 languages (https://fasttext.cc/) it is a multilingual model. Another
1 https://www.kaggle.com
state-of-the-art and easy to implement embedding is the universal sentence encoder [15]. The word
‘universal’ denotes that the system has used a supervised training task which has been chosen such
that the embeddings generalize to downstream tasks. Not only Google, but also Microsoft research
is working on multi-task learning [89]. These embeddings are not enough to improve on existing
systems, since Rasa is using the universal sentence encoder [99]. One step further would be to let
the model decide what embedding it wants to use [56]. A caveat is the fact that one then needs
to implement multiple embeddings (even when the model decides that only one embedding should
be used).
4.2 BERT
At the start of October 2018 Google published their NLP model called Bidirectional Encoder
Representations from Transformers (BERT) [32]. The authors show it is able to score state-of-
the-art (SOTA) results for eleven NLP tasks. A comparison by Young et al. [105] shows ELMo [76]
outperforms various SOTA models on six distinct non-trivial NLP tasks. The comparison [105]
continues by showing that BERT gets higher accuracy scores than ELMo for all six tasks. This by
transitivity means that BERT obtains the highest accuracy scores at the time of writing. BERT being SOTA is also supported by a maintained leaderboard for the Stanford Question Answering Dataset (SQuAD) [79].
Figure 4.1: Comparison of flow of information in the layers of various recent pre-training model
architectures [32, Figure 1]. Note that “only BERT representations are jointly conditioned on
both left and right context in all layers” [32].
Another benefit of BERT is that a wide range of pre-trained models is provided. The basic models presented in the BERT paper are BERTBASE and BERTLARGE. BERTBASE is “chosen to have an identical model size as OpenAI GPT for comparison purposes” [32]. BERTLARGE obtains higher accuracy on most tasks and has 340 million parameters, compared to 110 million for BERTBASE. The BERT GitHub repository [28] lists some more models, namely uncased and cased variants of BERTBASE and BERTLARGE.
In general uncased models suffice, but for certain tasks (for example, NER) performance can be
increased by using a cased model [28]. Also, they provide BERTMULTILINGUAL and BERTCHINESE .
The multilingual model is trained on the 100 languages having the most Wikipedia pages [31].
4.2.2 Training
From now on training is used to denote fine-tuning of the model. Training the general language
model on some downstream task is presented as being inexpensive [28]. Relative to the pre-training
it is. Experiments show that fine-tuning with default hyperparameters will run out of RAM on
a 16 GB RAM machine. Lowering the batch size reduces the memory usage, but running a few
training steps still takes at least a few hours.
To train the model on some tasks it is advised to run “a few epochs” [28]. Based on the example
code provided by Google researchers the number of epochs is 3 and the number of training examples
is about 1000 [4]. So, it is advised to show the system 3000 examples. For our smaller datasets of
around 50 examples this means running 3000/50 = 60 epochs. When measuring the training time
in steps it means running 3000/16 ≈ 188 steps for a batch size of 16. Preliminary experiments
on the AskUbuntu dataset (having 53 training examples) with a batch size of 32 confirm this
estimate, see Figure 4.2. The plots show that the system does not converge smoothly, and can even have a sudden drop in performance. One possible explanation for the performance drop is that the model moved into a non-generalizing local minimum.
The results are interesting because they show that the model is able to learn something even from a dataset with only tens of training examples. Training the model for 5 steps, or roughly 80 examples, takes at least a few hours on a modern computer. Extrapolation suggests that training 188 steps will take at least 36 hours, which is impractical when doing experiments.
According to the paper the benefit of the Transformer models is that they are highly paralleliz-
able. Training BERT consists mainly of matrix multiplications [27]. These can be done quickly and
efficiently on graphic processing units (GPUs) and tensor processing units (TPUs). The latter are
ASICs created by Google specifically to do machine learning inference [51] and contain 64 GB of
RAM [28]. When using the TensorFlow implementation of BERT, GPUs with 16 GB of RAM are
required [28]. GPU optimizations are available in the PyTorch implementation provided by Wolf
et al. [102], but PyTorch does not support TPUs at the time of writing. Prices for these GPUs
are at least a few thousand euros, which means most users and companies resort to cloud services.
Google Colab [38] provides free access to a GPU and a TPU instance. The code which uses Google Colab for BERT is based on an example implementation provided by Google [4].
Using Colab is a compromise between usability and costs. The costs are limited to the use of some storage in a Google Cloud Bucket. Usability is hindered by the usual constraints of online Jupyter Notebook editors, for example no unit tests, no autocomplete and poor Git integration. To overcome these issues most of the code is written and tested locally and pushed to a Github repository called improv, see Appendix E. In the Colab the code is then pulled from the repository and the main functions are called. Using Colab has benefits as well. Hyperparameters and output are visible in one document and can easily be modified in the Notebook, which eases verification. Reproducibility is achieved by opening the Notebook and running all cells. The first cell will ask to link the Colab to a Google account; make sure this account has access to a Google Cloud Bucket.
The plots in Figure 4.2 are created using TensorBoard, the default TensorFlow visualisation tool. Generating these plots is done by specifying a model and metrics using the TensorFlow Estimator API. The plots are not generated for the rest of the runs, for reasons explained in Section E.2. For the rest of this document all results are for the BERTLARGE model, since BERTBASE was only created for a fair comparison with OpenAI GPT [32].
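To illustrate the Estimator API, the sketch below shows how a model and its evaluation metrics can be specified so that TensorBoard can read the resulting event files. The toy linear model, the model directory and the random data are assumptions made for this example; the actual BERT model_fn is considerably larger.

import numpy as np
import tensorflow as tf  # TensorFlow 1.x

def model_fn(features, labels, mode, params):
    # Toy linear classifier standing in for the real model.
    logits = tf.layers.dense(features['x'], params['num_classes'])
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    predictions = tf.argmax(logits, axis=-1)
    if mode == tf.estimator.ModeKeys.EVAL:
        # Metrics defined here are written to the event files read by TensorBoard.
        metrics = {'accuracy': tf.metrics.accuracy(labels, predictions)}
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)
    train_op = tf.train.AdamOptimizer(1e-4).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir='/tmp/toy_model',
                                   params={'num_classes': 2})
x = np.random.rand(64, 4).astype(np.float32)
y = np.random.randint(0, 2, 64)
input_fn = tf.estimator.inputs.numpy_input_fn({'x': x}, y, batch_size=16,
                                              num_epochs=1, shuffle=True)
estimator.train(input_fn)
estimator.evaluate(input_fn)

Pointing TensorBoard to the model directory (tensorboard --logdir /tmp/toy_model) then shows the loss and accuracy curves.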
2. Regular expression based intent and entity featurizer (for example able to featurize phone
numbers),
For this pipeline the Stanford Named Entity Recognizer and scikit-learn classify separately.
Preferably one would have one model which could learn to do the entire pipeline, also known as
an end-to-end model. End-to-end models have two benefits. Firstly, an end-to-end model avoids
feature engineering and data pre-processing [67]. Secondly, end-to-end models can obtain higher
accuracies because (semi-)optimal features are found automatically.
That the combination improves independent models has been shown by Ma et al. [66] and Zhang
et al. [107]. The results for the former are obtained by using a LSTM network. The latter
introduces an algorithm to combine hidden states from an LSTM. They show this for the more
general problem of sequential labeling and classification.
Figure 4.3: Two of the four single sentence tasks presented in the BERT publication [32, Figure 3].
Intuitively the improvement was to be
expected for the following reason. Suppose we are trying to classify a dataset which contains the sentence:
‘Book a flight to London tomorrow.’
The sentence has intent ‘BookFlight’. Training the model could be simplified by providing the sentence classifier with:
‘Book a flight to [location] [date].’
Now the model does not have to learn to classify sentences while also learning that London is a location and that tomorrow is a date.
Note that an end-to-end model is preferred over two separate models. At the time of writing NER classifiers do not obtain perfect accuracy. This means that some classifications will be incorrect, so the example from above could instead be converted with incorrectly labeled entities. This could make the intent classifier drop in accuracy. In an ideal end-to-end model incorrect NER classifications would be less of an issue. The model would learn to ignore the named-entity recognition output if it does not increase accuracy.
The intent is predicted from the first output position A1, while for the (non-)entity words in the sentence the model can use A2, A3, ..., An. Typically sentences are much shorter than 128 tokens, so enough space should be available in A2, A3, ..., An. To allow for more space the max sequence length can be increased, although this also increases training and inference time.
To do this the input for the model has been changed from:
text: ['how', 'do', 'i', 'disable', 'the', 'spam', 'filter', 'in', 'gmail', '?']
true: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-WebService', 'O']
to
text: ['INTENT', 'how', 'do', 'i', 'disable', 'the', 'spam', 'filter', 'in',
'gmail', '?']
true: ['FilterSpam', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-WebService', 'O']
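A minimal sketch of this conversion is shown below; the Example type and the function name are illustrative and not necessarily the ones used in improv.

from typing import List, NamedTuple, Tuple

class Example(NamedTuple):
    tokens: List[str]  # words of the sentence
    labels: List[str]  # IOB entity tags, aligned with the tokens
    intent: str

def to_joint(example: Example) -> Tuple[List[str], List[str]]:
    """ Prepend an artificial 'INTENT' token whose label is the intent. """
    return ['INTENT'] + example.tokens, [example.intent] + example.labels

example = Example(
    tokens=['how', 'do', 'i', 'disable', 'the', 'spam', 'filter', 'in', 'gmail', '?'],
    labels=['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-WebService', 'O'],
    intent='FilterSpam')
print(to_joint(example))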
4.3 Results
Experiments are conducted on the AskUbuntu, WebApplications, Chatbot and SNIPS2017 datasets as introduced in Section 3.1.2. Comparisons are made for a fixed number of steps (or, equivalently, epochs). The reason for this is that intermediate results are not easily reported for the BERT model, as explained in Section 4.2.2. The number of steps used for training is based on an estimate of what should be enough. Whether the model should have been trained for more epochs can be validated by looking at the loss reported in Appendix F for different numbers of steps.
For each dataset various runs of BERT are executed. During one run only intents are shown to the system and intent accuracy is measured. Another run only shows entities and measures entity accuracy. A third run shows the system both intents and entities and measures both. These methods are denoted as separate and joint. It is expected that joint training increases accuracy for both intent and entity classification. The reason for this is that the model sees more varied data and hence should be able to find a good internal representation of the data more easily. Results are listed in Table 4.1. Further experiments on the near-zero score for Snips2017 with separate intent training are located in Section E.3.
Note that separate training consists of two runs, and hence ran twice as many epochs. This seems fair, since intent or entity improvements which require twice as many training steps are not interesting for expensive models such as BERT. As a baseline Rasa 0.13.8 with the Spacy pipeline is used. At the time of writing only intent classification is implemented in the benchmark code, so entity scores for the baseline are missing. Omitting scores for other systems has been deliberate: the table is merely meant to support that joint training is feasible. A final remark is that the scores have been rounded to two decimals. The number of epochs is calculated by multiplying the number of training steps by the training batch size and dividing by the number of training examples. The training time for 600 steps is around 10 minutes; for 6000 steps it is around 20 minutes. The reason for this relatively small increase in training time is that most time is spent on training preparations.
Table 4.1: Weighted F1 accuracy scores (mean ± standard deviation, over three runs) for separate and joint training on four datasets. The NER accuracy calculation is based on the FullTokenizer [29]. Details are listed in Appendix F.
From the results it can be concluded that joint training is feasible. Joint training increases the named-entity recognition accuracy for each dataset. For intent classification joint training significantly increases accuracy compared to separate training. Compared to the rasa-spacy baseline it performs better than or similar to the baseline on all the small datasets (WebApplications, AskUbuntu, Chatbot). The accuracy is slightly lower for Snips2017. The model accuracy is near zero when training on intents separately for Snips2017, and the model accuracy varies for Chatbot. For Snips2017 it has been found that lowering the step size does not solve this problem, as listed in Table F.4. One expected reason for the poor performance when training on intent classification only is that the model gets stuck in an incorrect local minimum. This might be explained by the fact that joint training examples are more varied. A typical sentence contains 12 tokens. Per 32 labels, a joint training batch then contains about 3 labels related to intents and 29 related to entities, whereas an intent-only training batch of size 32 contains 32 labels related to intents. Hence, the data for the joint training is much more complex. This seems to indicate that the joint training forces the model to learn a more complex representation.
An important thing to note about the results is that the datasets are very small. One would expect that the large BERT model is better suited for datasets which contain more training examples. Furthermore, the experiments are based on a basic implementation. For intent classification the model as defined by Devlin [28] is used; for named-entity recognition and joint training the model by Kaiyinzhou [53] is used. Not only the batch size but also other hyperparameters can be tuned for better results. Training has used a fixed number of steps or epochs. It might be that more epochs give higher accuracies. On the other hand, it might also be that fewer epochs give similar accuracies in less training time. Other interesting hyperparameters are the max sequence length and the learning rate. Lowering the former to the expected maximum number of tokens in a sentence reduces training and inference time.
Observe that joint training generalizes to sequential labeling and sentence classification, in other words, to any combination of tasks where sentences as well as parts of sentences are classified and these tasks are correlated. The tasks, of course, are expected to be correlated for any real-world NLP dataset. It also seems that joint training can be used for multiple sequential labeling tasks (for example, NER and part-of-speech tagging) combined with a sentence classification task. Further work is needed to validate this.
Conclusions
Recent well-known artificial intelligence (AI) examples, such as Apple Siri, suggest that difficult natural language processing (NLP) tasks can be automated. The graduation company is interested in seeing whether this can be used to automate customer support. To this end various NLP tasks have been considered. Eventually it is decided that intent classification and named-entity recognition are an interesting combination of tasks for the graduation company. This combination of tasks is called natural language understanding (NLU) and is often used in chatbots to respond to users in real-time. While reading about this task it was found that various parties run benchmarks and draw conclusions, and that for each party an issue which affects the validity of its benchmark can be found. This gives rise to the following research question and goal.
RQ1. Can an open-source benchmarking tool for NLU systems and services be created?
RG1. Develop an open-source reproducible tool for benchmarking of NLU systems and ser-
vices.
The answer to the first research question is that it is possible. However, stronger requirements are needed to make the tool more useful. The tool needs to be continuously maintained to adapt to the changing APIs of NLU services and software. It would need to offer API keys, or let users add their own keys. Furthermore, it should offer better metrics. Deciding on a product depends not only on accuracy, but also on at least pricing, memory usage and running time, so the system should include more metrics to reflect these properties. Currently the accuracy metric is based on the last run. Training is probabilistic, meaning that accuracy may vary between training runs. Reporting only the last run can lead to incorrect conclusions, so the program should execute multiple runs and summarize the results. Lastly, the datasets used are small or domain specific. More and bigger datasets would allow for statistically more powerful conclusions.
Next, the tool (and the knowledge obtained by creating it) is used to work on the following research question and goal.
RQ2. Can the classification accuracy for NLU systems and services be increased?
The search for improvement has considered increasing the amount of training data and using new meta-learning algorithms and embeddings. A recent model called Google BERT [32] is expected to be the most likely candidate for increasing accuracy. The model provides a pre-trained checkpoint which has ‘learned’ about language by reading large amounts of text. The pre-trained checkpoint can then be fine-tuned on some specific NLP task using transfer learning. It is a big model, meaning that fine-tuning takes around 1.5 days on a modern computer and a few minutes on a high-end GPU. Experiments on intent classification datasets show non-significant improvements in accuracy. To improve accuracy further the model has been jointly trained on intent classification and named-entity recognition. The benefit is that named-entity information can be used to determine the intent and vice versa. The Google model is a good candidate for joint training because, unlike other recent models, it uses left and right context in all layers of the network. BERT has obtained state-of-the-art results on a wide range of tasks including named-entity recognition. These two facts imply that jointly training BERT should obtain state-of-the-art results on the joint intent classification and NER task. Basic experiments are conducted in which training BERT separately is compared to training it jointly. The experiments show that joint training is possible and, compared to separate training, obtains higher accuracies for intent classification and NER while requiring fewer training steps. Compared to a baseline the intent accuracy is equal or higher for datasets with around 100 training examples. Future work is needed to see whether the model implementation can be improved and whether the improvements in accuracy are significant.
[1] SAP Conversational AI. Build great bots in minutes. https://cai.tools.sap, 2019.
[2] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom
Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent
by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–
3989, 2016.
[3] IBM Watson Assistant. Improve customer and employee experiences with AI. https:
//www.ibm.com/cloud/watson-assistant, 2019.
[4] Sourabh Bajaj. BERT finetuning tasks in 5 minutes with Cloud TPU. https:
//colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/
bert_finetuning_with_cloud_tpus.ipynb, 2018. Accessed: 2018-12-27.
[5] Dilyara Baymurzina, Aleksey Lymar, and Alexey Sorokin. intents snips.json.
https://github.com/deepmipt/DeepPavlov/blob/0.1.5.1/deeppavlov/configs/
classifiers/intents_snips.json, 2019.
[6] Dilyara Baymurzina, Aleksey Lymar, Mary Trofimova, Yura Kuratov, and Nikolay
Bushkov. classifiers.rst. https://github.com/deepmipt/DeepPavlov/blob/0.1.5.1/
docs/components/classifiers.rst, 2019.
[7] Steven Bird and Edward Loper. NLTK: the natural language toolkit. In Proceedings of
the ACL 2004 on Interactive poster and demonstration sessions, page 31. Association for
Computational Linguistics, 2004.
[8] Tom Bocklisch, Joey Faulker, Nick Pawlowski, and Alan Nichol. Rasa: Open source language
understanding and dialogue management. arXiv preprint arXiv:1712.05181, 2017.
[9] Botfuel.io. Build enterprise grade chatbots fueled by conversational AI. https://www.
botfuel.io, 2019.
[10] Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen. Evaluat-
ing natural language understanding services for conversational question answering systems.
In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 174–
185, 2017.
[12] Jake D Brutlag and Christopher Meek. Challenges of the email domain for text classification.
In ICML, pages 103–110, 2000.
[13] Mikhail Burtsev, Alexander Seliverstov, Rafael Airapetyan, Mikhail Arkhipov, Dilyara
Baymurzina, Nickolay Bushkov, Olga Gureenkova, Taras Khakhulin, Yuri Kuratov, Denis
Kuznetsov, et al. DeepPavlov: Open-source library for dialogue systems. Proceedings of
ACL 2018, System Demonstrations, pages 122–127, 2018.
[14] Erik Cambria and Bebo White. Jumping NLP curves: A review of natural language pro-
cessing research. IEEE Computational Intelligence Magazine, 9(2):48–57, 2014.
[15] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah
Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence en-
coder. arXiv preprint arXiv:1803.11175, 2018.
[16] Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson. 1999 named entity recogni-
tion task definition. MITRE and SAIC, 1999.
[17] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bou-
gares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN
encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[18] Jinho D Choi, Joel Tetreault, and Amanda Stent. It depends: Dependency parser compar-
ison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), volume 1, pages 387–396, 2015.
[19] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555, 2014.
[20] Dan C Cireşan, Ueli Meier, and Jürgen Schmidhuber. Transfer learning for Latin and Chinese
characters with deep neural networks. In Neural Networks (IJCNN), The 2012 International
Joint Conference on, pages 1–6. IEEE, 2012.
[21] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and
Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learn-
ing Research, 12(Aug):2493–2537, 2011.
[22] Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very deep convolutional
networks for natural language processing. arXiv preprint, 2016.
[23] Alice Coucke. Benchmarking natural language understanding systems: Google, Facebook,
Microsoft, Amazon, and Snips. https://medium.com/snips-ai/2b8ddcf9fb19, 2017. Ac-
cessed: 2019-01-18.
[25] Robert Dale. Text analytics APIs, part 2: The smaller players. Natural Language Engineer-
ing, 24(5):797–803, 2018.
[26] Franca Debole and Fabrizio Sebastiani. Supervised term weighting for automated text cat-
egorization. In Text mining and its applications, pages 81–97. Springer, 2004.
[28] Jacob Devlin. bert - TensorFlow code and pre-trained models for BERT. https://github.
com/google-research/bert, 2018.
[67] Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-
CRF. arXiv preprint arXiv:1603.01354, 2016.
[68] Christopher Manning and Richard Socher. Natural language processing with deep learning.
Lecture Notes Stanford University School of Engineering, 2017.
[70] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed
representations of words and phrases and their compositionality. In Advances in neural
information processing systems, pages 3111–3119, 2013.
[71] Alan Nichol. How to write a custom Estimator model for the Cloud TPU. https://medium.
com/rasa-blog/6daf794efcd8, 2018. Accessed: 2019-01-25.
[72] Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint
arXiv:1803.02999, 2018.
[73] Taiichi Ohno. Toyota production system: beyond large-scale production. CRC Press, 1988.
[75] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for
word representation. In Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP), pages 1532–1543, 2014.
[76] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton
Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint
arXiv:1802.05365, 2018.
[77] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI, 2018.
[78] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable
questions for SQuAD. arXiv preprint arXiv:1806.03822, 2018.
[79] Pranav Rajpurkar, Robin Jia, and Percy Liang. The Stanford Question Answering Dataset (SQuAD) explorer. https://rajpurkar.github.io/SQuAD-explorer, 2019.
[80] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Trans-
actions on Signal Processing, 45(11):2673–2681, 1997.
[81] Hinrich Schütze, Christopher Manning, and Prabhakar Raghavan. Introduction to informa-
tion retrieval, volume 39. Cambridge University Press, 2008.
[86] Marina Sokolova, Nathalie Japkowicz, and Stan Szpakowicz. Beyond accuracy, F-score and
ROC: a family of discriminant measures for performance evaluation. In Australasian joint
conference on artificial intelligence, pages 1015–1021. Springer, 2006.
[89] Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. Learn-
ing general purpose distributed sentence representations via large scale multi-task learning.
arXiv preprint arXiv:1804.00079, 2018.
[90] Eiichiro Sumita and Hitoshi Iida. Experiments and prospects of example-based machine
translation. In Proceedings of the 29th annual meeting on Association for Computational
Linguistics, pages 185–192. Association for Computational Linguistics, 1991.
[91] Erik Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task:
Language-independent named entity recognition. In Proceedings of the seventh conference
on Natural language learning at HLT-NAACL 2003-Volume 4, pages 142–147. Association
for Computational Linguistics, 2003.
[92] Nguyen Trong Canh. Benchmarking intent classification services - June 2018. https://
medium.com/botfuel/eb8684a1e55f, 2018. Accessed: 2019-01-18.
[93] Jakob Uszkoreit. Transformer: A novel neural network architecture for language understand-
ing. https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html,
2017. Accessed: 2018-12-10.
[95] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural
Information Processing Systems, pages 5998–6008, 2017.
[96] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment in
statistical translation. In Proceedings of the 16th conference on Computational linguistics-
Volume 2, pages 836–841. Association for Computational Linguistics, 1996.
[99] Georg Wiese. Enhancing intent classification with the universal sentence
encoder. https://scalableminds.com/blog/MachineLearning/2018/08/
rasa-universal-sentence-encoder, 2018. Accessed: 2018-12-03.
[100] Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge cor-
pus for sentence understanding through inference. In Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies. Association for Computational Linguistics, 2018.
[102] Thomas Wolf, Victor Sanh, Gregory Chatel, and Tim Rault. pytorch-pretrained-BERT.
https://github.com/huggingface/pytorch-pretrained-BERT, 2018. Accessed: 2018-12-
27.
[103] Tom Wolf, Victor Sanh, and Tim Rault. bert - TensorFlow code and pre-trained models for
BERT. https://github.com/huggingface/pytorch-pretrained-BERT, 2018.
[104] Xuesong Yang, Yun-Nung Chen, Dilek Hakkani-Tür, Paul Crook, Xiujun Li, Jianfeng Gao,
and Li Deng. End-to-end joint learning of natural language understanding and dialogue
manager. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International
Conference on, pages 5690–5694. IEEE, 2017.
[105] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in
deep learning based natural language processing. IEEE Computational Intelligence Magazine,
13(3):55–75, 2018.
[106] Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro,
Haoyu Wang, and Bowen Zhou. Diverse few-shot text classification with multiple metrics.
arXiv preprint arXiv:1805.07513, 2018.
[107] Chenwei Zhang, Yaliang Li, Nan Du, Wei Fan, and Philip S Yu. Joint slot filling and intent
detection via capsule neural networks. arXiv preprint arXiv:1812.09471, 2018.
[108] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. The design and implementation of
XiaoIce, an empathetic social chatbot. arXiv preprint arXiv:1812.08989, 2018.
bench usage
Installation is similar to other Python projects. Pull the code and in a terminal set the current
working directory to the project folder. Install the required pip packages by running the following
command.
pip install -r requirements.txt
If one wants to check accuracy for an open-source system then run the following command.
docker-compose up
docker-compose will read ‘docker-compose.yml’ and use that information to spin up various Docker containers. All images listed in the file are available from Docker Hub, which avoids having to build the images manually. DeepPavlov has been removed from the configuration file, since it was found to be unstable, see Section 3.2.2.
After the set-up the program can be executed by running ‘bench.py’. To change which system is benchmarked, replace the first parameter of the get_system_corpus call. The prefix of this parameter is used to determine which system is being tested. Possible prefix options are ‘mock’, ‘rasa’, ‘deeppavlov’, ‘lex’ and ‘dialogflow’. Rasa and DeepPavlov will use the complete string to find a matching port in ‘docker-compose.yml’. So, based on the Docker configuration one can also specify ‘rasa-tensorflow’, ‘rasa-spacy’ or ‘rasa-mitie’. The corpus (dataset) to run the bench on is specified by an enum; see src.typ.Corpus for the possible options. When running the script in a modern IDE, autocomplete will suggest the possible corpora. Slightly more convenient would be to have the script take input arguments using sys.argv, as sketched below. After setting the two parameters the script can be executed; it will display all predictions as well as micro, macro and weighted F1 scores. The predictions and F1 scores are also written to files, see the ‘results’ folder.
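A minimal sketch of such a command line interface is shown below; the argument order and the exact call into the benchmark code are assumptions made for illustration.

import sys

if __name__ == '__main__':
    # Hypothetical usage: python bench.py rasa-spacy AskUbuntu
    system_name, corpus_name = sys.argv[1], sys.argv[2]
    print('Benchmarking {} on {}'.format(system_name, corpus_name))
    # ...continue by passing both values to get_system_corpus and the evaluation.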
Python is not a pure functional language. However, more and more functional programming constructs are being added to the language each year. This appendix explains some functional ideas used in the code, as presented by Lott [65]. Higher-order functions take or return functions; this is used to replace the factory design pattern, as explained in Section B.1. Keeping track of state without a class results in function signatures containing many parameters; these can be handled by using NamedTuples, see Section B.2. Another benefit of classes is that data can be stored, which is used for example in caching. A convenient solution for caching is described in Section B.3. Collections of data are typically transformed via loops, where each loop transforms the entire collection before moving to the next transformation. Lazy evaluation, as described in Section B.4, offers a more efficient way.
Note that get_substring_match() implements the substring matching used in the conditional code (if ’mock’ in system.name:). Since the code can return any of the functions contained in the mapping, they should all have the same signature and output type. The used IDE (PyCharm 2018.2.4) is not able to check this. Therefore, the functions in the mapping get a type hint. This allows the IDE to check types again and it allows developers to see what signature should be used for all the functions in the mapping.
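The sketch below illustrates the idea; the System type, the example functions and the exact signature of get_substring_match are assumptions and differ from the real code.

from typing import Callable, Dict, NamedTuple

class System(NamedTuple):
    name: str

# All functions in the mapping share this signature, made explicit via a type hint.
Classify = Callable[[System, str], str]

def classify_mock(system: System, sentence: str) -> str:
    return 'mock intent'

def classify_rasa(system: System, sentence: str) -> str:
    return 'rasa intent'  # placeholder for an HTTP call to Rasa

def get_substring_match(mapping: Dict[str, Classify], system: System) -> Classify:
    """ Return the first function whose key is a substring of the system name. """
    for key, func in mapping.items():
        if key in system.name:
            return func
    raise ValueError('unknown system: {}'.format(system.name))

mapping = {'mock': classify_mock, 'rasa': classify_rasa}
classify = get_substring_match(mapping, System(name='rasa-spacy'))
print(classify(System(name='rasa-spacy'), 'hello'))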
B.2 NamedTuple
Pure functions by definition cannot rely on information stored somewhere in the system. We provide one example from the code where this created a problem and show how it can be solved using NamedTuples.
The benchmarking tool communicates with a system called Rasa. Rasa starts in a default, untrained, state. To measure its performance we train Rasa and then send many sentences to the system. In general one prefers functions to be as generic as possible. It makes sense to have one function which takes some sentence, sends it to Rasa to be classified and returns all information from the response. To avoid re-training Rasa for each sentence we have to remember whether Rasa is already trained. Passing a ’retrain’ flag to the function is insufficient, since the function does not know what data Rasa should be trained on. To make it all work we need the following parameters:
system name: used to call the function which can train the specific system we are interested in.
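A minimal sketch of this approach is shown below; the field names are illustrative and not identical to the ones used in the benchmarking code.

from typing import NamedTuple, Tuple

class System(NamedTuple):
    name: str              # for example 'rasa-spacy', used to pick the train function
    data: Tuple[str, ...]  # training sentences the system should be trained on
    trained: bool          # whether the system has already been trained

def train_if_needed(system: System) -> System:
    if system.trained:
        return system
    # ...send system.data to the NLU system here...
    return system._replace(trained=True)  # returns a new, updated NamedTuple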
There is one caveat when using function caching: make sure not to use the cache to mimic state. In other words, the program should not change behaviour if the cache is removed. The reason for this is that any state introduced via the cache is similar to creating functions with side-effects, but without all the constructs from object-oriented programming.
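A minimal sketch of such function caching with functools.lru_cache is shown below; the function itself is only an illustration of the idea.

import functools
from typing import Tuple

@functools.lru_cache(maxsize=1)
def read_train_data(corpus_name: str) -> Tuple[str, ...]:
    print('reading {} from disk'.format(corpus_name))
    return ('sentence one', 'sentence two')  # stand-in for real file I/O

read_train_data('AskUbuntu')  # reads from disk
read_train_data('AskUbuntu')  # answered from the cache, nothing is printed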
my_list = []
for item in some_iterable:
    updated_item = g(f(item))
    my_list.append(updated_item)
In this code some_iterable is read and the transformations f and g are applied to each item in the iterable. The same code can be rewritten to use map as follows.
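A minimal version of that rewrite, assuming the same f, g and some_iterable, could look as follows.

my_list = list(map(g, map(f, some_iterable)))

Since map is lazy in Python 3, the items are only transformed when list() requests them, which connects to the just-in-time idea demonstrated in Appendix C.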
This appendix demonstrates the effect of using iterators instead of regular collections. The code demonstrates this by processing some fictional raw materials into a chair. The first function, called ford, is similar to a Ford factory around 1915. Each part of the assembly line just keeps producing items as long as input is coming in. After a while the other parts of the assembly line start processing the items and discover a fault in some of them. One problem of this way of working is that the factory now has a pile of incorrect items in its stock.
The second function, called toyota, is similar to a Toyota factory after 1960. Here just-in-time (JIT) manufacturing is used, as developed by Toyota [73]. Each item is processed only when the next step in the process requests it.
C.1 Benefits
Using JIT makes sense in computer programs for the following reasons. It saves memory: in each step of the process we only store one intermediate result instead of all intermediate results.
It can detect bugs earlier. Suppose you have a combination of processing steps, let us call them f and g, and you apply them to 100 items. In f we send some object to a system and get a response. In g we store the response of this API call. Suppose there is a bug in g, say the file name is incorrect, that this is not covered by the tests, and that we decide to run our program to get all the results we want. Using an approach similar to ford the program crashes after doing 100 executions of f and g. This means that the program executed 100 API calls. Using toyota the program crashes after just one API call. Here ford has in essence wasted 99 API calls.
It does not make assumptions about the caller. Suppose some function k returns an iterable and is called by l. The function l can now decide how it wants to use the iterable. For example it can be cast to unique values via set or it can be partly evaluated by using any.
C.2 Code
from typing import Iterable, List, NamedTuple
""" See README.md """

class Wood(NamedTuple):
    id: int

class Chair(NamedTuple):
    id: int

# Fictional raw materials; the item with id 1 is faulty.
materials = [Wood(id=0), Wood(id=1), Wood(id=2)]

def make_chair(material: Wood) -> Chair:
    print('processing {}'.format(material))
    return Chair(id=material.id)

def process(items: Iterable[Wood]) -> List[Chair]:
    return [make_chair(material) for material in items]

def ford():
    """ Processing all the items at once and going to the next step. """
    def remove_faulty(items: List[Wood]) -> List[Wood]:
        out = []
        for material in items:
            print('inspecting {}'.format(material))
            if material.id != 1:
                out.append(material)
        return out
    filtered = remove_faulty(materials)
    processed = process(filtered)
    print('Result of ford(): {}'.format(processed))

def toyota():
    """ Processing all the items one by one. """
    def is_not_faulty(material: Wood) -> bool:
        print('inspecting {}'.format(material))
        return material.id != 1
    # filter() is lazy: each item is inspected and processed one by one.
    processed = process(filter(is_not_faulty, materials))
    print('Result of toyota(): {}'.format(processed))

if __name__ == '__main__':
    ford()
    print()
    toyota()
C.3 Output
The output for the program is as follows.
inspecting Wood(id=0)
inspecting Wood(id=1)
inspecting Wood(id=2)
processing Wood(id=0)
processing Wood(id=2)
Result of ford(): [Chair(id=0), Chair(id=2)]
inspecting Wood(id=0)
processing Wood(id=0)
inspecting Wood(id=1)
inspecting Wood(id=2)
processing Wood(id=2)
Result of toyota(): [Chair(id=0), Chair(id=2)]
This demonstrates that iterator elements are only processed when they are requested.
The F1 score calculation by Braun et al. [10] uses micro averaging. For two reasons this appendix restricts these calculations to intents only. The first reason is that these results are compared against benchmarks from the bench project in Section 3.3.2. Secondly, micro averages could be skewed when the numbers of intents and entities differ, as described in Section 3.4.2. It is interesting to compare the differences. This appendix lists calculations for Rasa in Table D.1, Watson Conversation in Table D.2 and Microsoft LUIS in Table D.3.
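To illustrate the difference between the averaging methods, the snippet below computes micro and weighted F1 scores with scikit-learn on made-up labels (not the actual benchmark data).

from sklearn.metrics import f1_score

y_true = ['GetWeather', 'BookFlight', 'BookFlight', 'GetWeather']
y_pred = ['GetWeather', 'BookFlight', 'GetWeather', 'GetWeather']
print(f1_score(y_true, y_pred, average='micro'))     # 0.75
print(f1_score(y_true, y_pred, average='weighted'))  # approximately 0.73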
improv
The project aimed at improving accuracy is called improv and is available on Github (https://github.com/rikhuijzer/improv). One warning for people interested in reading or using this code is that it is in need of refactoring. The code is cloned from the Google BERT code, built on the Google TensorFlow library, as provided by the researchers [28].
Alternatively, code is available for the PyTorch library [103]. The PyTorch implementation is under active development, unlike the TensorFlow implementation, and includes more functionality. Features include multi-GPU, distributed and 16-bit training. These allow the model to be trained more easily on GPUs by reducing and distributing the memory used by the model. BERT contains various models, including BERTBASE and BERTLARGE. Since Google Colab does not provide a multi-GPU set-up, we need to use a TPU. This is not yet supported by PyTorch [103].
E.1 Usage
The improv code can partially be executed on a local system. However, training the model requires at least one enterprise-grade GPU, as discussed in Section 4.2.2. GPUs and TPUs are provided for free by Google Colab [38]. Using this code means importing one of the IPython Notebooks from the improv repository into Colab. Hyperparameters can be set in the Notebook, after which the code can be executed. The Notebooks require a Google account combined with a paid Google Cloud Bucket. The Bucket is used to store the checkpoints created by training the model. Newer runs listed in the ‘runs’ folder in the Github repository depend on improv, nlu_datasets and rasa_nlu. The versions used are listed in the Notebook. When errors occur, make sure that the correct versions are cloned or installed.
There are two likely reasons why TensorBoard summaries are not supported on TPUs. Firstly, it seems that the TPUs are mainly used for inference [58]. Secondly, the TPUs are a type of application-specific integrated circuit (ASIC), meaning that TensorBoard summaries may be omitted due to technical issues.
1 https://github.com/rikhuijzer/improv/blob/master/runs/snips2017/2018-12-20snipsintentbatchsize8.
ipynb
2 https://github.com/rikhuijzer/improv/tree/master/runs/2019-01-23snips
Runs
In Section 4.3 the results for the Google BERT runs are summarized. The runs on which the
summary is based are listed in this appendix. For WebApplications, AskUbuntu, Chatbot and
Snips2017 see respectively Table F.1, F.2, F.3 and F.4.
Method   Batch size   Run   Loss   Intent   Entity
Rasa     -            1     -      0.674    -
Rasa     -            2     -      0.722    -
Rasa     -            3     -      0.625    -
Table F.1: Training loss, intent classification score and NER score for WebApplications. Loss is determined using the training set. Both scores are weighted F1 as determined using the evaluation set. Source: https://github.com/rikhuijzer/improv/tree/master/runs/webapplications.
Method   Batch size   Run   Loss   Intent   Entity
Rasa     -            1     -      0.834    -
Rasa     -            2     -      0.833    -
Rasa     -            3     -      0.843    -
Table F.2: Training loss, intent classification score and NER score for AskUbuntu. Loss is determined using the training set. Both scores are weighted F1 as determined using the evaluation set. Source: https://github.com/rikhuijzer/improv/tree/master/runs/askubuntu.
Method   Batch size   Run   Loss   Intent   Entity
Rasa     -            1     -      0.981    -
Rasa     -            2     -      0.981    -
Rasa     -            3     -      0.981    -
Table F.3: Training loss, intent classification score and NER score for Chatbot. Loss is determined using the training set. Both scores are weighted F1 as determined using the evaluation set. Source: https://github.com/rikhuijzer/improv/tree/master/runs/chatbot.
Method   Batch size   Run   Loss   Intent   Entity
Rasa     -            1     -      0.991    -
Rasa     -            2     -      0.990    -
Rasa     -            3     -      0.990    -
Table F.4: Training loss, intent classification score and NER score for Snips2017. Loss is determined using the training set. Both scores are weighted F1 as determined using the evaluation set. Source: https://github.com/rikhuijzer/improv/tree/master/runs/snips2017.