Test Case Generation AI ML
ALZAHRAA SALMAN
Abstract
Software testing plays a fundamental role in software engineering as it ensures
the quality of a software system. However, one of the major challenges of
software testing is its cost, since it is a time- and resource-consuming process
which, according to academia and industry, can take up to 50% of the total
development cost. Today, one of the most common ways of generating test
cases is through manual labor: analyzing specification documents to produce
test scripts, which tends to be an expensive and error-prone process. Therefore,
optimizing software testing by automating the test case generation process can
result in time and cost reductions and also lead to better quality of the end
product.
Currently, most of the state-of-the-art solutions for automatic test case genera-
tion require the usage of formal specifications. Such formal specifications are
not always available during the testing process and if available, they require
expert knowledge for writing and understanding them. One artifact that is of-
ten available in the testing domain is test case specifications written in natural
language. In this thesis, an approach for generating integration test cases from
natural language test case specifications is designed, applied, and evaluated.
Machine learning and natural language processing techniques are used to im-
plement the approach. The proposed approach is conducted and evaluated
on an industrial testing project at Ericsson AB in Sweden. Additionally, the
approach has been implemented as a tool with a graphical user interface for
aiding testers in the process of test case generation.
The approach involves performing natural language processing techniques for
parsing and analyzing the test case specifications to generate feature vectors
that are later mapped to label vectors containing existing C# test scripts file
names. The feature and label vectors are used as input and output, respectively,
in a multi-label text classification process. The approach managed to produce
test scripts for all test case specifications and obtained a best F1 score of 89%
when using LinearSVC as the classifier and performing data augmentation on
the training set.
Keywords— Software testing, Test case generation, Natural Language Processing,
Test case specifications
Sammanfattning
Software testing plays a fundamental role in software development since it
ensures the quality of a software system. One of the biggest challenges of
software testing is its cost, since it is a time- and resource-consuming process
which, according to academia and industry, can take up to 50% of the total
development cost. One of the most common ways of generating test cases today
is through manual work, by analyzing test case specifications, which tends to be
an expensive and error-prone process. Therefore, optimizing software testing by
automating the test case generation process can result in time and cost reductions
and also lead to better quality of the end product.
Today, most state-of-the-art solutions for automatic test case generation require
the use of formal specifications. Such specifications are not always available
during the testing process and, if they are available, expert knowledge is required
to write and understand them. One artifact that is often available in the testing
domain is test case specifications written in natural language. In this report, an
approach for generating integration test cases from test case specifications written
in natural language is designed, applied, and evaluated. Machine learning and
natural language processing techniques are used to implement the approach. The
proposed approach is carried out and evaluated on an industrial testing project at
Ericsson AB in Sweden. In addition, the approach has been implemented as a tool
with a graphical user interface for aiding testers in the test case generation process.
The approach works by applying natural language processing techniques to the
test case specifications to generate feature vectors that are later mapped to label
vectors containing existing C# test script file names. The feature and label vectors
are then used as input and output, respectively, for the text classification process.
The approach managed to produce test scripts for all test case specifications and
obtained a best F1 score of 89% when LinearSVC was used for the classification
and data augmentation was performed on the training data.
Keywords: Software testing, Test case generation, Natural Language Processing,
Test case specifications
Acknowledgements
I would like to begin this thesis by expressing my deepest gratitude to my su-
pervisor at Ericsson, Sahar Tahvili, for her dedication, motivation, and guid-
ance. I really appreciate the valuable help and support that I have received
from her throughout the entire duration of this degree project.
I am also grateful to Somayeh Aghanavesi, my supervisor at KTH, for her
continuous feedback and advice. Moreover, I would like to thank my examiner
Viggo Kann for showing interest in my thesis subject and for taking the time
to review my master's thesis.
Last but not least, my profound gratitude goes to my loving family, partner and
closest friends for being there for me and supporting me, not only during the
degree project but throughout my years at KTH. None of this would have been
possible without them.
Sincerely,
Alzahraa Salman
Contents
1 Introduction 1
1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Research Question . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Test Optimization . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Test Case Generation . . . . . . . . . . . . . . . . . . 6
2.2 Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Test Case Specification . . . . . . . . . . . . . . . . . 8
2.3 Natural Language Processing . . . . . . . . . . . . . . . . . . 8
2.4 Machine Learning and NLP . . . . . . . . . . . . . . . . . . . 11
2.4.1 Text Classification . . . . . . . . . . . . . . . . . . . 11
2.4.1.1 Linear Support Vector Classifier . . . . . . 12
2.4.1.2 K-Nearest Neighbors Classifier . . . . . . . 13
2.4.2 Feature Engineering . . . . . . . . . . . . . . . . . . 14
2.4.3 Data Augmentation . . . . . . . . . . . . . . . . . . . 15
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Methods 20
3.1 Overview of the Approach . . . . . . . . . . . . . . . . . . . 20
3.2 The Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Text Analysis using NLP . . . . . . . . . . . . . . . . . . . . 23
3.4.1 Feature Extraction . . . . . . . . . . . . . . . . . . . 24
3.5 Text Classification . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . 25
3.5.2 Multi-Label Classification . . . . . . . . . . . . . . . 26
4 Results 31
4.1 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 33
4.3 The Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5 Discussion 38
5.1 Analysis of the Results . . . . . . . . . . . . . . . . . . . . . 38
5.2 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . 41
5.3 Ethics and Sustainability . . . . . . . . . . . . . . . . . . . . 43
5.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 Conclusions 47
Bibliography 49
Chapter 1
Introduction
ifications [7]. Specifications are used as one of the primary sources of infor-
mation for deriving test cases. Test case specifications are most often written
in natural languages, such as English, since these specifications need to be
easy to specify, use, and understand [3]. The fact that most of the specifica-
tions are written in natural language, makes the usage of Natural Language
Processing (NLP) techniques an appropriate methodology for generating test
cases. Promising results have been obtained when using NLP techniques for
identifying desired information from large amounts of raw data. These results have led
to increased interest in NLP techniques, especially in the field of automation
of software development activities such as test case generation [8].
One of the main applications of NLP in software testing is specification mining
which is also often used in the process of test case generation. Several previous
studies [3, 9, 10, 11] have addressed the generation of test cases from spec-
ifications. However, most of the state-of-the-art proposed solutions require
the usage of formal specifications during the process of generating test cases.
Such formal specifications are not always available during the testing process
and if available, they require expert knowledge for writing and understanding
them.
In this thesis, an approach for generating integration test cases from test case
specifications written in natural language is designed, applied, and evaluated.
Machine learning and NLP techniques are used to implement the approach
which involves mapping test case specifications to existing test scripts in C#
containing the code needed for executing the test cases. The proposed ap-
proach is applied and evaluated on an industrial testing project at Ericsson AB
in Sweden. Finally, the approach has been implemented as a tool for aiding
testers in the process of test case generation.
1.1 Purpose
The purpose of this degree project is to investigate the possibility of improv-
ing the process of software testing by incorporating NLP and machine learning
techniques for automatic test case generation. The project is conducted at Eric-
sson AB in their Global Artificial Intelligence Accelerator (GAIA) group, which
is interested in applying artificial intelligence and machine learning tech-
niques in different domains. One such domain is software testing where artifi-
cial intelligence and machine learning can be used for optimization purposes.
Research in the area of software testing is very important and beneficial since
software testing plays a fundamental role in the process of ensuring the cor-
1.3 Scope
There exist different sources of information that can be used for test case gen-
eration. Test cases can be generated from source code, binary code, UML
models, formal specifications et cetera [12]. In this degree project, the test
cases are generated by only using test case specifications that are written in
natural language. Moreover, the tool implemented in this project only gives
suggestions of existing C# test scripts for integration testing and cannot produce
unseen test scripts. The tool can be extended to generate test scripts in
any programming language and at any testing level. In such a case, mappings
to test scripts in the specific programming language must be provided.
Chapter 2
Background
Several approaches and methods for generating test cases have been presented
throughout the years. Often, test cases are derived from system requirements
and specifications. Prasanna et al. [7] divide test case generation approaches
mainly into two categories:
• Specification-based test case generation: aims at examining the func-
tionality of an application based on the specifications. This type of test-
ing is limited by the quality of the specification [24].
• Model-based test case generation: the test cases are derived from a
model that describes the functional aspects of the system under test [25].
This type of test case generation requires formal models to perform the
testing.
In this thesis, specification-based test case generation, using test case speci-
fications, is addressed. The approach of mapping test case specifications to
test scripts in C# code is an important step in fully automating the test case
generation process.
2.2 Specifications
Specifications are one of the primary sources of information for deriving test
cases. There exist several types of specifications, such as software requirement
specifications, functional requirement specifications, and test case specifica-
tions. A formal definition of a specification is provided by IEEE Standard
Glossary of Software Engineering Terminology (IEEE Std 610.12-1990) [22]:
Definition 2.2.1. The specification is a document that specifies, in a complete,
precise, verifiable manner, the requirements, design, behavior, or other char-
acteristics of a system or component, and, often, the procedures for determin-
ing whether these provisions have been satisfied.
These specification documents can be written as formal or informal specifica-
tions [26]. Although formal specifications are important for verifying the cor-
rectness of software, their usage is often not that common in software engineer-
ing practices since producing formal specifications requires specialists (e.g.
mathematicians, logicians or computer scientists/engineers) for both writing
and understanding the specifications [9]. A common alternative is producing
informal specifications, often written in natural language [26]. The require-
ments and/or steps described in specifications need to be understood by both
the users and the developers. Natural language seems to be an appropriate
Figure 2.1: NLP pipeline architecture including some common NLP steps.
There exist several important NLP steps that are often applied to text docu-
ments for analysis. Figure 2.1 shows an example of a simple NLP pipeline
consisting of some of the common NLP steps that can be applied to the speci-
fication documents during the process of test case generation. These steps can
be applied collectively or individually depending on the application under de-
velopment [8]. Text segmentation, which is considered the initial step of NLP,
is the task of dividing a text into linguistically-meaningful units [34]. Sen-
tence segmentation and tokenization are part of the text segmentation stage.
Sentence segmentation is the process of dividing up a text into sentences. To-
kenization, also known as word segmentation, involves breaking the sentence
into individual words called tokens based on a predefined set of rules such
as using white-spaces as a delimiter [30]. Text segmentation is a fundamen-
tal part of any NLP system since it is the words and sentences identified at
this stage that will be passed to later processing stages such as part-of-speech
taggers and parsers. Part-of-speech (POS) tagging is the task of assigning a
morphosyntactic tag to each word and punctuation mark in a text. POS cat-
egories of words are defined by the morphosyntactic analysis using criteria
from both morphology and syntax fields [9]. The algorithm that performs
this analysis in NLP is called the POS-tagger. The NLP process can proceed
Figure 2.2: LinearSVC with two different values of C. Support vectors are
shown as the circled points. Left: C is set to 1, resulting in a wide margin and
a higher tolerance for observations being on the wrong side of the margin.
Right: C is set to 100 which leads to a smaller margin, reducing the tolerance
for observations to be on the incorrect side of the margin. Image is taken from
[44].
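As the caption notes, larger values of C penalize margin violations more heavily and shrink the margin. The following minimal sketch, on purely synthetic two-dimensional data and assuming scikit-learn is available, illustrates this by comparing the margin width 2/||w|| for C = 1 and C = 100; it is an illustration of the concept only, not code from this project.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Two well-separated clusters in 2-D; purely synthetic data for illustration.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.5, random_state=0)

for C in (1, 100):
    clf = LinearSVC(C=C, max_iter=10000).fit(X, y)
    # The distance between the two margin hyperplanes is 2 / ||w||.
    margin_width = 2 / np.linalg.norm(clf.coef_)
    print(f"C = {C:>3}: margin width ~ {margin_width:.3f}")
```

A larger C should yield a larger coefficient norm and therefore a narrower margin, matching the behavior shown in the figure.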
generate their corresponding syntax tree. The syntax trees are mapped into a
semantic representation based on the case grammar theory during the seman-
tic analysis phase. Finally, the last phase, SCR generation, delivers an SCR
specification, which is to be used by the T-VEC tool to generate test cases.
This approach requires however that the specifications are written according
to a controlled natural language. The controlled natural language supported
by this approach is limited, containing only a few lexical categories. This
could reduce the usability of the NAT2TESTSCR strategy since requirements
tend to be written in a natural language without the restriction to any con-
trolled version of the language. The strategy is evaluated in four different do-
mains, including the turn indicator of Mercedes vehicles, and the results were
compared to a random testing technique, Randoop. The results showed that
the NAT2TESTSCR strategy outperformed Randoop.
Wang et al. [11] present an approach called Use Case Modelling for System
Tests Generation (UMTG) that supports the generation of executable system
test cases from use case specifications in natural language. NLP techniques are
used in the approach for identifying test scenarios from use case specifications
and to generate formal constraints, in the Object Constraint Language (OCL).
The formal constraints capture conditions that are used for generating test input
data. The main NLP analyses that UMTG relies on are: tokenization, NER,
POS-tagging, semantic role labeling (SRL), and semantic similarity detection.
This approach requires a domain model (e.g., a class diagram) of the system
and the use case specification, which according to the authors are common in
requirements engineering practices and are often available in many companies
developing embedded systems. The approach is evaluated in two industrial
case studies and promising results are obtained. UMTG manages to effectively
generate system test cases for automotive sensor systems, addressing all test
scenarios in the manually implemented test suite and also produce test cases
for scenarios not previously considered by test engineers.
Verma and Beg [3] argue that expressing the requirements of a system using
semi-formal or formal techniques can lead to limitations since expert knowl-
edge is required for interpreting these requirements and only a limited number
of persons possess this kind of expertise. These limitations are eliminated
by using natural language to document requirements since natural language is
understandable by almost anybody. The authors point out that the most used
method for documenting requirements is a natural language, such as English,
with around 87.7% of the documents being written in natural language. Due
to these facts, the authors present an approach for generating test cases, for ac-
Most of the proposed methods and strategies in the related literature [9, 10,
11] require the usage of formal specifications during the process of generating
test cases. In these studies, the requirements specifications written in natu-
ral language are translated into formal models that can later be used for test
case generation, whereas in this thesis only test case specifications written in
natural language are used without any translations. Test case specifications
written in natural language are an artifact that is often available in the testing
domain since natural language is easy to use and can be understood by both
the users and the developers. Formal models on the other hand require expert
knowledge to interpret. The work of Verma and Beg [3] makes use of specifica-
tions written in natural language without translations. However, only abstract
test cases are generated by their approach, as opposed to the work in this the-
sis where test scripts containing actual code are generated to execute the test
cases. Moreover, the implementation of the proposed approach in this thesis
has been inspired by the related work as some of the NLP techniques used in
the previous work are used in this thesis as well. This includes syntactic and
semantic analysis of the specifications as in [9] and parsing, tokenization, and
POS-tagging as in [3, 8, 11].
Chapter 3
Methods
This chapter describes the methodology of the proposed approach in this the-
sis. First, an overview of the approach is presented, including a pipeline con-
taining the phases of the approach. This is followed by a section where the
available data set is described. The next section focuses on the NLP tech-
niques that are applied to the data for performing text analysis. Moreover, this
section includes details about which features are extracted and for what reason.
The upcoming section describes the preprocessing techniques that are used to
enhance the quality of the data. Subsequently, the process of the classification
is described and the algorithms used are presented. The chapter ends with a
description of the performance measures that are used to evaluate the outcome
of this study.
The approach is based on using NLP techniques in order to suggest test case
scripts written in C# given test case specifications expressed in natural lan-
guage. The test case specifications are first parsed and analyzed to gener-
ate feature vectors. The feature vectors, containing keywords representing
the specification, are then mapped to label vectors consisting of C# code file
names. The feature and label vectors are used as input and output, respectively,
for the text classifier to perform training and prediction. Finally, a simple,
user-friendly interface is implemented using the Python GUI (Graphical User
Interface) package Tkinter [50]. The user is able to select a test case specifi-
cation and the tool gives suggestions of test case scripts that are relevant for
the specification in question.
A test case specification can be mapped to one or more test scripts, and a test script can be used
to execute several test cases. Thus, there is not a one-to-one mapping between
the test case specifications and the test scripts.
3.3 Tools
The approach proposed in this study was implemented using the programming
language Python due to its popularity and its useful machine learning pack-
ages. In this section, the tools that are used for conducting the experiments in
this study are presented.
Natural Language Toolkit (NLTK) [30] is one of the most used NLP libraries.
It is written in Python and contains several useful packages. Tokenization,
Stemming, Lemmatization, Punctuation, Character count, and word count are
some of the packages that are included. NLTK is open-source and compatible
with several operative systems. NLTK was used during the first stage of this
project to parse and analyze the test case specification documents.
Stanford log-linear part-of-speech tagger (Stanford POS-tagger) [51] is a
probabilistic POS-tagger developed by the Stanford Natural Language Pro-
cessing Group. The Stanford POS-tagger is used to assign parts of speech to
each word in a text, such as noun, verb and adjective. It is open-source and
widely used in applications in NLP. The Stanford POS-tagger requires Java.
Tkinter [52] is Python’s de-facto standard GUI package. Tkinter is open-
source and cross-platform, meaning that the same code works on different op-
erative systems. Tkinter was used in this project to implement the GUI of the
proposed approach.
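A minimal Tkinter sketch of the kind of interface described here is shown below; the widget layout and the predict_scripts stub are illustrative assumptions, not the actual tool.

```python
import tkinter as tk
from tkinter import filedialog

def predict_scripts(path):
    # Hypothetical stand-in for the trained NLP + classification pipeline.
    return ["ExampleTestScript.cs"]

def on_select():
    # Let the tester pick a .docx test case specification and show suggestions.
    path = filedialog.askopenfilename(filetypes=[("Word documents", "*.docx")])
    if path:
        suggestions = predict_scripts(path)
        output.delete("1.0", tk.END)
        output.insert(tk.END, "\n".join(suggestions))

root = tk.Tk()
root.title("Test Script Suggestion Tool")
tk.Button(root, text="Select specification...", command=on_select).pack(pady=5)
output = tk.Text(root, height=10, width=60)
output.pack(padx=5, pady=5)
root.mainloop()
```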
Figure 3.3: The steps performed on the test case specification documents dur-
ing the first stage of the approach, resulting in generation of feature vectors.
The process began by cleaning the text of each test case specification docu-
ment by removing all punctuation and converting all words to lower case since
NLTK is case sensitive. The text of the test case specifications was then di-
vided into word tokens by splitting on white-spaces, using NLTK tokenization.
These word tokens were fed to the POS-tagger as input. It was observed that
the default NLTK POS-tagger was not always able to parse all the words in a
sentence correctly. Often, the problem was related to the positions of the com-
ponents in the sentence, for example, if the sentence starts with a verb, such
as “Check temperature ...”, the NLTK tagger identified the verb, “Check”, as a
noun. To overcome this issue, the Stanford POS-tagger [51] was used instead.
The Stanford POS-tagger has been trained with more imperative sentences
compared to the default NLTK tagger and it yielded better results. However,
the Stanford POS-tagger is slower in performing the POS tagging than the
NLTK POS-tagger, but since the number of test case specifications at hand is
small, the time was not an issue. The last step before generating the feature
vectors was removing all stop words. This was done to minimize noise in the
data.
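The sketch below outlines this preprocessing stage with NLTK and the Stanford POS-tagger. The tagger model and jar paths are placeholders, and the rule used to select keywords (here, keeping nouns and verbs that are not stop words) is an illustrative assumption rather than the exact criteria used in this project.

```python
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import StanfordPOSTagger

# Paths to the Stanford tagger model and jar are hypothetical placeholders.
tagger = StanfordPOSTagger("english-bidirectional-distsim.tagger",
                           "stanford-postagger.jar")
stop_words = set(stopwords.words("english"))

def extract_features(text):
    # Remove punctuation and lower-case the text before tokenization.
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = word_tokenize(cleaned)
    tagged = tagger.tag(tokens)
    # Keep content words (here: nouns and verbs) that are not stop words.
    return [word for word, tag in tagged
            if word not in stop_words and tag.startswith(("NN", "VB"))]

print(extract_features("Check temperature of the unit before start."))
```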
Figure 3.4: An example of a feature vector extracted from a test case specifi-
cation (raw data).
Each feature vector was later mapped to a label vector containing the names
of the C# code scripts corresponding to the test case specification that is rep-
resented by that feature vector. Two arrays were constructed, F eatures and
Labels, where F eatures[i] contains the feature vector belonging to the ith
test case specification and Labels[i] contains the label vector belonging to the
ith test case specification. The F eatures array is used as input for the text
classifier and Labels as the output.
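An illustrative sketch of how such parallel arrays can be built is given below; the specification-to-script mapping and the read_docx_text helper are hypothetical examples, and extract_features refers to the earlier preprocessing sketch.

```python
def read_docx_text(path):
    # Hypothetical helper; a real version could use python-docx to read the file.
    return "Check temperature of the unit before start."

# Made-up mapping from specification files to the C# scripts that execute them.
spec_to_scripts = {
    "spec_001.docx": ["PowerOnTest.cs"],
    "spec_002.docx": ["PowerOnTest.cs", "TemperatureCheckTest.cs"],
}

features, labels = [], []
for spec, scripts in spec_to_scripts.items():
    features.append(extract_features(read_docx_text(spec)))  # keyword list per spec
    labels.append(scripts)                                    # C# script file names
```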
the type of text and classification problem in this project. Synonym replace-
ment was not used since the approach would lead to higher dimensionality
which would make it difficult to get statistically meaningful information on the
distribution of the data. Swapping words in a sentence was not suitable either
since the order of the words is not important when transforming the sentence
to a feature vector. Data augmentation in this project was done by choosing
elements from the power set, the set of all subsets, of a feature vector. Not
all subsets were selected and added to the training data, for example subsets
containing only one word were excluded. The newly generated feature vectors
were labelled with the same label as the vector they were generated from. Fig-
ure 3.5 shows this data augmentation technique applied to an example feature
vector. As seen in Figure 3.5, multiple new feature vectors are generated from
an existing feature vector.
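The following sketch, using only Python's standard library, shows this power-set augmentation with single-word subsets excluded; any further filtering of subsets used in the project is not reproduced here.

```python
from itertools import combinations

def augment(feature_vector, label_vector, min_size=2):
    """Return new (features, labels) pairs built from subsets of the feature vector."""
    augmented = []
    for size in range(min_size, len(feature_vector) + 1):
        for subset in combinations(feature_vector, size):
            # Every generated subset keeps the label vector of the original example.
            augmented.append((list(subset), label_vector))
    return augmented

extra = augment(["check", "temperature", "sensor"], ["TemperatureCheckTest.cs"])
# Yields ["check", "temperature"], ["check", "sensor"], ["temperature", "sensor"]
# and the full vector itself, each labelled with TemperatureCheckTest.cs.
```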
After performing data augmentation on the training data, all data were encoded
using one-hot encoding to keep track of the categories (features and labels) in a
numerically meaningful way. Each feature and label in a test case specification
was transformed to either 1 or 0 depending on whether that feature or label
belonged to that test case specification.
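A minimal sketch of this encoding step, assuming scikit-learn's MultiLabelBinarizer and illustrative variable names, is shown below.

```python
from sklearn.preprocessing import MultiLabelBinarizer

feature_encoder = MultiLabelBinarizer()
label_encoder = MultiLabelBinarizer()

# train_features/train_labels are lists of keyword lists and script-name lists.
X_train = feature_encoder.fit_transform(train_features)
Y_train = label_encoder.fit_transform(train_labels)

# The test data is encoded with the vocabularies learned from the training set.
X_test = feature_encoder.transform(test_features)
Y_test = label_encoder.transform(test_labels)
```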
randomly is to avoid problems with missing labels. If the data were split ran-
domly, there is a risk that some classes become missing from the training data
set which is a common problem when having low occurrence frequency of the
classes. In this project, some of the labels occurred very rarely. Therefore,
when splitting the data randomly, some of these labels would be missing from
one side of the split. Missing some of the labels in the training set is problem-
atic since the classifier will not be able to train with these labels and thus will
not predict them for any instances.
To perform the multi-label classification, the OVA strategy was used with the
assumption that the labels in the data set are mutually exclusive. The OVA
strategy was used in this project for its simplicity and because it has proven
to yield good performance in practice [41]. Two classifiers were applied on
the generated vectors: LinearSVC and KNN classifier. LinearSVC is one of
the algorithms that performs well on a range of NLP-based text classification
tasks. Grid search was performed to choose the best value of the regularization
parameter, C, for LinearSVC. The C value of 0.6 gave the best results and was
therefore selected. Also, LinearSVC is relatively fast and does not need much
data for training. KNN classifier was also applied and its results were com-
pared to the results achieved by the LinearSVC. KNN is easy to implement,
requiring only two parameters: the number of neighbors, K, and the distance
function. Two values for the parameter K were tested during the experiments,
1 and 3. Instead of using the default distance function for KNN (Euclidean
distance), Sørensen-Dice distance [53, 54] was used. Sørensen-Dice distance
is a metric intended for boolean-valued vector spaces which is suitable in this
study since our data was converted to boolean values during the one-hot en-
coding.
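The sketch below shows how the two classifiers can be set up with scikit-learn under these choices; C = 0.6 is the value reported as best by the grid search, and the variable names refer to the one-hot encoded arrays from the previous step.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# One-vs-all (OVA) multi-label classification with LinearSVC, C chosen by grid search.
svc = OneVsRestClassifier(LinearSVC(C=0.6))
svc.fit(X_train, Y_train)

# KNN on the boolean one-hot vectors, using the Sorensen-Dice distance
# instead of the default Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=1, metric="dice", algorithm="brute")
knn.fit(X_train, Y_train)

Y_pred_svc = svc.predict(X_test)
Y_pred_knn = knn.predict(X_test)
```

With one-hot encoded boolean vectors, the Dice distance compares only the overlap of active features, which is the motivation for preferring it over the Euclidean distance in this setting.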
Figure 3.6: The relation between ground truth set and prediction set. The left
side of the box contains the relevant elements, i.e. all the labels that exist in
the ground truth set. The right side of the box contains the negative elements
of the ground truth set. The circle denotes the positive predicted elements, i.e.
positive elements in the prediction set.
negative samples than positive ones. It is known that using accuracy as a per-
formance metric for imbalanced data sets can yield misleading results [55].
Therefore, it was decided to implement and use a balanced accuracy function
[56] adjusted for multi-label classification. In the balanced accuracy function,
the number of true positive and true negative predictions are normalized by the
number of positive and negative samples, respectively. The balanced accuracy
function is calculated according to Equation 3.1, whereas the usual accuracy
function, i.e. the proportion of correct predictions, is calculated according to
Equation 3.2.
$$\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \qquad (3.1)$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (3.2)$$
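A per-specification balanced accuracy function following Equation 3.1 might look as follows; this is a sketch of the idea rather than the exact implementation used in the project.

```python
import numpy as np

def balanced_accuracy_samples(Y_true, Y_pred):
    """Average of per-specification balanced accuracy over the test set."""
    scores = []
    for y_true, y_pred in zip(np.asarray(Y_true), np.asarray(Y_pred)):
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        scores.append(0.5 * (sensitivity + specificity))
    return float(np.mean(scores))
```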
Additionally, Precision, Recall, and F1 score were also calculated and used
to measure the performance of the proposed approach since these metrics put
more weight on the True Positive elements which in this study are considered
to be of most importance. The precision is the number of correctly predicted
C# scripts divided by the total number of C# scripts predicted by the proposed
approach. This indicates how many of the selected items are relevant.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3.3)$$
The recall is the number of correctly predicted C# scripts divided by the total
number of existing C# scripts in the ground truth. This indicates how many of
the relevant items are selected.
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3.4)$$
F1 score is a harmonic average of Precision and Recall and is used as a com-
pound measurement. F1 score reaches its best value at 1 (corresponding to
perfect precision and recall) and worst at 0.
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3.5)$$
F1 score was calculated using the Scikit-learn’s f1_score function [44] where
the average parameter was set to ’samples’ in order to calculate metrics for
each instance and find the average. This is preferable for multi-label classifi-
cation.
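A sketch of this evaluation step with scikit-learn is shown below; the prediction variables follow the earlier sketches, and the zero_division argument assumes a reasonably recent scikit-learn version.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

precision = precision_score(Y_test, Y_pred_svc, average="samples", zero_division=0)
recall = recall_score(Y_test, Y_pred_svc, average="samples", zero_division=0)
f1 = f1_score(Y_test, Y_pred_svc, average="samples", zero_division=0)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```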
Chapter 4
Results
Figure 4.1: Frequencies of the C# test scripts for test case specifications.
Figure 4.2: The frequency of the ten most frequent features in the test case
specifications.
Table 4.1: The F1 score, Precision, Recall, and balanced Accuracy metrics
using the data from three Ericsson products. The results are obtained with
data augmentation.
The presented results in Table 4.1 are obtained when performing data augmen-
tation on the training set whereas the results in Table 4.2 are retrieved without
using data augmentation.
Table 4.2: The F1 score, Precision, Recall, and balanced Accuracy metrics
using the data from three Ericsson products. The results are obtained without
data augmentation.
The performance results shown in Tables 4.1 and 4.2 indicate that the best per-
forming model for the proposed approach is the one using LinearSVC as clas-
sifier with data augmentation on the training set which achieved an F1 score
of 89.3% and a balanced accuracy of 98.2%.
The results in Tables 4.1 and 4.2 are calculated for each instance of the testing
set, i.e. for each test case specification, and averaged over the total number of
test case specifications in the testing set. For the best model, the number of
true positives, true negatives, false positives, and false negatives for each test
case specification are shown in Table 4.3. The results for LinearSVC in Table
4.1 are calculated using the values in Table 4.3 by averaging the values of this
table row-wise.
Table 4.3: Comparison of the results of LinearSVC and the ground truth, for
each of the test case specifications in the testing data set, based on 72 test
scripts. TP: True Positive, TN: True Negative, FP: False Positive and FN:
False Negative.
Specification TP TN FP FN
1 2 68 2 0
2 1 71 0 0
3 1 71 0 0
4 1 70 1 0
5 2 69 1 0
6 1 71 0 0
7 2 70 0 0
8 1 71 0 0
9 1 71 0 0
10 1 71 0 0
11 1 71 0 0
12 1 70 1 0
13 1 69 2 0
14 0 70 1 1
15 1 71 0 0
16 1 71 0 0
17 1 69 1 1
18 1 71 0 0
19 1 71 0 0
20 1 70 1 0
21 1 71 0 0
22 1 70 1 0
23 1 71 0 0
24 1 70 1 0
25 1 71 0 0
26 1 71 0 0
27 1 71 0 0
28 1 71 0 0
29 1 71 0 0
30 1 71 0 0
The scalability of the approach was tested by training the best model with
different data sizes, i.e. with different number of test case specifications. Lin-
earSVC was used as classifier during this experiment and the F1 score obtained
for each size was recorded. Figure 4.3 shows the performance of the approach
for different sizes of the training set. It is visible in Figure 4.3 that by increas-
ing the training set size, the performance increases as well, leveling off at
around 70 test case specifications.
Figure 4.3: The scalability of the approach showing the F1 scores obtained
when training the model with different amount of test case specifications. Lin-
earSVC is used as the classifier.
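A sketch of how such a scalability experiment can be run is shown below, reusing the encoded arrays and classifier setup from the earlier sketches; the slice sizes are illustrative, and the actual experiment may have sampled the training set differently.

```python
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Retrain the best model on increasingly large slices of the training data
# and record the per-sample F1 score on the fixed test set for each size.
for size in range(10, len(X_train) + 1, 10):
    model = OneVsRestClassifier(LinearSVC(C=0.6))
    model.fit(X_train[:size], Y_train[:size])
    score = f1_score(Y_test, model.predict(X_test), average="samples", zero_division=0)
    print(f"{size} training specifications -> F1 = {score:.3f}")
```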
4.3 The Tool
The user can easily select a new test case specification file as input for the
tool. The specification document must be written in English and have a .docx
format. The expected output of the tool is suggestions of known C# test scripts
file names that are relevant for the test case specification in question. These test
scripts contain the code that should be run to execute the test case described
in the selected test case specification. The tool has been able to suggest test
scripts for all test case specifications from the products provided by Ericsson.
It is important to keep in mind that the tool is only able to suggest test scripts
that the model has trained with, i.e. test scripts belonging to the 72 C# test
scripts that were provided as data in the beginning of the project.
Chapter 5
Discussion
In this chapter the results of the proposed approach and the different perfor-
mance metrics are discussed. The chapter includes a section discussing the
threats to validity, limitations and challenges faced when conducting this study.
Also, the sustainability and ethical aspects of this thesis work are reviewed.
The chapter closes with a section presenting the possibilities for future work
of this thesis.
were promising, showing that the automatic test case generation approach has
achieved high accuracy and F1 scores. This means that the approach has man-
aged to produce relevant test scripts for the selected test case specifications.
The best performance results were obtained when using LinearSVC as clas-
sifier with data augmentation on the training set, yielding a best F1 score of
89.3%. Moreover, the tool, presented in Section 4.3, which implements the
proposed approach, can aid the testers by relieving some of the manual op-
erations performed on test case specifications. Despite the promising results,
there are some scenarios where the proposed approach can not fully replace
manual test case generation, which will be further discussed in this section.
It is interesting to discuss how the data have affected the results of this study.
One important aspect of the data in this project is that there are no one-to-one
mappings between the test case specifications and the test scripts. Figure 4.1
shows that the majority of the test case specifications (97/130) are mapped to
only one test script. Since there exist only 72 test scripts in the whole data set,
it can be inferred that some test scripts are used for more than one test case.
The model used in the proposed approach is trained with only these 72 C# test
scripts and therefore, when the user selects a new test case specification, the
tool is only able to suggest test scripts that belong to the set of 72 scripts. Thus,
the tool can replace manual test case generation to a large extent as long as the
desired outcome is a subset of the 72 test scripts. The model cannot suggest
unseen test scripts; if an unseen script is needed, the testing engineers must generate
the script manually. The proposed approach acts as an important first step in
supporting a fully automated test case generation process. As discussed, for
some cases, human intervention is needed to generate the test cases. However,
the model can be extended further by including more data in the training phase.
In Table 4.3, it can be seen that the number of true negatives is large, in com-
parison to the other values, for all test case specifications. A true negative
represents that the model correctly omitted a test script from the prediction.
Since the data in this project is imbalanced, accuracy is a poor measurement
to evaluate the model. Considering the default accuracy function, shown in
Equation 3.2, the score increases in conjunction with the number of true neg-
atives. With an imbalanced data set, this would lead to misleading results.
Therefore, a balanced accuracy function was calculated instead where true
positive and true negative predictions are normalized. Furthermore, in many
studies, the number of false positives is regarded as having a negative impact
on the quality of the model. However, in this study, a falsely predicted test
script could still be of use to the testers at Ericsson since it could give insights
on test scripts that may not have been obvious for the specific test case. This
means that the number of false positives does not have as large a negative
impact on the evaluation of the model as one might think. The number of
false positives has a direct effect on the Precision score, as a lower rate of false
positives leads to a higher Precision. But since the goal in this study is not
to minimize the number of false positives, Precision alone is not sufficient to
describe the quality of the approach. In this study, it is more important to get
as few false negatives as possible. The number of false negatives denotes the
number of scripts that the model did not manage to predict, which is some-
thing we want to avoid. The model has managed to achieve this as can be seen
in Table 4.3, where the model failed to predict the test scripts only for two test
case specifications. This affects the Recall score since the lower the number
of false negatives, the higher the Recall score. However, similar to Precision,
Recall alone is not sufficient to determine the performance of the approach. If
the model were to suggest all of the test scripts, then the model would achieve
a perfect Recall score. This is also not a desired property of the model. There-
fore, F1 score was also calculated as a harmonic mean of Precision and Recall
to give a more balanced measure of quality of the approach. Moreover, all the
performance measurements used in this study are affected by the number of
true positives, i.e. number of test scripts that were correctly predicted by the
model. The number of true positives is considered to be the most important
factor when evaluating the quality of the model since it is highly important
that the model manages to predict the correct test scripts for a specific test
case specification. With all this information in mind, it is possible to say that
using only one of these metrics to determine the approach’s performance is
not enough. By combining and analyzing all of them, one can acquire more
useful insights about the approach.
Two classifiers, LinearSVC and KNN, were used during the experiments and
their results were compared. According to the results in Tables 4.1 and 4.2,
LinearSVC outperforms KNN and the highest F1 score is achieved using the
LinearSVC classifier. One reason for why LinearSVC outperformed KNN is
the large number of features in the data, i.e. the high dimensionality. With
large numbers of features, the performance of KNN tends to deteriorate. As
described in Section 2.4.1.2, KNN only uses observations that are
near the test observation, for which a prediction must be made, to make the
prediction. In a high dimensional space, the K observations that are nearest to
the given test observation may be far away from it, resulting in a poor KNN fit.
On the other hand, LinearSVC tends to perform well with a limited set of points
worthiness of its results and it should be considered during all phases of the
study. Runeson and Höst [57] describe four aspects of validity, which are:
Construct validity, Internal validity, External validity, and Reliability. These
validity aspects and the threats to them are discussed in this section from the
perspective of this study.
The aspect of construct validity addresses the relation between theory and
observation [58]. This aspect concerns whether the measurements that are
studied really represent what the researcher intended to investigate [57]. One
possible construct validity threat in the present study is the source used for
generating the test cases. In this study, only test case specifications are used in
order to generate the test cases. It is possible to use other types of specifica-
tions written in natural language, such as Software Requirement Specifications
(SRS) or Use Case Specifications. Combining and analyzing different specifi-
cation documents to generate test cases may yield better results or give a more
accurate picture of the intended research. However, other natural language-
based specifications were not available for this testing project and deriving
them would have been very time-consuming. Furthermore, the aspect of con-
struct validity has been taken into account during the course of this project.
For example, it was concluded that accuracy would be a poor measurement of
the approach’s performance, i.e. it would have poor construct validity. There-
fore, other more suitable performance metrics were selected, such as balanced
accuracy and F1 score, leading to a better construct validity.
Internal validity is the aspect of validity that concerns the conclusions of the
study [57]. The main threat to internal validity in this study is the dimension-
ality of the data set as it was small and had a low frequency of labels. The
data set was divided into training and testing sets only once and that was done
manually to ensure that all labels were present in the training set, as described
in Section 3.5.2. It was not possible to obtain a standard deviation of the per-
formance measurements since the performance measurements were obtained
for only one division of the data. This might suggest that the approach is de-
pendent on the division of the data and might perform differently for different
data sets. Another potential threat to internal validity is the structure of the
test case specifications. In this project, NLP techniques were performed on a
set of semi-structured test case specifications where NLTK and Stanford POS-
tagger were used to parse and analyze the documents in a reasonable amount of
time. There is no guarantee that for more complicated structures of test case
specifications, NLTK and Stanford POS-tagger will perform similarly, which
may impact the results of the approach.
External validity addresses the ability of generalizing the findings [57, 58].
The proposed approach has been applied on a limited number of test case spec-
ifications in only one industrial testing project. Despite that, the approach’s
findings are relevant for other cases since the approach is applicable to other
contexts with common characteristics. Also, the approach can be extended to
other testing domains and testing levels, for example unit, regression and sys-
tem testing levels. To allow for comparison to other approaches in the testing
domain, all necessary context information has been provided in this report.
Lastly, the reliability of the study concerns the extent to which the data and
the analysis are dependent on the specific researcher. Runeson and Höst [57]
argue that if this study were to be conducted by a different researcher, the out-
come should be the same. To avoid threats to reliability as much as possible
and to better allow replication of the study, the methodology of the approach
is described in detail in Chapter 3. However, there exist some issues in this
study that could pose threats to reliability. The major reliability threat in
this study is the way the features are extracted from the test case specifica-
tions. As described in Section 3.4.1, the test case specifications were first
observed and examined to determine which features are relevant to extract.
This makes the feature extraction process somewhat subjective and specific
for this testing domain, which may influence the outcome of the study. The
ten most frequent features shown in Figure 4.2 would probably be different if
other conditions were used for extracting the features. Furthermore, since the
test case specifications are written in natural language by a group of testers,
these specifications can suffer from spelling issues. If an important keyword
in the specification is misspelled, one or several times, the approach will not
be able to select this keyword as a feature for the test case specification. This
issue can directly impact the results of the study since the feature extraction
stage is an essential part of the whole approach.
A developed system and its products can therefore affect all three components
of sustainability: social, environmental and economic. Therefore, it is always
important to keep sustainability in mind when developing new products. Op-
timizing the software testing process will in turn optimize the whole software,
since testing is a large part of its life cycle. Ensuring higher quality products
and faster deliveries will amplify the software’s effect on sustainability. How
sustainability is affected and to what degree depends on the domain of the
system under development. For example, if a software system has a positive
effect on the environment, then optimizing and improving the testing process
of that system will increase its positive environmental effect. In this study, the
products used are part of the telecommunication domain where the host com-
pany is trying to develop sustainable artificial intelligence solutions that scale
globally.
Vinuesa et al. [59] state that artificial intelligence has a great effect on the
achievement of the Sustainable Development Goals (SDG) since artificial in-
telligence has potential to enable the accomplishment of 134 (79%) targets
across all the goals. However, artificial intelligence may present some limi-
tations in the achievement of 59 targets [59]. The work in this degree project
makes use of artificial intelligence to help automate the software testing
process which leads to reduction in costs and resources. From an economic
perspective, the automation of software testing will benefit the company since
it can potentially lead to cutting costs in the long run. This will increase the
revenue of the company and also increase the possibility of enhancing the
product’s quality. However, from a social point of view, automating differ-
ent tasks in several domains can lead to a reduced need for workers. Automa-
tion can replace people and thus lead to increased unemployment in society.
Nevertheless, this is not always the case in the testing domain since artificial
intelligence may replace the human work but not the humans themselves. Ex-
perienced and educated people with domain knowledge are always needed to
design, implement, and validate the results of such automation algorithms and
methods. Moreover, the results of this thesis will contribute towards test engi-
neers spending more time on more valuable and rewarding tasks, rather than
repetitive jobs. The goal of the approach is to aid the testers in their work
rather than replace them in any way.
From an ethical point of view, one has to consider the extent to which such
automation algorithms and approaches should be trusted. In some scenarios,
where vital situations can be involved, a fully automated testing process could
raise ethical issues. The tool implemented in this degree project is intended to
act as a supplement for the test engineers. However, this approach can be de-
veloped further to an extent where test engineers might fully rely on it without
assessing the generated test cases. This is problematic in situations where hu-
man lives are directly or indirectly involved, for example in self-driving cars.
Therefore, human intuition and inductive reasoning cannot be fully replaced
and are required to ensure the safety of the developed system.
Chapter 6
Conclusions
Automatic test case generation is one way of optimizing the software testing
process, which can help in minimizing the required time and effort for testers.
In this thesis, an approach for automatic test case generation from natural lan-
guage test case specifications has been designed, applied, and evaluated. The
implemented approach consists of a pipeline of three stages. The
first stage of the approach involves parsing and analyzing the test case specifi-
cation documents using different NLP techniques. Feature vectors, containing
keywords representing the specification, are generated during this stage. In the
next stage, the feature vectors are mapped to label vectors consisting of C#
test scripts file names. The feature and label vectors are used as input and out-
put, respectively, in the last stage, the text classification process. Finally, the
proposed approach is applied and evaluated on an industrial testing project at
Ericsson AB in Sweden and the approach has been implemented as a tool for
aiding testers in the process of test case generation.
Regarding the research questions of this thesis, it is concluded that this de-
gree project has shown how NLP techniques can be incorporated in the test
case generation process and that the proposed approach has achieved promis-
ing results. The results of evaluating the approach showed that the proposed
approach has managed to produce relevant test scripts for the selected test case
specifications, achieving high accuracy and F1 scores. The best F1 score of
89.3% is achieved when running with the OVA strategy with LinearSVC clas-
sifier and performing data augmentation on the training set. Furthermore, it
was clear that LinearSVC outperformed the KNN classifier given the data set
of this project. It was discussed that a possible reason for this is the properties
of the data in this project. The data contains a large number of features, i.e.
the data is high dimensional, and the training set is small, which are conditions
where LinearSVC outperforms KNN.
Future research of this study may include applying the proposed approach on
more products, i.e. more data, and evaluating it in different testing domains to
get more helpful insights about the approach. Another potential future direc-
tion of this study is to investigate the possibility of producing entirely new test
scripts containing unseen code for executing the test cases. Such an extension
would increase the approach's ability to deal with unseen test case specification
documents.
To conclude this thesis, it is evident that there is potential in using NLP tech-
niques on natural language test case specifications to automatically generate
test cases. The proposed approach has managed to suggest relevant test scripts
for the selected test case specifications. Furthermore, the approach has poten-
tial in replacing manual testing since the tool, which implements the approach,
can aid the testers by relieving some of the manual operations performed on
test case specifications.
Bibliography
[59] R. Vinuesa et al. “The role of artificial intelligence in achieving the Sus-
tainable Development Goals”. In: Nature Communications 11.1 (2020),
pp. 1–10. Nature Publishing Group.
[60] M.-L. Zhang and Z.-H. Zhou. “A review on multi-label learning algo-
rithms”. In: IEEE transactions on knowledge and data engineering 26.8
(2013), pp. 1819–1837. IEEE.
TRITA-EECS-EX-2020:592
www.kth.se