
Mälardalen University Press Dissertations

No. 397

EXPLAINABLE ARTIFICIAL INTELLIGENCE FOR ENHANCING
TRANSPARENCY IN DECISION SUPPORT SYSTEMS

Mir Riyanul Islam

2024

School of Innovation, Design and Engineering


Copyright © Mir Riyanul Islam, 2024
ISBN 978-91-7485-626-2
ISSN 1651-4238
Printed by E-Print AB, Stockholm, Sweden
Mälardalen University Press Dissertations
No. 397

EXPLAINABLE ARTIFICIAL INTELLIGENCE FOR ENHANCING
TRANSPARENCY IN DECISION SUPPORT SYSTEMS

Mir Riyanul Islam

Academic dissertation

which, for the award of the degree of Doctor of Technology in Computer Science at the
School of Innovation, Design and Engineering, will be publicly defended on Tuesday,
30 January 2024, at 13.15 in Gamma, Mälardalen University, Västerås.

Faculty opponent: Professor Kerstin Bach, Norwegian
University of Science and Technology (NTNU)

School of Innovation, Design and Engineering


Abstract
Artificial Intelligence (AI) is recognized as an advanced technology that assists in decision-making
processes with high accuracy and precision. However, many AI models are generally regarded as black
boxes due to their reliance on complex inference mechanisms. The intricacies of how and why these
AI models reach a decision are often not comprehensible to human users, raising concerns about the
acceptability of their decisions. Previous studies have shown that the lack of an associated explanation
in a human-understandable form makes the decisions unacceptable to end-users. Here, the research
domain of Explainable AI (XAI) provides a wide range of methods with the common theme of
investigating how AI models reach a decision and explaining that decision. These explanation methods
aim to enhance transparency in Decision Support Systems (DSS), which is particularly crucial in
safety-critical domains like Road Safety (RS) and Air Traffic Flow Management (ATFM). Despite ongoing
developments, DSSs for safety-critical applications are still evolving. Improved transparency,
facilitated by XAI, emerges as a key enabler for making these systems operationally viable in
real-world applications by addressing acceptability and trust issues. Moreover, certification
authorities are less likely to approve such systems for general use under the current mandate of the
Right to Explanation from the European Commission and similar directives from organisations across
the world. This urge to permeate the prevailing systems with explanations paves the way for research
studies on XAI centred on DSSs.

To this end, this thesis work primarily developed explainable models for the application domains of
RS and ATFM. In particular, explainable models are developed for assessing drivers' in-vehicle mental
workload and driving behaviour through classification and regression tasks. In addition, a novel method
is proposed for generating a hybrid feature set from vehicular and electroencephalography (EEG)
signals using mutual information (MI). This feature set is demonstrated to reduce the effort required
for the complex computations of EEG feature extraction. The concept of MI was further utilized in
generating human-understandable explanations of mental workload classification. For the domain of
ATFM, an explainable model for flight take-off time delay prediction from historical flight data is
developed and presented in this thesis. The insights gained through the development and evaluation
of the explainable applications for the two domains underscore the need for further research on the
advancement of XAI methods.
In this doctoral research, the explainable applications for the DSSs are developed with additive
feature attribution (AFA) methods, a class of XAI methods that is popular in current XAI research.
Nevertheless, several sources in the literature assert that feature attribution methods often yield
inconsistent results that require credible evaluation. However, the existing body of literature on
evaluation techniques is still immature, offering numerous suggested approaches without a standardized
consensus on their optimal application in various scenarios. To address this issue, comprehensive
evaluation criteria are also developed for AFA methods, as the literature on XAI suggests. The
proposed evaluation process considers the underlying characteristics of the data and utilizes an
additive form of Case-based Reasoning, namely AddCBR. AddCBR is proposed in this thesis and is
demonstrated to complement the evaluation process by serving as the baseline against which the feature
attributions produced by the AFA methods are compared. Apart from generating explanations with feature
attribution, this thesis work also proposes iXGB – interpretable XGBoost. iXGB generates decision rules
and counterfactuals to support the output of an XGBoost model, thus improving its interpretability.
The functional evaluation shows that iXGB has the potential to be used for interpreting arbitrary
tree-ensemble methods.
In essence, this doctoral thesis first contributes to the development of rigorously evaluated
explainable models tailored to two distinct safety-critical domains, with the aim of augmenting
transparency within the corresponding DSSs. Additionally, the thesis introduces novel methods for
generating more comprehensible explanations in different forms, surpassing existing approaches, and
showcases a robust evaluation approach for XAI methods.

ISBN 978-91-7485-626-2
ISSN 1651-4238
To my parents and family ...
Acknowledgements

This long journey of my doctoral studies would not have been possible without the blessings
of the Almighty and the guidance, inspiration, and help of numerous people.
I would like to thank my supervisors, Prof. Mobyen Uddin Ahmed and Prof.
Shahina Begum, for providing me with the opportunity to pursue my doctoral studies
and deliberately guiding me through the process. I am deeply grateful for your
indispensable support and supervision throughout the journey.
I am immensely obliged to Prof. Rosina Weber for imparting invaluable knowledge
and insights through her mentorship during the crucial phase of my doctoral studies.
My special thanks to Dr. Shaibal Barua, who has inspired me and from whom I have
learned much over the years of my doctoral studies. I would like to thank my colleagues,
Dr. Hamidur Rahman, Dr. Waleed Jmoona, Arnab Barua, and Md Rakibul Islam,
for their collaboration and moral support. I extend special gratitude to Dr. Shahriar
Hasan and Md Aquif Rahman for going above and beyond as colleagues and for their
unwavering support as friends and brothers during my doctoral studies.
I am thankful to my fellow doctoral students, colleagues, and the administrative
staff at Mälardalen University for their support. My sincere gratitude
goes to Prof. Sasikumar Punnekkat for his invaluable time and insightful feedback
in reviewing my doctoral research proposal and dissertation.
I would like to express my deep gratitude to the faculty examiner, Prof. Kerstin
Bach, and the grading committee members, Prof. Mark Sebastian Dougherty, Prof.
Fredrik Heintz, and Adj. Prof. Rafia Inam, for kindly accepting the invitation and
dedicating part of your valuable time to review the studies. It is truly my honour to
have you as the reviewers of this dissertation.
Most importantly, I would like to express my deepest and heartfelt gratitude to
my mother, Prof. Anjuman Ara Begum, my father, Mir Rashedul Islam, and my
sister, Rifa Zumana, for always standing by me and supporting me throughout this
journey from several thousand miles away. I especially acknowledge my mother, from
whom I got the inspiration to pursue my doctoral studies.
I am intensely grateful to my wife, Nuzat Naila Islam, for her constant support,
love, and encouragement. Thank you for bearing with me through the most
challenging phase of my life to date, and of our life together, for listening patiently,
and for being kind to me.
I would like to express my heartfelt gratitude to all the teachers who played
a pivotal role in guiding and educating me, shaping my academic journey from
pre-school through university, and incrementally preparing me for the degree of
doctorate. Also, I want to express my sincere gratitude to all my friends from home
and abroad who supported and encouraged me during my doctoral studies.

The research studies presented in this doctoral thesis have received funding
from the following projects: i) ARTIMATION¹, under the SESAR Joint Undertaking
(Grant Agreement No. 894238), and ii) SIMUSAFE² (Grant Agreement No. 723386),
both funded by the European Union’s Horizon 2020 Research and Innovation
Programme; and iii) BrainSafeDrive³, co-funded by Vetenskapsrådet – The
Swedish Research Council and the Ministero dell’Istruzione, dell’Università e della
Ricerca della Repubblica Italiana, under the Italy-Sweden Cooperation Program. I
extend my sincere gratitude to all the collaborators from these projects; it has
been a privilege to be part of these research communities.

Mir Riyanul Islam


December 2023
Västerås, Sweden

1 https://www.artimation.eu
2 https://www.cordis.europa.eu/project/id/723386
3 https://www.brainsafedrive.brainsigns.com

Abstract

Artificial Intelligence (AI) is recognized as an advanced technology that assists in
decision-making processes with high accuracy and precision. However, many AI
models are generally regarded as black boxes due to their reliance on complex
inference mechanisms. The intricacies of how and why these AI models reach a
decision are often not comprehensible to human users, raising concerns about
the acceptability of their decisions. Previous studies have shown that the lack
of an associated explanation in a human-understandable form makes the decisions
unacceptable to end-users. Here, the research domain of Explainable AI (XAI)
provides a wide range of methods with the common theme of investigating how
AI models reach a decision and explaining that decision. These explanation methods
aim to enhance transparency in Decision Support Systems (DSS), which is particularly
crucial in safety-critical domains like Road Safety (RS) and Air Traffic Flow
Management (ATFM). Despite ongoing developments, DSSs for safety-critical
applications are still evolving. Improved transparency, facilitated by XAI, emerges as a
key enabler for making these systems operationally viable in real-world applications
by addressing acceptability and trust issues. Moreover, certification authorities are less
likely to approve such systems for general use under the current mandate of the Right to
Explanation from the European Commission and similar directives from organisations
across the world. This urge to permeate the prevailing systems with explanations
paves the way for research studies on XAI centred on DSSs.
To this end, this thesis work primarily developed explainable models for the
application domains of RS and ATFM. In particular, explainable models are developed
for assessing drivers’ in-vehicle mental workload and driving behaviour through
classification and regression tasks. In addition, a novel method is proposed
for generating a hybrid feature set from vehicular and electroencephalography
(EEG) signals using mutual information (MI). This feature set is demonstrated
to reduce the effort required for the complex computations of EEG feature
extraction. The concept of MI was further utilized in generating
human-understandable explanations of mental workload classification. For the
domain of ATFM, an explainable model for flight take-off time delay prediction
from historical flight data is developed and presented in this thesis. The insights
gained through the development and evaluation of the explainable applications for
the two domains underscore the need for further research on the advancement of XAI
methods.
In this doctoral research, the explainable applications for the DSSs are developed
with additive feature attribution (AFA) methods, a class of XAI methods that is
popular in current XAI research. Nevertheless, several sources in the literature
assert that feature attribution methods often yield inconsistent results that require
credible evaluation. However, the existing body of literature on evaluation techniques
is still immature, offering numerous suggested approaches without a standardized
consensus on their optimal application in various scenarios. To address this issue,
comprehensive evaluation criteria are also developed for AFA methods, as the
literature on XAI suggests. The proposed evaluation process considers the underlying
characteristics of the data and utilizes an additive form of Case-based Reasoning,
namely AddCBR. AddCBR is proposed in this thesis and is demonstrated to complement
the evaluation process by serving as the baseline against which the feature
attributions produced by the AFA methods are compared. Apart from generating
explanations with feature attribution, this thesis work also proposes the iXGB –
interpretable XGBoost. iXGB generates decision rules and counterfactuals to support
the output of an XGBoost model, thus improving its interpretability. The
functional evaluation shows that iXGB has the potential to be used for interpreting
arbitrary tree-ensemble methods.
In essence, this doctoral thesis first contributes to the development of rigorously
evaluated explainable models tailored to two distinct safety-critical domains, with
the aim of augmenting transparency within the corresponding DSSs. Additionally, the
thesis introduces novel methods for generating more comprehensible explanations in
different forms, surpassing existing approaches, and showcases a robust evaluation
approach for XAI methods.

Sammanfattning

Artificial Intelligence (AI) is recognized as an advanced technology that helps make
decisions with high accuracy and precision. However, many AI models are regarded as black
boxes because they rely on complex inference mechanisms. How and why these AI models reach
a decision is often not comprehensible to human users, which raises concerns that their
decisions may not be acceptable. Previous studies have shown that the absence of
accompanying explanations in a form understandable to humans makes the decisions
unacceptable to end-users. The research field of Explainable AI (XAI) offers a wide range
of methods with the common theme of investigating how AI models reach a decision and
explaining it. These explanation methods aim to increase transparency in decision support
systems (DSS), which is particularly important in safety-critical domains such as road
safety and air traffic flow management. Despite ongoing developments, DSSs for
safety-critical applications are still evolving. Improved transparency, facilitated by
XAI, emerges as a key factor in making these systems practically usable in real-world
applications and in addressing acceptance and trust issues. Furthermore, certification
authorities are less likely to approve the systems for general use under the current
mandate of the Right to Explanation from the European Commission and similar directives
from organisations across the world. This desire to permeate the prevailing systems with
explanations paves the way for research studies on XAI centred on decision support
systems.

To this end, this thesis has primarily developed explainable models for the application
domains of road safety and air traffic flow management. In particular, explainable models
are developed for assessing drivers’ in-vehicle mental workload and driving behaviour
through classification and regression tasks. In addition, a new method is proposed for
generating a hybrid feature set from vehicular and electroencephalography (EEG) signals
using mutual information (MI). The use of this feature set has been successfully
demonstrated to reduce the effort required for the complex computations of EEG feature
extraction. The concept of MI was further used to generate human-understandable
explanations of the classification of mental workload. For air traffic flow management,
an explainable model for predicting flight take-off time delay from historical flight
data is developed and presented in this thesis. The insights gained through the
development and evaluation of the explainable applications for the two domains underscore
the need for further research on the advancement of XAI methods.

In this doctoral thesis, the explainable applications for DSSs are developed using
additive feature attribution (AFA) methods, a class of XAI methods that is popular in
current XAI research. However, several sources in the literature assert that feature
attribution methods often yield inconsistent results that need to be evaluated in a
credible manner. The existing literature on evaluation techniques is, however, still
immature, offering many suggested approaches without a standardized consensus on their
optimal application in different scenarios. To address this problem, comprehensive
evaluation criteria have also been developed for AFA methods, as the XAI literature
suggests. The proposed evaluation process takes into account the underlying
characteristics of the data and uses an additive form of case-based reasoning, namely
AddCBR. AddCBR is proposed in this thesis and is demonstrated to complement the
evaluation process as the baseline against which the feature attributions produced by
the AFA methods are compared. In addition to generating explanations with feature
attribution, this thesis also proposes iXGB – interpretable XGBoost. iXGB generates
decision rules and counterfactuals to support the output of an XGBoost model, thereby
improving its interpretability. The functional evaluation shows that iXGB has the
potential to be used for interpreting arbitrary tree-ensemble methods.

In summary, this doctoral thesis first contributes to the development of rigorously
evaluated explainable models tailored to two distinct safety-critical domains. The aim
is to increase transparency within the corresponding decision support systems. In
addition, the thesis introduces new methods for generating more comprehensible
explanations in different forms, surpassing existing approaches. It also demonstrates a
robust evaluation approach for XAI methods.

List of Publications

Publications included in the Thesis†‡ –

A1. Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., & Di Flumeri, G.
(2019). Deep Learning for Automatic EEG Feature Extraction: An Application
in Drivers’ Mental Workload Classification. In L. Longo & M. C. Leva
(Eds.), Human Mental Workload: Models and Applications. H-WORKLOAD
2019. Communications in Computer and Information Science (pp. 121–135).
Springer Nature Switzerland.
A2. Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., Aricò, P., Borghini, G.,
& Di Flumeri, G. (2020). A Novel Mutual Information Based Feature Set for
Drivers’ Mental Workload Evaluation using Machine Learning. Brain Sciences,
10 (8), 551.

B. Islam, M. R., Ahmed, M. U., Barua, S., & Begum, S. (2022). A Systematic
Review of Explainable Artificial Intelligence in Terms of Different Application
Domains and Tasks. Applied Sciences, 12 (3), 1353.
C. Islam, M. R., Ahmed, M. U., & Begum, S. (2021). Local and Global
Interpretability using Mutual Information in Explainable Artificial Intelligence.
Proceedings of the 8th International Conference on Soft Computing & Machine
Intelligence (ISCMI 2021), 191–195.
D. Islam, M. R., Ahmed, M. U., & Begum, S. (2023). Interpretable Machine
Learning for Modelling and Explaining Car Drivers’ Behaviour: An Exploratory
Analysis on Heterogeneous Data. Proceedings of the 15th International
Conference on Agents and Artificial Intelligence (ICAART 2023), 392–404.
E. Islam, M. R., Weber, R. O., Ahmed, M. U., & Begum, S. (2023).
Investigating Additive Feature Attribution for Regression [Under Review].
Artificial Intelligence.
F. Islam, M. R., Ahmed, M. U., & Begum, S. (2024). iXGB: Improving the
Interpretability of XGBoost using Decision Rules and Counterfactuals [Under
Review]. 16th International Conference on Agents and Artificial Intelligence
(ICAART 2024).
† This thesis is a comprehensive summary of the listed papers that are referenced in the text

using corresponding alphabetical markers.


‡ To comply with the thesis layout, the papers have been reformatted and reprinted with

permission from the copyright holders, with typos corrected.

Publications not included in the Thesis –

Journal

• Degas, A., Islam, M. R., Hurter, C., Barua, S., Rahman, H., Poudel, M.,
Ruscio, D., Ahmed, M. U., Begum, S., Rahman, M. A., Bonelli, S., Cartocci,
G., Di Flumeri, G., Borghini, G., Babiloni, F., & Aricò, P. (2022). A Survey
on Artificial Intelligence (AI) and eXplainable AI in Air Traffic Management:
Current Trends and Development with Future Research Trajectory. Applied
Sciences, 12 (3), 1295.
• Hurter, C., Degas, A., Guibert, A., Durand, N., Ferreira, A., Cavagnetto, N.,
Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., Bonelli, S., Cartocci,
G., Di Flumeri, G., Borghini, G., Babiloni, F., & Aricò, P. (2022). Usage of
More Transparent and Explainable Conflict Resolution Algorithm: Air Traffic
Controller Feedback. Transportation Research Procedia, 66, 270–278.
• Ahmed, M. U., Islam, M. R., Barua, S., Hök, B., Jonforsen, E., & Begum,
S. (2021). Study on Human Subjects – Influence of Stress and Alcohol in
Simulated Traffic Situations. Open Research Europe, 1, 83.

Conference/Workshop
• Gorospe, J., Hasan, S., Islam, M. R., Gómez, A. A., Girs, S., &
Uhlemann, E. (2023). Analyzing Inter-Vehicle Collision Predictions during
Emergency Braking with Automated Vehicles. Proceedings of the 19th
International Conference on Wireless and Mobile Computing, Networking and
Communications (WiMob 2023), 411–418.
• Jmoona, W., Ahmed, M. U., Islam, M. R., Barua, S., Begum, S., Ferreira, A.,
& Cavagnetto, N. (2023). Explaining the Unexplainable: Role of XAI for Flight
Take-Off Time Delay Prediction. In I. Maglogiannis, L. Iliadis, J. MacIntyre,
& M. Dominguez (Eds.), Artificial Intelligence Applications and Innovations.
AIAI 2023. IFIP Advances in Information and Communication Technology
(pp. 81–93). Springer Nature Switzerland.
• Ahmed, M. U., Barua, S., Begum, S., Islam, M. R., & Weber, R. O. (2022).
When a CBR in Hand Better than Twins in the Bush. In P. Reuss & J.
Schönborn (Eds.), Proceedings of the 4th Workshop on XCBR: Case-based
Reasoning for the Explanation of Intelligent Systems (XCBR) co-located with
the 30th International Conference on Case-Based Reasoning (ICCBR 2022)
(pp. 141–152). CEUR.
• Islam, M. R., Barua, S., Begum, S., & Ahmed, M. U. (2019). Hypothyroid
Disease Diagnosis with Causal Explanation using Case-based Reasoning and
Domain-specific Ontology. In S. Kapetanakis & H. Borck (Eds.), Proceedings
of the Workshop on CBR in the Health Sciences (WHS) co-located with the 27th
International Conference on Case-Based Reasoning (ICCBR 2019) (pp. 87–97).
CEUR.

Contents

PART I Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Research Goal and Objectives . . . . . . . . . . . . . . . . . . . . . 5
1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Mapping of Research Questions, Contributions and Papers 9
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Background and Related Works . . . . . . . . . . . . . . . . . . . . 11


2.1 Explainable Artificial Intelligence . . . . . . . . . . . . . . . . . . . 11
2.1.1 Concepts of Explainability . . . . . . . . . . . . . . . . . . 11
2.1.2 Explainable Artificial Intelligence Methods . . . . . . . . . 14
2.1.3 Evaluation of Explainable Artificial Intelligence Methods . 14
2.2 Road Safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Air Traffic Flow Management . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 19


3.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1.1 Driving and Physiological Datasets . . . . . . . . . . . . . 20
3.1.2 Flight Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.3 Ethical Considerations . . . . . . . . . . . . . . . . . . . . 21
3.2 Research Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Classification and Regression Models . . . . . . . . . . . . 24
3.3.2 Explainable Artificial Intelligence Methods . . . . . . . . . 26
3.3.3 Evaluation Methods and Metrics . . . . . . . . . . . . . . 27

4 Explainable Artificial Intelligence for Decision Support Systems . . . . . . . . 29
4.1 Applications in Road Safety . . . . . . . . . . . . . . . . . . . . . . 29
4.1.1 Automated Method of Feature Extraction from
Electroencephalography Signals . . . . . . . . . . . . . . . 30
4.1.2 Hybrid Feature Set from Electroencephalography Signal
and Vehicular Data . . . . . . . . . . . . . . . . . . . . . . 31
4.1.3 Explainable Model for Drivers’ Mental Workload
Classification . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.4 Explainable Model for Monitoring Driving Behaviour . . . 33
4.2 Application in Air Traffic Flow Management . . . . . . . . . . . . . 35
4.2.1 Explainable Model for Flight Take-Off Time Delay
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Development of XAI Methods and Evaluation Approach . . . . . . 37
4.3.1 Evaluation of Additive Feature Attribution Methods . . . 38
4.3.2 Interpretable XGBoost . . . . . . . . . . . . . . . . . . . . 40

5 Summary of the Included Papers . . . . . . . . . . . . . . . . . . . . 41


5.1 Paper A1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Paper A2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Paper B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4 Paper C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.5 Paper D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.6 Paper E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.7 Paper F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6 Discussions, Conclusion and Future Works . . . . . . . . . . . . . 49


6.1 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.1.1 Discussion on Research Question 1 . . . . . . . . . . . . . 51
6.1.2 Discussion on Research Question 2 . . . . . . . . . . . . . 52
6.1.3 Discussion on Research Related Issues . . . . . . . . . . . 54
6.1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.2 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . 57

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

PART II Included Papers . . . . . . . . . . . . . . . . . . . . . . . . . 69

A1 Deep Learning for Automatic EEG Feature Extraction: An
Application in Drivers’ Mental Workload Classification . . . . . . . . . . 73
A1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
A1.2 Background and Related Works . . . . . . . . . . . . . . . . . . . . 75
A1.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 76
A1.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 76
A1.3.2 Data Collection and Processing . . . . . . . . . . . . . . . 77
A1.3.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 78
A1.3.4 Classification of MWL . . . . . . . . . . . . . . . . . . . . 80
A1.4 Result and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 81
A1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
A1.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

A2 A Novel Mutual Information Based Feature Set for Drivers’
Mental Workload Evaluation using Machine Learning . . . . . . . . . . . 91

A2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A2.2 Background and Related Works . . . . . . . . . . . . . . . . . . . . 93
A2.2.1 Assessment of Drivers’ Mental Workload . . . . . . . . . . 94
A2.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 96
A2.3.1 Experimental Protocol . . . . . . . . . . . . . . . . . . . . 96
A2.3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . 97
A2.3.3 Mutual Information Based Feature Extraction . . . . . . . 102
A2.3.4 Prediction and Classification Models . . . . . . . . . . . . 105
A2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A2.4.1 Quantification of Drivers’ Mental Workload . . . . . . . . 107
A2.4.2 Drivers’ Mental Workload and Event Classification . . . . 107
A2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
A2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

B A Systematic Review of Explainable Artificial Intelligence in
Terms of Different Application Domains and Tasks . . . . . . . . . . . . 123
B.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
B.2 Theoretical Background . . . . . . . . . . . . . . . . . . . . . . . . . 126
B.2.1 Stage of Explainability . . . . . . . . . . . . . . . . . . . . 127
B.2.2 Scope of Explainability . . . . . . . . . . . . . . . . . . . . 128
B.2.3 Input and Output . . . . . . . . . . . . . . . . . . . . . . . 128
B.3 Related Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
B.4 SLR Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
B.4.1 Planning the SLR . . . . . . . . . . . . . . . . . . . . . . . 130
B.4.2 Conducting the SLR . . . . . . . . . . . . . . . . . . . . . 131
B.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
B.5.1 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
B.5.2 Application Domains and Tasks . . . . . . . . . . . . . . . 137
B.5.3 Development of XAI in Different Application Domains . . 140
B.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.6.1 Input Data and Models for Primary Task . . . . . . . . . . 157
B.6.2 Development of Explainable Models in Different
Application Domains . . . . . . . . . . . . . . . . . . . . . 158
B.6.3 Evaluation Metrics for Explainable Models . . . . . . . . . 160
B.6.4 Open Issues and Future Research Direction . . . . . . . . 160
B.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

C Local and Global Interpretability using Mutual Information
in Explainable Artificial Intelligence . . . . . . . . . . . . . . . . . . . . 179
C.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
C.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 181
C.2.1 Data Acquisition and Preprocessing . . . . . . . . . . . . . 181
C.2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 182
C.2.3 Explanation of Extracted Features . . . . . . . . . . . . . 183
C.2.4 Mental Workload Classification . . . . . . . . . . . . . . . 184
C.2.5 Explanation of Mental Workload Classification . . . . . . . 184

C.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 185
C.3.1 Mental Workload Classification . . . . . . . . . . . . . . . 185
C.3.2 Global and Local Explanation . . . . . . . . . . . . . . . . 185
C.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

D Interpretable Machine Learning for Modelling and
Explaining Car Drivers’ Behaviour: An Exploratory Analysis
on Heterogeneous Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
D.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
D.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 193
D.2.1 Experimental Protocol . . . . . . . . . . . . . . . . . . . . 193
D.2.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . 194
D.2.3 Dataset Preparation . . . . . . . . . . . . . . . . . . . . . 196
D.2.4 Classifier and Explanation Models . . . . . . . . . . . . . . 199
D.2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
D.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 201
D.3.1 Exploratory Analysis . . . . . . . . . . . . . . . . . . . . . 202
D.3.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . 205
D.3.3 Explanation . . . . . . . . . . . . . . . . . . . . . . . . . . 206
D.3.4 Proposed Interpretable System . . . . . . . . . . . . . . . . 208
D.4 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . 209
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

E Investigating Additive Feature Attribution for Regression . . 215


E.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
E.2 Background and Related Works . . . . . . . . . . . . . . . . . . . . 217
E.2.1 Formal Definition . . . . . . . . . . . . . . . . . . . . . . . 217
E.2.2 XAI Methods . . . . . . . . . . . . . . . . . . . . . . . . . 218
E.2.3 Evaluation of XAI Methods . . . . . . . . . . . . . . . . . 219
E.2.4 Benchmark Datasets . . . . . . . . . . . . . . . . . . . . . 221
E.2.5 Case-based Reasoning . . . . . . . . . . . . . . . . . . . . . 221
E.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
E.3.1 Original Dataset . . . . . . . . . . . . . . . . . . . . . . . . 223
E.3.2 Capturing the Data Behaviours with a Clustering Method 223
E.3.3 Synthetic Data Generation . . . . . . . . . . . . . . . . . . 223
E.3.4 Implementation of Data Model . . . . . . . . . . . . . . . 224
E.3.5 Creating Additive CBR . . . . . . . . . . . . . . . . . . . . 224
E.3.6 Implementation of Additive Feature Attribution Methods . 224
E.3.7 Evaluation of Additive Feature Attribution Methods . . . 224
E.4 Implementation of the Proposed Approach and Evaluations . . . . . 225
E.4.1 Flight Delay Dataset . . . . . . . . . . . . . . . . . . . . . 225
E.4.2 Capturing the Data Behaviours with Density-based
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
E.4.3 Synthetic Data Generation . . . . . . . . . . . . . . . . . . 227
E.4.4 Implementation of Data Model for Regression . . . . . . . 229
E.4.5 Creating Additive CBR . . . . . . . . . . . . . . . . . . . . 230
E.4.6 Implementation of Additive Feature Attribution Methods . 234

E.4.7 Evaluation of Additive Feature Attribution Methods . . . 235
E.5 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . 240
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Appendix E.A Description of the Flight Delay Dataset . . . . . . . . . 248
Appendix E.B Selection of Optimal Number of Clusters . . . . . . . . . 250
Appendix E.C Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . 251

F iXGB: Improving the Interpretability of XGBoost using
Decision Rules and Counterfactuals . . . . . . . . . . . . . . . . . . . . 261
F.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
F.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
F.2 iXGB – Interpretable XGBoost . . . . . . . . . . . . . . . . . . . . 264
F.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
F.2.2 Extraction of Rules . . . . . . . . . . . . . . . . . . . . . . 265
F.2.3 Generation of Counterfactuals . . . . . . . . . . . . . . . . 266
F.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 266
F.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
F.3.2 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
F.4 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . . 268
F.4.1 Prediction Performance . . . . . . . . . . . . . . . . . . . . 268
F.4.2 Coverage of Decision Rule . . . . . . . . . . . . . . . . . . 270
F.4.3 Counterfactuals . . . . . . . . . . . . . . . . . . . . . . . . 271
F.5 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . 272
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

List of Figures

1.1 Generalized mapping of the research questions, contributions, and the
included papers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1 Different stages of adding explainability to black box models. . . . . . . 12


2.2 Different scopes of adding explainability to black box models illustrated
with an example decision tree. . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1 The inter-connected significant aspects of the research methodology
followed in this thesis work. . . . . . . . . . . . . . . . . . . . . . . . . 23

4.1 Schematic diagram of developing explainable DSS for RS. . . . . . . . . 30


4.2 Illustration of shared information between EEG and vehicular signal
spaces (Islam et al., 2020). . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Decision Tree for detecting risky driving behaviour. . . . . . . . . . . . . 33
4.4 Decision Tree for detecting hurried driving behaviour. . . . . . . . . . . 34
4.5 Schematic diagram of developing explainable DSS for ATFM. . . . . . . 36
4.6 Performances of ETFMS, GBDT, RF, and XGBoost for flight TOT
delay prediction in terms of MAE measured at different time intervals
to EOBT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.7 Explanation for a single instance of flight TOT delay prediction with
feature contributions extracted from SHAP. . . . . . . . . . . . . . . . . 37
4.8 Explanation for a single instance of flight TOT delay prediction with
feature contributions extracted from LIME. . . . . . . . . . . . . . . . . 38
4.9 Explanation for a single instance of flight TOT delay prediction with
feature contributions extracted from DALEX. . . . . . . . . . . . . . . . 38
4.10 Schematic diagram of evaluating XAI methods for AFA using synthetic
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

A1.1 The experimental circuit is about 2.5 kilometres long along Bologna
roads. ©Springer Nature Switzerland AG 2019. . . . . . . . . . . . . . . 77
A1.2 Steps in the traditional feature extraction technique. ©Springer Nature
Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A1.3 Network architecture of the CNN-AE for feature extraction. ©Springer
Nature Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . 79
A1.4 Variation in classification accuracy with respect to the change of
threshold on feature importance values. ©Springer Nature Switzerland
AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

A1.5 MWL classification results in terms of Sensitivity and Specificity.
©Springer Nature Switzerland AG 2019. . . . . . . . . . . . . . . . . . 82
A1.6 AUC-ROC curves for different classifiers with features extracted by
traditional methods and CNN-AE where models were trained using
10-fold cross validation. ©Springer Nature Switzerland AG 2019. . . . . 82
A1.7 AUC-ROC curves for different classifiers with features extracted
by traditional methods and CNN-AE where models were trained
using leave-one-out (participant) cross validation. ©Springer Nature
Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

A2.1 Summary of the experimental protocol. ©2020 by Islam et al. (CC BY
4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
A2.2 Average MWL score and velocity of nine participating drivers in different
(a) road segments and (b) traffic hours. ©2020 by Islam et al. (CC BY
4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
A2.3 Average MWL score and velocity with standard deviation calculated
from the data of nine participating drivers with respect to events.
©2020 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 101
A2.4 Illustration of shared information between EEG and vehicular signal
spaces. ©2020 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . 103
A2.5 Calculated MI values between EEG and vehicular signal. ©2020 by
Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . 104
A2.6 The 10-fold Cross Validation (CV) score in terms of Mean Absolute
Error (MAE) for regression models: (a) Linear Regression (LnR) and
(b) Multilayer Perceptron (MLP). ©2020 by Islam et al. (CC BY 4.0). . 108
A2.7 The 10-fold CV score in terms of MAE for regression models: (a)
Random Forest (RF) and (b) Support Vector Machine (SVM). ©2020
by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . 108
A2.8 Receiver Operating Characteristic (ROC) curves for the best two
classifier models among Logistic Regression (LgR), MLP, SVM and RF.
©2020 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 110
A2.9 Maximum balanced accuracy in different CV method for MWL and
event classification using MI-based features by different classifier models:
(a) 10-fold CV and (b) Leave-One-Out (LOO)-subject CV. ©2020 by
Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . 111

B.1 Number of published articles (y-axis) on XAI made available through
four bibliographic databases in recent decades (x-axis). (a) Trend of
the number of publications from 1984 to 2020. (b) Specific number of
publications from 2018 to June 2021. ©2022 by Islam et al. (CC BY
4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
B.2 Percentage of the selected articles on different XAI methods for different
application (a) domains and (b) tasks. ©2022 by Islam et al. (CC BY
4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
B.3 Overview of the different concepts on developing methodologies for XAI,
adapted from the review studies by Vilone and Longo (2020, 2021a).
©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 127

B.4 SLR methodology stages following the guidelines from Kitchenham and
Charters (2007). ©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . 130
B.5 Flow diagram of the research article selection process adapted from the
PRISMA flow chart by Moher et al. (2009). ©2022 by Islam et al. (CC
BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
B.6 Word cloud of the (a) author-defined keywords and (b) keywords
extracted from the abstracts through natural language processing.
©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 136
B.7 Number of publications proposing new methods of XAI from different
countries of the world and the top 10 countries based on the publication
count. ©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . 137
B.8 Chord diagram (Tintarev et al., 2018) presenting the number of selected
articles published on the XAI methods and evaluation metrics from
different application domains for the corresponding tasks. ©2022 by
Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . 140
B.9 Number of the selected articles published from different application
domains and clustered on the basis of AI/ML model type, stage, scope,
and form of explanations. ©2022 by Islam et al. (CC BY 4.0). . . . . . 141
B.10 Venn diagram with the number of articles using different forms of data
to assess the functional validity of the proposed XAI methodologies.
©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 142
B.11 Distribution of the selected articles based on the stage, scope, and form
of explanations. ©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . 145
B.12 Different forms of explanations. ©2022 by Islam et al. (CC BY 4.0). . . 147
B.13 UpSet plot presenting the distribution of different methods of evaluating
the explainable systems. ©2022 by Islam et al. (CC BY 4.0). . . . . . . 153
B.14 Different methods of evaluating explanations, which were presented in
the selected articles with the number of studies given in parentheses.
©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 154

C.1 Global explanation of mental workload classifier model with SHAP
values with bar plot (left) and mutual information illustrated with Chord
diagrams for six spectral feature groups (right). ©2021 IEEE. . . . . . 186
C.2 Example of a local explanation with SHAP. ©2021 IEEE. . . . . . . . . 186

D.1 The experimental route for simulation and track tests. A detailed
description is presented in Section D.2.1. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
D.2 The car simulator developed with DriverSeat 650 ST was used for
conducting the simulation tests. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
D.3 Event extraction using GPS coordinates. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
D.4 GPS coordinates of a single lap driving colour-coded with respect to
different road structures. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . . 197
D.5 Confusion Matrix for both Risk and Hurry Classification. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 201

D.6 Average driving velocity in different laps. The two-sided Wilcoxon
signed-rank test demonstrates a significant difference in the simulator
and track driving with t = 0.0, p = 0.0156. ©2023 by SCITEPRESS
(CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
D.7 Average accelerator pedal position across all the laps and the two-sided
Wilcoxon signed-rank test demonstrate a significant difference in the
simulator and track driving with t = 0.0, p = 0.0156. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 203
D.8 GPS coordinates with varying driving velocity for a random participant
in laps 1 – 6. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . 203
D.9 Feature importance values are extracted from GBDT, SHAP & LIME,
normalized and illustrated with horizontal bar charts for corresponding
classification tasks. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . . 207
D.10 Low fidelity prototype of proposed drivers’ behaviour monitoring system
for simulated driving. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . 208

E.1 Overview of the proposed approach of evaluating the additive feature
attribution methods for regression. . . . . . . . . . . . . . . . . . . . . 223
E.2 Overview of the implementation of the proposed approach for evaluating
SHAP and LIME for regression. . . . . . . . . . . . . . . . . . . . . . . . 225
E.3 Illustration of the separation of the average values of target variable for
each cluster considering one to 11 clusters in the dataset. . . . . . . . . 226
E.4 Distribution of the target variable from the dataset with six clusters that
is the main choice for evaluating the additive feature attribution methods. . . 228
E.5 Bar plot illustrating the JSD measures between the distributions of the
continuous features from the original and synthetic datasets. . . . . . . . 229
E.6 Feature ranks of the most important features from XGBoost and AddCBR.232
E.7 Impact on the prediction by changing the values of the continuous
features with (a) most and (b) least importance based on their
contribution to feature attribution from AddCBR. . . . . . . . . . . . . 234
E.8 Evaluation of feature attribution from SHAP and LIME considering
AddCBR as the baseline with scatter plots for CQV of the feature values
and contributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
E.B.1 Different methods of selecting the optimal number of clusters. . . . . . . 250
E.C.1 Distribution of the target variable from the two- and eight-cluster datasets. . . 251
E.C.2 Bar plot illustrating the JSD measures between the distributions of the
continuous features from the real and synthetic datasets with (a) two
and (b) eight clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
E.C.3 Feature ranks of the most important features from XGBoost and
AddCBR for the two-cluster dataset. . . . . . . . . . . . . . . . . . . . . 253
E.C.4 Feature ranks of the most important features from XGBoost and
AddCBR for the eight-cluster dataset. . . . . . . . . . . . . . . . . . . . 253
E.C.5 Evaluation of feature attribution from SHAP and LIME considering
AddCBR as the baseline with scatter plots for CQV of the feature values
and contributions for the (a) two- and (b) eight-cluster datasets. . . . . 256

F.1 Example of explanation generated for a single instance of flight TOT
delay prediction using LIME. . . . . . . . . . . . . . . . . . . . . . . . . 263

F.2 Overview of the mechanism of the proposed iXGB. . . . . . . . . . . . . 264
F.3 Prediction Performance of XGBoost in terms of MAE for flight delay
prediction with different numbers of features ranked by XGBoost feature
importance from two different subsets of the data. . . . . . . . . . . . . 267
F.4 Comparison of prediction performance of iXGB and LIME in terms of
MAE with three different datasets. . . . . . . . . . . . . . . . . . . . . . 269

List of Tables

3.1 Summary of the datasets from the domain of RS. . . . . . . . . . . . . . 20


3.2 Summary of the aviation dataset. . . . . . . . . . . . . . . . . . . . . . . 21

4.1 Performance of different models for classifying drivers’ MWL utilising
AE-extracted features compared to manually extracted features. . . . . 31
4.2 Local accuracy in terms of MAE and nDCG values for SHAP and LIME
while explaining flight TOT delay prediction. . . . . . . . . . . . . . . . 37

A1.1 Traffic flow intensity in the experimental area during a day retrieved
from General Plan of Urban Traffic of Bologna, Italy. ©Springer Nature
Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A1.2 Number of features selected from different techniques. ©Springer
Nature Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . 78
A1.3 Parameters used in different classifiers. ©Springer Nature Switzerland
AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A1.4 Average performance measures of classifiers applied on traditionally
extracted features. ©Springer Nature Switzerland AG 2019. . . . . . . 81
A1.5 Average performance measures of classifiers applied on features
extracted by CNN-AE. ©Springer Nature Switzerland AG 2019. . . . . 81

A2.1 Mapping among different EEG channels, three significant frequency
rhythms and ID of features. ©2020 by Islam et al. (CC BY 4.0). . . . . 104
A2.2 List of different feature sets and corresponding number of features used
for validating the proposed methodology. ©2020 by Islam et al. (CC
BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
A2.3 Parameters used in building different models for prediction and
classification tasks. ©2020 by Islam et al. (CC BY 4.0). . . . . . . . . . 106
A2.4 The 10-fold CV summary in terms of Mean Absolute Error and Mean
Squared Error for predicting MWL score using EEG and Mutual
Information (MI)-based features. ©2020 by Islam et al. (CC BY 4.0). . 109
A2.5 Summary of one-sided Wilcoxon signed-rank tests on the average
performance in 10-fold CV of classification tasks by different classifiers
trained with MI and EEG based features. ©2020 by Islam et al. (CC
BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
A2.6 Summary of DeLong’s test to compare Area Under the Curve (AUC)
values at significance level 0.05 (5.00 × 10⁻²). ©2020 by Islam et al.
(CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

A2.7 Performance summary of classifying Low and High MWL with LgR,
MLP, SVM and RF classifier models using EEG and MI-based feature
on the holdout test set. ©2020 by Islam et al. (CC BY 4.0). . . . . . . 111
A2.8 Performance summary of classifying Car and Pedestrian events with
LgR, MLP, SVM and RF classifier models using EEG and MI-based
feature on the holdout test set. ©2020 by Islam et al. (CC BY 4.0). . . 112

B.1 Inclusion and exclusion criteria for the selection of research articles.
©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 132
B.2 Questions for checking the validity of the selected articles. ©2022 by
Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.3 List of prominent features extracted from the selected articles. ©2022
by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . 135
B.4 List of references to selected articles published on the methods of XAI
from different application domains for the corresponding tasks. ©2022
by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . 138
B.5 Different models used to solve the primary task of classification or
regression and their study count. ©2022 by Islam et al. (CC BY 4.0). . 142
B.6 Methods for explainability, stage and scope of explainability, forms of
explanations and the type of models used for performing the primary
tasks. ©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . 149

C.1 Summary of the designed convolutional encoder. ©2021 IEEE. . . . . . 183


C.2 Performance summary of mental workload classification using RF and
SVM classifier model on the holdout test set. ©2021 IEEE. . . . . . . . 185

D.1 Associated scenarios for the laps of the experimental simulator and track
driving with varying driving conditions. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
D.2 List of features extracted from vehicular signals. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 197
D.3 List of biometric features considering different frequency bands of EEG
signal. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . 198
D.4 Summary of the datasets from the simulator and track experiments for
risk and hurry classification. ©2023 by SCITEPRESS (CC BY-NC-ND
4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
D.5 Parameters used in tuning different AI/ML models for classifying risk
and hurry in driving behaviour with 5-fold cross validation. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 200
D.6 Performance measures of risky behaviour classification with the AI/ML
models trained on the holdout test set of different datasets. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 204
D.7 Performance measures of hurry classification with the AI/ML models
trained on the holdout test set of different datasets. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 204
D.8 Summary of model performances in terms of accuracy across different
datasets and classification tasks. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

D.9 Pairwise comparison of performance metrics for SHAP and LIME on
combined Xtest (holdout test set) for risk and hurry. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 206

E.1 Methods, metrics or axioms used for evaluating XAI methods with
references to the works in which they were proposed or employed. . . . . 220
E.2 Summary of the generated synthetic datasets for evaluation. . . . . . . . 227
E.3 MAE and standard deviation (σAE ) of XGBoost and AddCBR
predicting flight delay. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
E.4 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their importance from
XGBoost and AddCBR. . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
E.5 The maximum (maxnDCG ), average (µnDCG ), and standard deviation
(σnDCG ) of nDCG scores for the feature ranking from SHAP and LIME
for all the test instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
E.6 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their contributions from
SHAP and LIME. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
E.A.1 List of features in the flight delay dataset and their associated identifiers
used to refer to the features in the article. . . . . . . . . . . . . . . . . . 248
E.C.1 MAE and standard deviation (σAE ) of XGBoost and AddCBR
predicting flight TOT delay with the two- and eight-cluster datasets. . . 252
E.C.2 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their importance from
XGBoost and AddCBR for the two-cluster dataset. . . . . . . . . . . . . 254
E.C.3 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their importance from
XGBoost and AddCBR for the eight-cluster dataset. . . . . . . . . . . . 254
E.C.4 The maximum (maxnDCG ), average (µnDCG ), and standard deviation
(σnDCG ) of nDCG scores for the feature ranking from SHAP and LIME
for all the test instances from the two-cluster dataset. . . . . . . . . . . 255
E.C.5 The maximum (maxnDCG ), average (µnDCG ), and standard deviation
(σnDCG ) of nDCG scores for the feature ranking from SHAP and LIME
for all the test instances from the eight-cluster dataset. . . . . . . . . . . 255
E.C.6 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their contributions from
SHAP and LIME for the two-cluster dataset. . . . . . . . . . . . . . . . 257
E.C.7 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their contributions from
SHAP and LIME for the eight-cluster dataset. . . . . . . . . . . . . . . . 257

F.1 Summary of the datasets used for evaluating the performance of iXGB. 267
F.2 Coverage scores of the rules extracted from iXGB and LIME. . . . . . . 270
F.3 Set of counterfactuals generated using iXGB from the Auto MPG dataset. . . 271
F.4 Set of counterfactuals generated using iXGB from the Boston Housing
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
F.5 Set of counterfactuals generated using iXGB from the Flight Delay dataset. . . 272

List of Abbreviations & Acronyms

AddCBR Additive Case-based Reasoning


AE Autoencoder
AFA Additive Feature Attribution
AI Artificial Intelligence
ARTE Automated Artifact Handling in Electroencephalography
ATCO Air Traffic Controller
ATFM Air Traffic Flow Management
AUC Area Under the Receiver Operating Characteristic Curve
CBR Case-based Reasoning
CNN Convolutional Neural Network
CQV Coefficient of Quartile Variation
CV Cross Validation
DALEX Model-Agnostic Language for Exploration and Explanations
DARPA Defense Advanced Research Projects Agency
DL Deep Learning
DSS Decision Support System
EEG Electroencephalography
EOBT Estimated Off-block Time
ETFMS Enhanced Tactical Flow Management System
GBDT Gradient Boosted Decision Trees
GDPR General Data Protection Regulation
GPS Global Positioning System
iXGB Interpretable Extreme Gradient Boosting
JSD Jensen-Shannon Divergence
KLD Kullback-Leibler Divergence
kNN k-Nearest Neighbours
LgR Logistic Regression
LIME Local Interpretable Model-agnostic Explanations
LnR Linear Regression
LOO Leave-one-out
LSTM Long Short-term Memory
MAE Mean Absolute Error

MI Mutual Information
ML Machine Learning
MLP Multi-layer Perceptron
MSE Mean Squared Error
MWL Mental Workload
nDCG Normalised Discounted Cumulative Gain
NN Neural Networks
RBF Radial Basis Function
RC Research Contribution
RF Random Forest
ROC Receiver Operating Characteristic
RQ Research Question
RS Road Safety
SHAP Shapley Additive Explanations
SLR Systematic Literature Review
SVM Support Vector Machine
TOT Take-off Time
XAI Explainable Artificial Intelligence
XGBoost Extreme Gradient Boosting

Part I

Thesis

Chapter 1

Introduction

This chapter introduces the research context and presents the problem
formulation, objectives, research questions and research contributions of the
doctoral research study, followed by the outline of this thesis.

A verdict without a proper explanation is always questionable. The verdict could be
a grade given to a student's exam script by a teacher, a decision on a person's loan
application by a banker, a disease diagnosis of a patient by a physician, and so on.
Situations like these, where a verdict or outcome is declared without any feedback on
how the situation could be improved or how an unwanted event could be avoided,
raise credibility issues. This inquisitive human behaviour underscores the need for
research on interpretable or explainable systems based on Artificial Intelligence (AI)
or Machine Learning (ML), since traditional systems generate predictions or decisions
without presenting their reasoning explicitly. Over time, the demand for explainability
in AI- or ML-based systems has grown continuously and escalated sharply in recent
years due to three major events. First, the Defense Advanced Research Projects
Agency (DARPA) funded the Explainable AI (XAI) Program in early 2017 (Gunning
& Aha, 2019). Later that year, the Chinese Government released the Development
Plan for New Generation of AI to encourage high explainability and strong
extensibility of AI (Xu et al., 2019). Last but not least, in mid-2018, the European
Union granted its citizens a Right to Explanation when affected by algorithmic
decision-making by publishing the General Data Protection Regulation (GDPR)
(Wachter et al., 2018). Together, these mandates have driven the current abundance
of XAI research aimed at incorporating explainability into prevailing systems.
XAI is regarded as the third wave of AI systems (Gunning, 2017). The first wave of
AI was brought to light through rule-based algorithms defined by experts, and many
applications in use today still contain rule-based AI. With the growth of data, the
second wave of AI emerged, grounded in statistical learning, i.e., learning from the
characteristics of the data. AI systems developed with statistical learning perform
classification or clustering tasks and generate probability values for their decisions,
but they lack justification or explanation.


After decades, XAI has been proposed by researchers as the third wave, which
can overcome the barrier of explainability and enable end users to understand
and effectively manage the emerging generation of AI systems (Gunning & Aha,
2019). However, many AI or ML models remain opaque, unintuitive, and
incomprehensible to humans regarding their inference mechanisms (Ribeiro et al.,
2016; Guidotti et al., 2019; Mueller et al., 2019). Such models, e.g., Deep
Learning (DL) models, Support Vector Machines (SVM), etc., are often termed
black box models since it is not clear to end users how they reach a decision,
i.e., the inference mechanism is not explicit. These black box models underpin
many prevailing systems that assist humans in decision-making tasks because of
their commendable decision accuracy. On the other hand, they lack interpretability
and/or explainability, since their inference mechanisms are not transparent to the
end users of the decision-making tasks. Hence, the term transparency has emerged
to denote the opposite characteristic of black box models, i.e., an understanding of
the mechanism by which a model works to assist end users in decision-making
(Adadi & Berrada, 2018; Lipton, 2018; Barredo Arrieta et al., 2020).
Decision Support Systems (DSS) comprise an area in Information Systems, which
focuses on supporting and improving the decision-making process by humans (Arnott
& Pervan, 2014). There are different types of DSS, e.g., expert systems, analytic
systems, recommender systems, etc., that are used in various application domains
(Arnott & Pervan, 2014). DSS are one of the most facilitated applications of
traditional AI models (Negnevitsky, 2004). In addition, the conventional belief in
DSS literature is that when decision-makers are presented with enhanced processing
capabilities, they are likely to employ them for a more in-depth analysis of problems,
resulting in improved decision-making (Todd & Benbasat, 1992). Thus, the research
on developing explainable models in the DSS is emerging with the goal of enhancing
the transparency of the decision-making process. Adding explainability to a DSS
refers to incorporating details of the inference mechanism of the models that
present the initial decision to humans. The traditional AI-based DSS face the
challenge of being a black box system, where it is hard to understand the reasoning
behind a specific decision or output. Without explanations, users might face
challenges in recognizing and dismissing inaccurate recommendations from a system,
and they may hesitate to accept valid advice from the system, even when it is
accurate (Lacave & Díez, 2002, 2004; Martens & Provost, 2014). XAI investigates
different methodologies to address this challenge by generating interpretable and
human-understandable explanations for the decisions of AI-based models. Such
explanations would equip practitioners with the operational awareness to make the
final decision, thus alleviating the acceptability issues. Researchers are currently
investigating DSS built with traditional AI models in several safety-critical domains,
such as Road Safety (RS) and Air Traffic Flow Management (ATFM). In these domains, the availability
of a transparent and interpretable AI system would enable better-informed decisions,
increased trust, and reduced liability. This also aligns with the running hypothesis
of XAI research, that is to build more transparent, interpretable, or explainable
systems so that the users will be better equipped to understand and therefore trust
the intelligent agents (Mercado et al., 2016; Miller, 2019).


1.1 Research Goal and Objectives

Considering the urge for explainability in an AI-supported DSS, especially in the
domains of RS and ATFM, this doctoral research exploits existing XAI methods and
also proposes novel XAI methods to enhance the transparency of DSS. To this end,
the main goal of this thesis is:
Advancing the research for XAI by developing and evaluating explainable
models to enhance transparency that can be used to further the
development of DSS in different applications of RS and ATFM.
In order to achieve the research goal, the following objectives are set for this
doctoral research:
• Review of the literature on XAI methodologies deployed in various application
domains.
• Development of explainable models for assessing the in-vehicle state and
behaviour of drivers for RS.
• Development of an explainable model for predicting flight take-off time (TOT)
delays with feature attribution for ATFM.
• Development of methods for generating explanations in different forms to
support AI models’ decisions.
• Development of a plausible evaluation approach for XAI methods.

1.2 Problem Formulation

With the recent developments of AI and ML algorithms, people from various
application domains have shown increasing interest in taking advantage of this
development. As a result, AI and ML are being used today in every application
domain. Different AI or ML models are being employed to complement humans
with decision support in various tasks from diverse domains, such as education,
construction, healthcare, news and entertainment, travel and hospitality, logistics,
manufacturing, law enforcement, and finance (Rai, 2020). While these models
are meant to help users in their daily tasks, they still face acceptability issues
(Loyola-Gonzalez, 2019). Users often remain doubtful about the proposed decisions.
In worse cases, users oppose the AI or ML model’s decision since their inference
mechanisms are mostly opaque, unintuitive, and incomprehensible to humans. For
example, today, DL models demonstrate convincing results with improved accuracy
compared to established models. The outstanding performances of DL models hide
one major drawback, i.e., the underlying inference mechanism remains unknown to
a user. In other words, the DL models function as a black box (Guidotti et al.,
2019). In general, almost all the prevailing DSS built with AI or ML models do
not provide additional information to support the inference mechanism that makes
them nontransparent. Thus, it has become a sine qua non to investigate how the
inference mechanism or the decisions of AI or ML models can be made transparent
to humans so that these DSS become more acceptable to the users of different

application domains (Loyola-Gonzalez, 2019). This issue raises interest in exploring
the techniques employed to ensure explainability in AI applications, which leads to
defining the first research question (RQ) of this doctoral research.
In accordance with the stipulations of the research projects (stated in the
preamble of Chapter 3) supporting this doctoral study, RS and ATFM are set as
the two concerned domains. RS is an important issue as it is evident in literature
that more than 90% of traffic crashes occur from errors committed by drivers (Sam
et al., 2016). However, driving is a complex and composite task. Several dynamic and
complex secondary tasks, e.g., simultaneous cognitive, visual, and spatial tasks are
involved in driving. Diverse secondary tasks, along with natural driving in different
road environments, cause increased mental workload (MWL) and risky behaviour of
drivers that lead to errors in traffic situations (H. Kim et al., 2018). An alarming
number of traffic accidents due to increased MWL and risky behaviour underscores the
need to monitor these characteristics efficiently. On the other hand, DSS developed
with AI models are now widely investigated in the domain of ATFM in several
online and offline tasks (Degas et al., 2022). One of the major tasks for ATFM
practitioners is to observe flown data of flights and predict flight TOT delays and
propagated delays to maintain the demand-capacity balance (Dalmau et al., 2021).
In several cases, practitioners accept the decision of the AI models only after weighing
it against their own expertise, and associated explanations for the decisions are often
expected from the models. In summary, the recent worldwide urge
for explainability in systems developed with AI or ML models demands explanations
from the models for specific decisions and for the whole model as well. Thus, the
second RQ is defined to exploit the ways of generating local and global explanations
for black box models to enhance transparency in DSS. Relatedly, another issue is tied
to the unfortunate scarcity of ideal metrics and methods for evaluating the developed
XAI models, which paves the way for investigating plausible techniques to evaluate
explainability methods.
On another note, the variation in MWL is well reflected in the frequency-domain
features of electroencephalography (EEG) signals (Di Flumeri et al., 2018,
2019). However, determining drivers' MWL using EEG signals requires considerable
effort in collecting the EEG, and it is quite uncomfortable for a person to wear the
electrode cap that captures the EEG signals while performing important tasks like
driving. Moreover, feature extraction from EEG signals is done using theory-driven
manual analytic methods that demand substantial time and effort (Tzallas et al.,
2009; Ahmad et al., 2014). In response to this situation, this study also aims to
explore methods to automate feature extraction for measuring drivers' in-vehicle
MWL and behaviour, subsequently developing an explainable model with
human-understandable features to induce data transparency in the DSS of RS.

1.3 Research Questions

To generalize the formulated problems, set the specific objectives for this doctoral
research, and upon completion, to facilitate the validation of the research with
concrete outcomes, the following RQs are asserted:
RQ1: How are the XAI methods implemented and evaluated across various
application domains?


The literature contains evidence of diverse XAI methods to make traditional
AI models explainable and interpretable. This RQ is formulated to investigate
the appropriate XAI methods for specific tasks in different applications and
domains. Furthermore, the existing evaluation approaches for XAI methods
are also investigated while addressing this RQ.
RQ2: How can the XAI methods be exploited to enhance transparency in DSS?
Literature on XAI presents a variety of XAI techniques to induce transparency
in the existing AI-based DSS. Through answering this RQ, appropriate
methods are identified, implemented, and evaluated to enhance the
transparency of the DSS for the domains of RS and ATFM.

RQ2.1: How can the feature extraction techniques be enhanced to develop
features that are comprehensible to humans?
This RQ is formulated concerning the specific need to explain the
abstract features extracted by DL methods to humans. These
abstract features are efficiently analyzed by the DL models to
generate accurate predictions. But, while explaining the features
and their contribution to the prediction, the features remain
incomprehensible to human users. Answering this RQ produced
a novel method that can represent the features generated by
autoencoders in a comprehensible and understandable form to
human users.
RQ2.2: How can the explanations be derived for the decisions of AI or ML
models?
According to the definition of the scope of explainability, presented
in section 2.1, explanations can be provided for the decision of an
AI model at local and global scope. Though the answers to this RQ,
specific methods are studied extensively to generate local and global
explanations, followed by the development of explainable models for
the target applications of RS and ATFM.
RQ2.3: How can the explainability methods be evaluated with plausible
techniques?
It is evident in the literature that, unlike for traditional AI models,
there is a scarcity of standard evaluation metrics for explainable
methods. By addressing this RQ, a standard and generalized method is
developed to assess the quality of the explanations.

1.4 Research Contributions

The RQs outlined in Section 1.3 are addressed with several research contributions
(RC) in this thesis. Here, the contributions are outlined with brief descriptions, along
with mentions of their dissemination in the included research papers:

RC1: Investigation of the current developments in XAI.


Due to the current surge of XAI research, a huge number of publications regularly
appear covering different aspects of explainability. Through a proper
literature review, the noteworthy developments in XAI are summarised
as a crib sheet for initiating research work on XAI methodologies with
quick references. This contribution addresses the RQ1 and includes specific
contributions from different perspectives:

RC1.1: From a generic perspective of different application domains and tasks.


RC1.2: From a specific perspective of evaluating XAI methods.

The RC1 has been disseminated in two of the included papers in this
thesis. Paper B contains a systematic literature review of the exploited
XAI methods across different domains and applications, which corresponds
to RC1.1. The RC1.2 is asserted particularly for the developments in the
approaches of evaluating XAI research upon realizing that there exists a
need for a consensus on the evaluation methods for XAI. Paper B presents
the evaluation approaches for XAI in different domains and applications.
Particularly, the approaches for evaluating XAI methods through quantitative
experiments are discussed in the background section of Paper E.
RC2: Development of applications of XAI for RS and ATFM domains.
DSS developed with AI models are already prevailing in the safety-critical
domains of RS and ATFM. However, these domains require further attention
in terms of XAI research to enhance the transparency and acceptability of
the prevailing systems. To address the issue, the following contributions are
made to enhance the transparency of the systems developed with explainable
methodologies for the domain of RS and ATFM:

RC2.1: Algorithms to construct comprehensible features for MWL assessment.
RC2.2: Quantification and classification of MWL with explanation while
driving.
RC2.3: Classification of driving behaviours and interpretation of the models’
decisions.
RC2.4: Explainable model for flight TOT delay prediction.

The first three contributions are for the domain of RS. Papers A1 and A2
both disseminate the RC2.1, i.e., the construction of a feature set that is
comprehensible to humans for interpreting drivers’ MWL assessment model.
Though RC2.1 is not directly connected to the methodologies of XAI, it
contributes to data transparency and the outcome of the corresponding
studies influenced the use of the developed methodology in RC2.2, and this
contribution is disseminated in Papers A1, A2, and C. The RC2.3 is about the
development of explainable driving behaviour classification models, and their
evaluations are presented in Paper D. The RC2.4 is asserted for the domain of
ATFM, which concerns the development of an explainable flight TOT delay
prediction model. The development of the explainable model is described in
Section 4.2.1 of this thesis. In addition, the flight TOT delay prediction model
is used in the process of developing an evaluation approach for XAI methods
and generating rule-based and counterfactual explanations, which are presented
in Papers E and F, respectively.
In brief, concerning the RQs of this doctoral research, RC2.2, RC2.3, and
RC2.4 correspond to RQ2.2 in terms of RS and ATFM. In addition, RC2.3
and RC2.4 correspond to RQ2.3. Finally, RQ2.1 is addressed jointly by RC2.1
and RC2.2.
RC3: Advancement of the XAI research field.
The research field of XAI is continuously growing, and diverse methods are
evolving regularly. Still, the literature indicates that certain aspects need
further attention to advance XAI research, especially the evaluation of XAI
methods. To this end, the following contributions are asserted to advance the
emerging field of XAI research:
RC3.1: Methods for generating explanation in various forms.
RC3.2: Approach for evaluating XAI methods.
Two novel methods for generating explanations are developed in this doctoral
research, which together constitute RC3.1. The first method is an
additive form of Case-based Reasoning (CBR), namely AddCBR. The second
method is developed to interpret the decision of Extreme Gradient Boosting
(XGBoost) models. The method is named Interpretable XGBoost (iXGB).
The development of the methods AddCBR and iXGB are discussed in Papers
E and F, respectively. The last but not least contribution of this doctoral
thesis, RC3.2 concerns the development of an approach to evaluate XAI
methods using a synthetic dataset that captures the intrinsic behaviour of the
original data. The evaluation method is described in Paper E. As a whole,
the RC3 encapsulates the contributions to the advancements of the research
field of XAI and thus addresses specifically RQ2.2 and RQ2.3.

1.4.1 Mapping of Research Questions, Contributions and Papers


The outlined contributions of this doctoral research are presented through seven
research papers where the RQs defined in Section 1.3 are concurrently addressed.
Figure 1.1 illustrates a generalized mapping of the RQs, RCs and the included papers.

1.5 Thesis Outline

The structure of this thesis is twofold and is organised as follows:

• Part I – Thesis
This part provides a comprehensive overview of the thesis. It contains the
introduction to the thesis in Chapter 1, background and related work in Chapter
2, materials and methods used in this doctoral study are discussed in Chapter 3,
descriptions of the developed applications, their evaluation methods and results
are presented in Chapter 4, the included papers are summarised in Chapter 5
along with the RCs and, finally, the discussion on the RQs, limitations,
conclusion and future research directions is presented in Chapter 6.

[Figure 1.1 is a three-column diagram linking the Research Questions (RQ1, RQ2,
RQ2.1–RQ2.3), the Research Contributions (RC1–RC3 with their sub-contributions),
and the included Papers A1, A2, B, C, D, E, and F.]

Figure 1.1: Generalized mapping of the RQs, RCs, and the included papers. For
a concise presentation, the titles of the papers and the RCs are presented with
abbreviations, and the sequence of the papers is rearranged to minimize the overlapping
of the links.
• Part II – Included Papers
This part consists of the seven papers that are included in this thesis. The
papers are reformatted to comply with this thesis layout from the versions
they were published in or submitted to the respective journals or conferences.

Chapter 2

Background and Related Works

This chapter contains the foundational insights into the research domain
and presents a critical review of the existing literature within the specific
applications addressed in this thesis.

2.1 Explainable Artificial Intelligence

The theoretical aspects of Explainable Artificial Intelligence (XAI) are concisely
presented in this section. Here, the aspects are discussed from a technical point
of view for a better understanding of the contents of this study. Notably, the
philosophy and taxonomy of XAI have been excluded as they are out of the
scope of this study.

2.1.1 Concepts of Explainability


The prime hindrance towards developing the ground knowledge of explainability
concerning Artificial Intelligence (AI) is the interchangeable use of several terms
in the literature, such as interpretability, transparency, explainability etc. Before
proceeding to the literature review, the commonly used terms are presented
briefly according to the definitions compiled by Barredo Arrieta et al. (2020).
Understandability, often also termed Intelligibility, is the characteristic of a
model that lets a user realise how the model works, without requiring any further
explanation of the model's internal operations on the data. A similar term,
Comprehensibility, has been used to denote the ability of an ML model to represent
its learned knowledge to humans in an understandable way. Clearly, these two terms
differ in whether they concern the internal operations on the data or the knowledge
acquired from the data. In addition, the terms Interpretability and Transparency
are mostly used to describe concepts similar to Explainability. Interpretability
refers to a model's ability to provide meaning or to explain itself in a way
understandable to human beings, whereas transparency of a model indicates its
ability to be understood by humans. Above all, the term explainability refers to
the interface between humans and
decision-makers, which is concurrently comprehensible to humans and an accurate


representation of the decision-maker (Guidotti et al., 2019). In XAI, explainability is
the interface between the models and the end-users through which an end-user gets
clarification on the decisions received from an AI or ML model.

2.1.1.1 Stage of Explainability


The AI models learn the underlying characteristics of the available data and
subsequently try to classify, predict or cluster new data. The stage of explainability
refers to the period in the process mentioned above when a model generates the
explanation for the decision it provides. The stages are found to be ante-hoc and
post-hoc (Vilone & Longo, 2020). Brief descriptions of the categorised methods based
on these stages are given below and illustrated in Figure 2.1.

(a) Ante-hoc. (b) Post-hoc.

Figure 2.1: Different stages of adding explainability to black box models.

• Ante-hoc methods typically involve generating explanations for decisions from
the initial stages of training on the data, with the goal of achieving optimal
performance. These methods are commonly employed to produce explanations
for transparent models like Fuzzy models, Tree-based models, etc. The
corresponding explanations include probability values or decision paths from
the internal structure of the classification or prediction models.
• Post-hoc methods involve the incorporation of an external or surrogate model
alongside the base AI model. The base model remains unaltered, while
the external model emulates the behaviour of the base model to produce
an explanation for users. Typically, these methods are linked to models
where the inference mechanism remains undisclosed to users, e.g., Support
Vector Machines (SVM), Neural Networks (NN), etc. Furthermore, post-hoc
methods can be categorized into two types: model-agnostic and model-specific.
Model-agnostic methods can be applied to any AI model, while model-specific
methods are limited to particular models. In current research trends, post-hoc
model-agnostic methods are exploited more often because of their ability to be
easily incorporated with existing AI models (Islam et al., 2022).
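
To make the distinction concrete, the following minimal sketch (in Python, assuming
scikit-learn is available; the dataset, models, and feature names are illustrative
placeholders rather than material from this thesis) trains an opaque model and then
fits a shallow decision tree on the opaque model's own predictions. It is a generic
illustration of the post-hoc, model-agnostic surrogate idea, not a specific method
from the included papers.

# Illustrative post-hoc, model-agnostic explanation via a global surrogate model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Placeholder data and black box model.
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The surrogate is trained on the black box's outputs, not on the true labels,
# so it mimics the black box's behaviour while remaining human-readable.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often the surrogate agrees with the black box on the data.
fidelity = surrogate.score(X, black_box.predict(X))
print(f"Surrogate fidelity to the black box: {fidelity:.2f}")
print(export_text(surrogate, feature_names=feature_names))

The fidelity score indicates how faithfully the readable surrogate reproduces the
black box, while the base model itself is never altered, which is the defining trait
of the post-hoc stage.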


2.1.1.2 Scope of Explainability


The extent of an explanation generated by explainable methods is defined by the
scope of explainability. In a literature review study of scientific articles on XAI,
Vilone and Longo (2020) identified two primary scopes: global and local. At the global
scope, the entire inferential process of a model is made transparent or understandable
to the user. An example is a Decision Tree (DT), as depicted in Figure 2.2a.
Conversely, at the local scope, a single instance of inference is explicitly presented to
the user. In the case of a DT, a single branch or decision path can be considered a
local explanation that is illustrated in Figure 2.2b.

(a) Global. (b) Local.

Figure 2.2: Different scopes of adding explainability to black box models illustrated
with an example decision tree. The nodes in black colour refer to decision nodes for
a single feature. The green and red edges in the tree refer to positive and negative
outcomes, respectively, for satisfying the conditions.
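
To make the two scopes concrete, the following minimal sketch (in Python, assuming
scikit-learn; the dataset is a standard placeholder) prints the full tree structure as a
global explanation and then the decision path of a single instance as a local
explanation, mirroring the distinction illustrated in Figure 2.2.

# Illustrative global vs. local explanation scopes with a decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Global scope: the entire inferential structure of the model.
print(export_text(clf, feature_names=list(data.feature_names)))

# Local scope: the single decision path followed for one instance.
x = data.data[:1]
path = clf.decision_path(x)                      # sparse indicator of visited nodes
leaf = clf.apply(x)[0]
for node in path.indices[path.indptr[0]:path.indptr[1]]:
    if node == leaf:
        print(f"leaf {node}: predicted class = {data.target_names[clf.predict(x)[0]]}")
    else:
        feature = data.feature_names[clf.tree_.feature[node]]
        threshold = clf.tree_.threshold[node]
        branch = "<=" if x[0, clf.tree_.feature[node]] <= threshold else ">"
        print(f"node {node}: {feature} {branch} {threshold:.2f}")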

2.1.1.3 Transparency of AI Models


Transparency (Lipton, 2018) is synonymous with model interpretability, i.e., some
sense of understanding of the working logic of the model (Došilović et al., 2018).
From the definitions by Barredo Arrieta et al. (2020), transparency can be achieved
in three different forms:
• Simulatability refers to a model’s capacity to be simulated or mentally processed
exclusively by a human; thus, complexity assumes a prominent role within this
category. This characteristic aligns with the properties of the models that can
be easily presented to a human by means of text and visualisations (Ribeiro
et al., 2016).
• Decomposability refers to the capacity to explain each component of a model,
including input, parameters, and calculations. This characteristic empowers the
ability to understand, interpret or explain the behaviour of a model. In reality,
very few models contain decomposability, e.g., DT, k-Nearest Neighbours, etc.
• Algorithmic Transparency refers to the user’s capacity to grasp the procedures
employed by the model in producing a particular output from its input
data. The primary limitation of algorithmically transparent models lies in the
necessity for the model to be entirely comprehensible through mathematical
analysis and methods.


2.1.2 Explainable Artificial Intelligence Methods


During the past couple of years, research on developing theories, methodologies
and tools of XAI has been very active, and over time, the popularity of XAI as a
research domain has continued to increase. New methods for adding explainability
to AI models are regularly developed. A recent literature review study found
that 102 different explainability methods were presented in 128 published research
papers. Among these methods, most fall into the categories of Additive Feature
Attribution (AFA), association rule-based explanations, saliency maps, etc. Different
forms of explanations are generated by these methods, e.g., numeric, rules,
visualisations, textual, and combinations thereof (Islam et al., 2022). In this study,
mostly the class of AFA methods is exploited, from which explanations can be
generated in a combined form of numeric values and visualisations.

2.1.2.1 Additive Feature Attribution


The most common class of post-hoc, model-agnostic explanation methods is AFA
(Islam et al., 2022). Lundberg and Lee (2017) first identified the class and proposed
the tool SHAP – Shapley Additive Explanations, which is based on the distributed
gain in coalition game theory utilising Shapley values (Shapley, 1953), thus inheriting
their properties. The class is referred to as additive because of the efficiency property
of Shapley values (Shapley, 1953), which states that the gains shared by all players
in a coalition game equal the value of the grand coalition. This property becomes
local accuracy for AFA methods (Equation 2.1), where g is the explanation model.
Local accuracy holds when g(z_i) matches the model prediction r(x_i) for each data
instance x_i. Here, g(z_i) is computed on the binary vector z_i ∈ {0, 1}^m, obtained by
transforming x_i through a mapping function h(x_i), where m is the number of features:

    g(z_i) = \phi_0 + \sum_{j=1}^{m} \phi_j z_{ij}    (2.1)

Another tool, LIME – Local Interpretable Model-agnostic Explanations (Ribeiro
et al., 2016), is also an AFA method, proposed before the establishment of the term
AFA. LIME fits a linear regression to explain the behaviour of the model around a
sample point. To obtain points for fitting the linear regression, LIME randomly
perturbs the instance to be explained and weights the generated points by their
proximity to the target instance. The coefficients of the linear regression in LIME
are used to produce the ϕ values of Equation 2.1 and to predict the output of the
model as g(z_i).
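
As a minimal, hedged illustration of how these two AFA tools are queried in practice
(assuming the shap and lime packages and scikit-learn are installed; the regression
data and model are placeholders), the sketch below obtains ϕ values from both tools
and checks SHAP's local accuracy by comparing ϕ_0 plus the summed attributions
against the model prediction, i.e., Equation 2.1.

# Illustrative use of SHAP and LIME as additive feature attribution methods.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder data and regression model.
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
x = X[:1]

# SHAP: phi_0 is the expected model output, phi_j are the Shapley values.
explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(x)[0]
phi_0 = float(np.atleast_1d(explainer.expected_value)[0])
print("model prediction:   ", model.predict(x)[0])
print("phi_0 + sum(phi_j): ", phi_0 + phi.sum())    # local accuracy check (Eq. 2.1)

# LIME: coefficients of a local linear model act as the phi values.
lime_explainer = LimeTabularExplainer(X, feature_names=feature_names, mode="regression")
lime_exp = lime_explainer.explain_instance(x[0], model.predict, num_features=5)
print(lime_exp.as_list())    # (feature condition, contribution) pairs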

2.1.3 Evaluation of Explainable Artificial Intelligence Methods


The evaluation of XAI methods has become increasingly important as the urge for
transparent and interpretable AI models grows. In fact, evaluation approaches in
XAI pose challenges. The emphasis on user-centric explainability often results in
the quality of explanatory information outputs being dependent upon factors such
as the user, the domain, and the application (Doshi-Velez & Kim, 2017; Gunning,
2017; Mueller et al., 2019; Barredo Arrieta et al., 2020; J. Zhou et al., 2021). The
development of methodologies or definitions of metrics to evaluate the explanation
generation techniques, as well as to assess the quality of the generated explanations,
has not kept pace with the sharp increase in research works devoted to exploring new
methodologies of XAI. A recent review study has shown that the majority (i.e., 81%)
of the publications presenting applications, in which AI models are deployed on
real-world problems and explanations are generated, do not evaluate the explanations
(Mainali & Weber, 2023). In another literature study on 137 articles, only 9 articles were found
& Weber, 2023). In another literature study on 137 articles, only 9 articles were found
to be fully intended for the evaluation and metrics for XAI. However, all the articles
proposing new methods to add explainability considered one of the three techniques
to assess their explainable model or the explanations generated by the models.
These techniques were (i) user studies, (ii) synthetic experiments, and (iii) real
experiments (Islam et al., 2022). According to Doshi-Velez and Kim (2017), there are
several different approaches to evaluating the explainability of AI models, including
quantitative and qualitative methods. Quantitative methods rely on metrics such as
accuracy and comprehensibility to measure the quality of the explanation provided
by an AI model. Qualitative methods, on the other hand, involve user studies and
expert evaluations to assess the effectiveness of an explanation.

2.1.3.1 Related Works


Many researchers have attempted to evaluate and comparatively analyse explanation
methods (e.g., Adebayo et al., 2018; Man and Chan, 2021; Y. Zhou et al., 2022).
There are multiple ways to categorise explainable methods, but a valuable and often
dismissed perspective is to consider the information type an XAI method produces.
Methods that answer the question "why not something else?" produce counterfactual
instances and cannot be included in the same category as feature attribution methods,
which aim to produce contributions of instance features. Y. Zhou et al. (2022) pointed
out that attribution is not a well-defined term, as they compare additive methods (e.g.,
SHAP (Lundberg & Lee, 2017)) against non-additive methods (e.g., Smilkov
et al., 2017; Selvaraju et al., 2020). Their rationale for the selection is that all these
methods can be used to produce visualisations known as saliency maps. Y. Zhou
et al. (2022) propose transforming datasets as a means to create ground-truth data
and assess whether these methods can succeed in recovering them. However, none of
the methods can be considered satisfactory (Ahmed et al., 2022).
The benefit of limiting the set of methods to evaluate lies in the ability to compare
along the same deliverable. AFA methods (Lundberg & Lee, 2017) share the same
properties; thus, using the features they identify as most important, together with
their local accuracy, seems a reasonable starting point. As recommended by various
authors (e.g., Sundararajan et al., 2017; M. Yang and Kim, 2019; DeYoung et al.,
2020; Liu et al., 2021), the use of benchmark datasets is valid as long as the evaluation
is limited to feature importance or local accuracy. As previously described (F. Yang et
al., 2019), benchmark datasets are not recommended for evaluating explanations for
user consumption because explanations are user-, context-, and application-specific
(e.g., Gunning and Aha, 2019; Mueller et al., 2019; Barredo Arrieta et al., 2020).
Ultimately, the evaluation of explainable methods is crucial for developing more
transparent and trustworthy AI models. As more organisations begin to adopt
AI-based DSS, the need for effective evaluation methods will only become more
pressing. By continuing to develop evaluation frameworks, this research study can
help to ensure that explainable AI is both accurate and effective in supporting human
decision-making. Specifically, it compares two XAI methods that belong to the
category of AFA methods, namely SHAP (Lundberg & Lee, 2017) and LIME (Ribeiro
et al., 2016). In addition, this research also develops a novel method of generating a
benchmark dataset to evaluate XAI methods.
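
As a hedged illustration of such a quantitative evaluation (a minimal sketch under
simplifying assumptions, not the benchmark-generation procedure proposed in Paper
E), the following example builds a synthetic regression dataset whose ground-truth
feature importances are known by construction and scores how well an
importance-based feature ranking recovers them using the normalised Discounted
Cumulative Gain (nDCG); the data, model, and attribution source are all placeholders.

# Illustrative quantitative evaluation of a feature ranking against a known
# ground truth using nDCG; the synthetic data stand in for a benchmark dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: the true coefficients define the ground-truth importance.
true_coef = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.0])
X = rng.normal(size=(1000, true_coef.size))
y = X @ true_coef + rng.normal(scale=0.1, size=X.shape[0])

model = LinearRegression().fit(X, y)
estimated_importance = np.abs(model.coef_)   # stand-in for attributions from an XAI tool

def ndcg(relevance, ranking, k=None):
    """nDCG of a best-first feature ranking against ground-truth relevance."""
    k = k or len(ranking)
    discounts = np.log2(np.arange(2, k + 2))
    dcg = np.sum(relevance[ranking[:k]] / discounts)
    idcg = np.sum(np.sort(relevance)[::-1][:k] / discounts)
    return dcg / idcg

relevance = np.abs(true_coef)                      # ground-truth relevance scores
ranking = np.argsort(estimated_importance)[::-1]   # features ordered by estimated importance
print(f"nDCG of the recovered ranking: {ndcg(relevance, ranking):.3f}")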

2.2 Road Safety

Road safety is largely related to drivers' behaviour and characteristics, as more
than 90% of traffic injuries occur due to drivers' errors while driving (Sam et al.,
2016), a task that combines several dynamic and complex activities, including
simultaneous visual, cognitive and spatial tasks (H. Kim et al., 2018). Fastenmeier
and Gstalter (2007) defined driving as a human-machine system that continuously
changes with the environment. The components of the environments are traffic flow
(high or low), road layout (straight, junctions, roundabout or curves), road design
(motorways, city or rural), weather (rainy, snowy or windy), time of day (morning,
midday or evening), etc. These components define the overall complexity of the
driving task. To increase drivers’ vigilance during driving, different policy-making
authorities worldwide have published mandates for compulsory installation of safety
features in newly produced automotive vehicles from the year 2022 (European
Commission, 2019). These mandates demand extensive research on the development
of intelligent systems for road safety that are transparent and acceptable to humans.

2.2.1 Related Works


AI approaches have long been investigated in the
development of in-vehicle road safety features such as driver drowsiness and
distraction detection, intelligent speed assistance, reversing safety with cameras or
sensors etc. Researchers utilised available abundant and diverse forms of data, such
as vehicular signals, neurophysiological signals, etc., to assess drivers’ in-vehicle states
(Begum & Barua, 2013; Aricò, Borghini, Di Flumeri, Colosimo, Bonelli, et al.,
2016; Barua et al., 2017; Charles & Nixon, 2019; Islam et al., 2019; Islam et al.,
2020). In addition, a few works have added explainability to the systems that assess
drivers' states. F. Zhou et al. (2022) proposed an explainable model for drivers'
fatigue prediction using Gaussian Process Boosting and SHAP. In another work,
an explainable fuzzy logic system was proposed that can capture the uncertainties
within driving features and classify drivers in terms of risky driving styles (Mase
et al., 2020). An explainable model for detecting riding patterns of motorbikes was
developed by Leyli abadi and Boubezoul (2021). This work is a contribution to road
safety and XAI but does not directly relate to drivers. To date, no work
was found that incorporates both physiological and vehicular signals into modelling
drivers' states and behaviours, let alone adds explainability to such models. In
contrast, this research study is partially aimed at developing XAI models for
assessing drivers' in-vehicle states and behaviours that, together with their
explanations, can be further incorporated into advanced safety features of
automotive vehicles.


2.3 Air Traffic Flow Management

Air Traffic Flow Management (ATFM) is a branch of Air Traffic Management (ATM),
which is a vast and complex domain (Erdi, 2008) encompassing all activities carried
out to ensure the safety and fluidity of air traffic. Specifically, ATM aims at
efficiently managing and maximising the use of the different resources available to it,
for example, the airspace and its subdivisions such as sectors, air routes, airports,
and runways, by the users of these resources, e.g., aircraft and airlines, in any
timeframe of their use, i.e., in the taxi phase at the airport or in any flight phase
simplified by the triplet climb, cruise, descent, while ensuring flight safety
(Allignol et al., 2012).
However, ATFM largely deals with balancing the demand and capacity of air
traffic by modifying the airspace (Degas et al., 2022; Hurter et al., 2022). Generally,
the airspace is divided into several sectors by the respective flying authorities, and
the airspaces have a limited capacity to handle the number of flights flying over
the airspace. The demand for airspace usage is put forward by different aircraft
carriers. The ATFM practitioners analyse the historical data of flown flights and
maintain the balance of demand and capacity by considering the propagation of
flight take-off time (TOT) delays. ATFM delay costs airlines, on average,
approximately 100 Euros per minute (Cook & Tanner, 2015). According to the Federal Aviation
Administration1 (FAA) report2 in 2019, the estimated cost due to delay, considering
airlines, passengers, lost demand, and indirect costs, was thirty-three billion dollars.
This high cost justifies the increased interest in predicting TOT delays (Dalmau et
al., 2021).

2.3.1 Related Works


The literature contains evidence of research works applying AI in the domain of
ATFM. Most of these works focus on predicting flight delays. The TOT is one of
the root indicators of the delay of an aircraft as it propagates to all transportation
networks, hence, predicting it is the key to enhancing air traffic (Dalmau et al., 2019).
Predicting the delay of TOT is a regression problem, where feature sets (both numeric
and categorical) are used from flight plans, weather reports, and airline information.
Departure delay has been characterized considering the spatial and temporal aspects
(e.g., Rebollo and Balakrishnan, 2014; Y. J. Kim et al., 2016; Dalmau et al., 2019; Yu
et al., 2019; Kovarik et al., 2020; Tran et al., 2020). The methods used for predicting
tasks in ATFM include NN, random forest (RF), gradient boosting machines, SVM,
and linear regression (Kovarik et al., 2020).
Unfortunately, despite several research works already carried out on AI for the
ATFM domain, AI has not become fully operational there, nor has it brought clear
benefits to end users (Degas et al., 2022). The slow progress in the use of AI in the
ATM domain is explained by the fact that ATM is a critical domain with lives at
stake, where safety is the top priority. However, the current research trend on XAI
has drawn growing attention to incorporating explainable methods with a view to
increasing the trust of ATFM practitioners in AI-based DSS, which motivates the
development and evaluation of an explainable flight delay prediction model within
the scope of this doctoral research study.

1 https://www.faa.gov
2 https://www.faa.gov/data_research/aviation_data_statistics/media/cost_delay_estimates.pdf

Chapter 3

Materials and Methods

This chapter outlines the research methodology, data acquisition, analytical
techniques, and evaluation methods used in this thesis.

This doctoral research has been conducted as exploratory research under the
framework of three scientific research projects: two for the domain of Road Safety
(RS) and one for the domain of Air Traffic Flow Management (ATFM). The first
project SIMUSAFE 1 (SIMUlation of behavioural aspects for SAFEr transport),
was funded by the European Union’s Horizon 2020 Research and Innovation
Programme. This project was a collaboration among several institutions across
Europe and aimed to improve driving simulators and traffic simulation technology
with Machine Learning (ML) to safely assess risk perception and decision-making
of four principal types of road users - Car Drivers, Motorcyclists, Bicycles and
Pedestrians. In this research study, the behaviours of car drivers are analysed
only; other types of road users were analysed by other research partners. The
second project BrainSafeDrive 2 developed a technology to detect drivers’ in-drive
mental state from neuro-physiological signals for improving the safety of the
road. It was initiated as a collaboration between the Sapienza University of
Rome & BrainSigns s.r.l.3 from Italy and the Mälardalen University (MDU)
from Sweden co-funded by the Vetenskapsrådet - The Swedish Research Council4
and the Ministero dell’Istruzione dell’Università e della Ricerca della Repubblica
Italiana5 under the Italy-Sweden Cooperation Program. Both the projects, SIMUSAFE
and BrainSafeDrive, addressed the domain of RS. Lastly, the third project ARTIMATION 6 (Transparent
Artificial Intelligence and Automation to Air Traffic Management Systems), was a
collaborative project that was conducted to provide a transparent and explainable
model through visualisation, data-driven storytelling and immersive analytics. This
1 https://www.cordis.europa.eu/project/id/723386
2 https://www.brainsafedrive.brainsigns.com
3 https://www.brainsigns.com
4 https://www.vr.se
5 https://www.mur.gov.it
6 https://www.artimation.eu

project took advantage of human perceptual capabilities to better understand the
algorithms with visualisations as support for XAI, exploring the domain of ATFM.

3.1 Data Acquisition

The data for the thesis work was acquired as parts of all the projects mentioned
earlier. The data utilised for the studies on applications of XAI in RS was acquired
from BrainSafeDrive and SIMUSAFE, which contained heterogeneous forms of data
from driving experiments. For ATFM, the aviation dataset was acquired within the
framework of the project ARTIMATION.

3.1.1 Driving and Physiological Datasets


Two different datasets were used to conduct the studies to develop the applications
for RS. The first dataset was collected within the project BrainSafeDrive in Italy
by the collaborating partner institutions. This dataset has been used for the studies
presented in three of the included Papers A1, A2, and C, and the experimental
protocol used to acquire the dataset is described in the corresponding Sections A1.3.1,
A2.3.1, and C.2.1 of the papers.
The second dataset was collected from another experiment carried out in Poland
within the project SIMUSAFE. This dataset has been used to conduct the studies
presented in Paper D, and the data collection protocol is presented in Section D.2.2.
For both datasets, the vehicular signals, Electroencephalography (EEG) signals
and information on the road structures have been analysed in this doctoral
study. In addition, the dataset from the project SIMUSAFE also contained
Electrocardiography (ECG) and Galvanic Skin Response (GSR) signals. Table 3.1
presents a summary of the participants, number of laps, acquired signals, etc.

Table 3.1: Summary of the datasets from the domain of RS.

Parameters                 BrainSafeDrive                  SIMUSAFE
Number of participants     20                              16
Age of participants        24.9 ± 1.8                      18-24 & 50+
Gender of participants     Male                            Male & Female
Type of experiment         Real Driving                    Simulator & Track Driving
Number of laps             3                               7
Number of iterations       2                               1 for each of Simulator & Track
Analysed data              Vehicular Signal, EEG &         Vehicular Signal, EEG, ECG,
                           Environment                     GSR & Environment
Data labels                Experts' annotation             Experts' annotation


3.1.2 Flight Dataset


The data that was utilised for the analysis and development of XAI methods for
ATFM was acquired from EUROCONTROL7 . The source of the data was Enhanced
Tactical Flow Management System (ETFMS) flight data messages for the flown
flights during the period of May – October 2019. The dataset includes basic
information about the flight, the status of the flight and previous flight leg, different
air traffic regulations, weather and calendar information, etc. Table 3.2 contains the
summary of the dataset acquired for developing the explainable model for flight delay
prediction. The experiments with the aviation dataset are reported in Papers E and
F and co-authored work (Jmoona et al., 2023). Additionally, Appendix E.A of Paper
E contains a detailed description of the features from the aviation dataset.

Table 3.2: Summary of the aviation dataset.

Parameters Values
Number of instances 7,613,584
Number of aircrafts 18,214
Number of flights 609,202
Maximum number of flights per day 152
Minimum number of flights per day 1
Average number of flights per day 12
Minimum delay (minutes) 0
Maximum delay (minutes) 68
Average delay (minutes) 15

3.1.3 Ethical Considerations


All the research activities have been conducted within the framework of three different
research projects. Respective ethical guidelines were followed while performing the
experimental studies for particular projects. Separate non-disclosure agreements
(NDA) were signed for each project to maintain the standard ethical issues, i.e.,
protect the rights of research participants, enhance research validity and maintain
scientific or academic integrity. As the datasets were acquired in different locations
and collaborating partner institutions, respective national laws were followed, which
are summarized below for each research project:

• BrainSafeDrive – The experimental data was collected by the researchers from


the Department of Molecular Medicine, Sapienza University of Rome8 , Rome,
Italy. The experiment was conducted following the principles outlined in the
Declaration of Helsinki of 1975, as revised in 2000 (World Medical Association,
2001). Informed consent and authorization to use the video graphical material
were obtained from each subject on paper after the explanation of the study.
7 https://www.eurocontrol.int
8 https://web.uniroma1.it/dmm


• SIMUSAFE – The data was collected by APTIV9 , Kraków, Poland. The


experiment was conducted after the approvals of the APTIV European Legal
Team following the Polish Constitution (Article 29) in agreement with the
General Data Protection Regulation (GDPR) (Wachter et al., 2018). Each
subject signed the informed consent, and they hold the right to withdraw their
data from studies.
• ARTIMATION – In this project, the data was retrieved from
EUROCONTROL Aviation Data for Research Repository10 online portal. The
terms of use for the data include sole utilisation for research and development
(R&D) purposes and prohibit sharing.
The data acquired through the procedures mentioned in Section 3.1 are
anonymised prior to reception at MDU, which leaves no vulnerability to the rights
of the research participants. As a data processor within the framework of MDU, the
research data will be stored for a minimum of 10 years following the regulations of
Swedish Universities. However, upon request from the participants, the data will be
erased following the General Data Protection Regulation (GDPR) (Wachter et al.,
2018) supplemented by Swedish National Laws, such as the Data Protection Act
(2018:218).

3.2 Research Methodology

This doctoral thesis presents the outcome of the research conducted at different levels,
such as literature review, data collection and development of new methodologies
through several exploratory research studies (Yeation et al., 1995), i.e., all the studies
have been conducted in three stages: exploration, generation and evaluation. In
addition, the inductive approach (Young et al., 2020) has been followed to present the outcome of
the studies, which includes three steps: observation, generalization, and paradigm.
The levels concerning the exploratory research works and literature review are briefly
described in the following paragraphs, and Figure 3.1 illustrates the connections
among the significant aspects, including data collection.
The exploratory research works performed in this doctoral study are outlined below;
together, they aim to enhance the transparency of DSS through the utilisation of XAI
methods:
• In the initial stage of the doctoral study, the traditional technique of feature
extraction from EEG signal was automated with the use of the convolutional
neural network - autoencoder (CNN-AE). This exploration of CNN-AE reduced
the complex and computationally expensive approach of signal processing and
manual calculations of EEG feature extraction for assessing drivers’ mental
workload.
• A novel hybrid template for vehicular parameters and EEG to measure drivers’
in-vehicle mental workload was derived using Mutual Information (MI). Once
the EEG signal has been recorded, the template can be reused to estimate EEG
features from the concurrent vehicular signal, reducing the complexity of
recording drivers’ EEG repeatedly while driving.
9 https://www.aptiv.com
10 https://www.eurocontrol.int/dashboard/rnd-data-archive


Figure 3.1: The inter-connected significant aspects of the research methodology
followed in this thesis work. The figure links the literature review topics, the
exploratory research works, and the data collected in the projects BrainSafeDrive,
SIMUSAFE and ARTIMATION.
• An explainable mental workload assessment model for car drivers was developed
with SHAP. Here, MI was used to relate the auto-encoded features to the
traditional features of EEG, and these relations were presented with chord
diagrams alongside the visual explanations from SHAP.
• Explainable models for classifying drivers’ risky and hurried behaviours were
developed. The explanations were generated with two different Additive Feature
Attribution (AFA) methods: SHAP and LIME. Here, the feature attributions
were evaluated using quantitative metrics that were used in other domains but
serve a similar purpose for AFA.
• The prediction models developed with XGBoost perform better than traditional
decision trees, but they lack interpretability. A novel method was devised to
derive a single decision tree representing the inference of the several trees of an
XGBoost model, so as to retain the interpretability of the models.
• Evaluation of the explainable models is an important topic of XAI research
because of the scarcity of standard metrics and methods to assess the
performance of the explainable models. To support the field of evaluating
explainable models, a method of generating benchmark datasets is being
developed to measure the quality of feature attribution by some selected XAI
tools.

The dissemination of the exploratory research works and their connection to the
specific research questions and contributions are discussed in Section 1.4.

Literature Review. To investigate the current research trends in the respective
domains of XAI, RS, and ATFM, several literature studies were conducted. Firstly,
the prevailing uses of different parameters for drivers’ mental workload assessment
were examined with a view to identifying the open research issues and possible future
research directions. In addition, state-of-the-art methods that have been exploited
for flight delay prediction in the operations of ATFM were studied. Afterwards,
the available methods for adding explainability to systems that are already
facilitated with Artificial Intelligence (AI) or ML models were analysed for further
enhancement and adopted as advanced measures for both the domains of RS and
ATFM. As the research field of evaluating XAI methods is still in its initial state,
a thorough review of the exploited methods was conducted to propose and develop a
benchmark dataset to evaluate the explanation models.

3.3 Methods

This section presents the summaries of the classifier or regression models, XAI
methods and evaluation metrics invoked across different experimental studies
presented in this thesis. Implementation details are presented in the respective papers
included in this thesis.

3.3.1 Classification and Regression Models


This section contains brief descriptions of the AI or ML models used in different
classification and regression tasks. The hyperparameters of the models were set
specifically in individual experimental studies.

3.3.1.1 Logistic and Linear Regressor


Regression is the simplest supervised ML model; it estimates the relationship
between independent and dependent variables with statistical analyses (Li, 2019).
Linear Regression (LnR) and Logistic Regression (LgR) are deployed to predict
continuous and binary categorical values, respectively, which aligns with the
regression and classification tasks in this study. For both tasks in the corresponding
studies, normalized data has been used. Moreover, for classification, LgR has been
trained with balanced class weights and L2 regularization.
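
To make this configuration concrete, the following minimal sketch shows how such
models might be set up with scikit-learn; the synthetic data, pipeline, and
hyperparameters are illustrative assumptions rather than the exact setup used in
the included papers.

# Minimal sketch (not the exact experimental setup): logistic and linear
# regression on normalised data, using scikit-learn.
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary classification with balanced class weights and L2 regularisation.
X_cls, y_cls = make_classification(n_samples=200, n_features=10, random_state=0)
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(class_weight="balanced", penalty="l2"))
clf.fit(X_cls, y_cls)

# Continuous target prediction with linear regression on normalised features.
X_reg, y_reg = make_regression(n_samples=200, n_features=10, random_state=0)
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(X_reg, y_reg)
print(clf.score(X_cls, y_cls), reg.score(X_reg, y_reg))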

3.3.1.2 k-Nearest Neighbours


In ML, k-Nearest Neighbours (kNN) is recognized for its flexibility and memory-based
nature. kNN stands as arguably the simplest and most extensively utilized classifier.
Unlike methods that necessitate the creation of a model tailored to the data,
kNN operates by leveraging observations in the training set to identify the most
similar properties within the test dataset (Larose, 2004). Furthermore, kNN is
deemed a universally consistent classifier (Luxburg & Schölkopf, 2011), employing the
Euclidean distance metric to identify the k closest neighbours in the dataset for each
instance. Given its reliance on a distance function, explaining the nearest-neighbour
model during predictions is straightforward. However, explaining the inherent
knowledge acquired by the model can be challenging.
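
As an illustration of the neighbour-based reasoning described above, the sketch
below (with assumed data and value of k) fits a kNN classifier with the Euclidean
metric and retrieves the k most similar training instances that support a single
prediction.

# Minimal sketch (illustrative data and k): kNN with the Euclidean distance,
# and retrieval of the neighbours that support an individual prediction.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=1)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)

query = X[:1]                       # one test instance
pred = knn.predict(query)[0]
dist, idx = knn.kneighbors(query)   # the k most similar training instances
print(f"Predicted class: {pred}")
print("Supporting neighbours (index, distance, label):")
for i, d in zip(idx[0], dist[0]):
    print(i, round(float(d), 3), y[i])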

3.3.1.3 Support Vector Machine


Support Vector Machine (SVM) was developed by Vapnik (1991). The
working principle of SVM concentrates mostly on finding the hyper-plane which
simultaneously minimises the empirical classification error and maximises the
geometric margins in the classification tasks (Vapnik, 1991). SVM transforms the
original data points from the input space to a high-dimensional space that facilitates
the classification task by determining a decision boundary. For prediction or regression
tasks, the decision boundary is used to predict the continuous value or target value
(Jain et al., 2000). SVM-based regression and classification models have a very
good generalisation capability on multidimensional data and dynamic classification
or prediction schemes, which makes it appropriate for the concerned tasks (Guyon
et al., 2002). Moreover, the literature shows extensive use of SVM in the domain of
EEG signal analysis and MWL assessment (Saccá et al., 2018; Saha et al., 2018; Wei
et al., 2018).
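
The following is a minimal sketch, with assumed kernels and hyperparameters, of
how SVM-based classification and regression models could be instantiated with
scikit-learn for tasks of the kind described above.

# Minimal sketch (illustrative hyperparameters): SVM for classification and
# regression, mirroring the dual use described above.
from sklearn.datasets import make_classification, make_regression
from sklearn.svm import SVC, SVR

Xc, yc = make_classification(n_samples=200, n_features=10, random_state=2)
svc = SVC(kernel="rbf", C=1.0).fit(Xc, yc)   # decision boundary in feature space

Xr, yr = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=2)
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(Xr, yr)

print(svc.score(Xc, yc), svr.score(Xr, yr))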

3.3.1.4 Random Forest


Random Forest (RF) is an ensemble method which builds a collection of randomised
decision trees developed from bootstrapped data points and predicts on the basis of
majority voting from all the trees for classification tasks (Breiman, 2001), whereas for
regression tasks, it takes the average of the predictions. In addition, RF operates
with an underlying feature selection mechanism which automatically removes
unimportant features for prediction tasks. RF is implemented using bootstrapping as
an ensemble method.
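
A minimal sketch of the behaviour described above, assuming scikit-learn's
implementation: bootstrapped trees are combined by voting for classification, and
the fitted forest exposes feature importances that act as an implicit feature
selection signal.

# Minimal sketch (illustrative data and settings): Random Forest with
# bootstrapped trees and its built-in feature importance estimates.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=3)
rf = RandomForestClassifier(n_estimators=200, bootstrap=True, random_state=3)
rf.fit(X, y)

# Majority voting over the individual trees happens inside predict();
# feature_importances_ indicates which features the forest relied on.
ranked = np.argsort(rf.feature_importances_)[::-1]
print("Most important features:", ranked[:4])
print("Training accuracy:", rf.score(X, y))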

3.3.1.5 Boosted Decision Trees


Two different boosted ensemble models (Sagi & Rokach, 2018) built on decision
trees are exploited in the experiments of this research study. The first one is GBDT,
in full Gradient Boosted Decision Trees, which learns on differentiable loss functions
(Zhang & Jung, 2021). The other one is XGBoost – Extreme Gradient Boosting
(T. Chen & Guestrin, 2016), a variant of GBDT that uses second-order gradients to
improve accuracy. In both models, the training includes the generation of a specified
number of decision trees from the training set. For each tree, a residual error is
calculated, which is iteratively used to train the next tree with the goal of reducing
the error. Thus, in theory, the last tree gives the prediction with the least error.
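
The sketch below, using illustrative data and hyperparameters, places the two
boosted ensembles side by side: scikit-learn's GradientBoostingRegressor as a GBDT
and the XGBoost regressor, which additionally exploits second-order gradient
information.

# Minimal sketch (assumed hyperparameters): GBDT and XGBoost regressors
# trained by sequentially adding trees that fit the residual errors.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

gbdt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                 random_state=4).fit(X_tr, y_tr)
xgb = XGBRegressor(n_estimators=200, learning_rate=0.05,
                   random_state=4).fit(X_tr, y_tr)

print("GBDT MAE:   ", mean_absolute_error(y_te, gbdt.predict(X_te)))
print("XGBoost MAE:", mean_absolute_error(y_te, xgb.predict(X_te)))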


3.3.1.6 Multilayer Perceptron


Multilayer Perceptron (MLP) (Ramchoun et al., 2016) is a subclass of Artificial
Neural Network (ANN) (Basheer & Hajmeer, 2000) with at least three layers of nodes
- input layer, hidden layer and output layer. Each layer contains nodes, also known
as neurons or perceptrons. Neurons in each layer are connected to every neuron in the
subsequent layer. Each neuron in an MLP is associated with an activation function.
Common activation functions include the sigmoid, hyperbolic tangent (tanh), or
rectified linear unit (ReLU). MLP is a feedforward neural network, meaning that
information flows through the network in one direction—from the input layer to
the output layer—without cycles or loops. These models are trained using a process
called backpropagation. During training, the network adjusts its weights based on the
error between the predicted output and the actual output. This process is repeated
iteratively until the model’s performance reaches a satisfactory level. The MLPs
have been trained for both classification and regression tasks in this research, where
the hyperparameters varied in different experiments that are presented in respective
papers.
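
As a small illustration of the architecture described above, the following sketch
trains a feedforward MLP with two hidden layers and ReLU activations via
backpropagation; the layer sizes and data are assumptions for demonstration only.

# Minimal sketch (illustrative architecture): an MLP classifier trained with
# backpropagation on standardised features.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=20, random_state=5)
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                  max_iter=500, random_state=5),
)
mlp.fit(X, y)                       # weights adjusted iteratively from the error
print("Training accuracy:", mlp.score(X, y))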

3.3.1.7 Case-based Reasoning


Case-Based Reasoning (CBR) is a category of AI models that leverage past
experiences to address current problems. According to Mitchell (1997), CBR is
characterized as an instance-based approach and falls under the category of lazy
learning, implying that it refrains from reasoning until it becomes imperative.
Kolodner (1992) conceptualizes CBR as a reasoning system that tackles new problems
by recalling and applying historical situations analogous to the present scenario. In
the context of CBR, a case denotes an experience derived from a previously solved
problem, with cases serving as the foundation for reasoning. The term reasoning
pertains to the method of problem-solving, signifying that CBR aims to resolve a
problem through inferences drawn from previously solved cases (Richter & Weber,
2013). Aamodt and Plaza (1994) outlined the CBR cycle, encompassing four steps:
retrieve, reuse, revise, and retain. In this doctoral study, only the first step (i.e.,
retrieve) is used. The task is to search the case library for cases that resemble the
new problem description with similarity measures.
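
A toy sketch of the retrieve step only, under the assumption of a purely numerical
case library and Euclidean similarity: the cases most similar to a new problem
description are returned together with their stored solutions. This is an
illustration, not the CBR implementation used in this thesis.

# Minimal sketch (illustrative case library): the CBR "retrieve" step, finding
# the cases most similar to a new problem description by Euclidean distance.
import numpy as np

case_library = np.array([[0.2, 1.5, 3.0],    # problem descriptions (features)
                         [0.8, 1.1, 2.2],
                         [0.3, 1.6, 2.9],
                         [0.9, 0.4, 1.0]])
solutions = ["low workload", "high workload", "low workload", "high workload"]

def retrieve(query, k=2):
    """Return the k most similar cases and their stored solutions."""
    distances = np.linalg.norm(case_library - query, axis=1)
    order = np.argsort(distances)[:k]
    return [(int(i), float(distances[i]), solutions[i]) for i in order]

new_problem = np.array([0.25, 1.55, 2.95])
for idx, dist, solution in retrieve(new_problem):
    print(f"case {idx}: distance={dist:.3f}, solution={solution}")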

3.3.2 Explainable Artificial Intelligence Methods


The explainability methods exploited in the studies are popular feature attribution
methods according to the literature (Mainali & Weber, 2023; Islam et al., 2022). All
of these exploited explainable methods are briefly described in this section.

3.3.2.1 SHAP
SHAP – Shapley Additive Explanations (Lundberg & Lee, 2017) is an explainability
tool encompassing a mathematical technique that was developed based on the Shapley
values proposed by Shapley (1953) in cooperative game theory. Shapley values
are a mechanism to fairly assign impact to features that might not have an equal
influence on the predictions. To generate additive explanations for predictions from
black-box models, the concept of the Shapley value was incorporated. For instance, in
delay prediction, SHAP explains a decision from the model (i.e., a prediction) by
calculating the contribution of each feature to that prediction. SHAP is available
as a Python package11, which generates explanations for text and image data with
its Explainer implementation. For tabular data, KernelExplainer is model-agnostic,
while TreeExplainer targets tree-based models, both single trees and ensembles.
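
The sketch below shows, under assumed models and data, how the SHAP package
can be applied to tabular data with TreeExplainer for a tree ensemble and
KernelExplainer as the model-agnostic alternative.

# Minimal sketch (illustrative model and data): SHAP feature attributions for
# a tree ensemble (TreeExplainer) and a model-agnostic fallback (KernelExplainer).
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=6, random_state=6)
model = RandomForestRegressor(n_estimators=100, random_state=6).fit(X, y)

tree_explainer = shap.TreeExplainer(model)
shap_values = tree_explainer.shap_values(X[:5])     # per-feature contributions
print(shap_values.shape)                            # (5 instances, 6 features)

# Model-agnostic variant: uses a small background sample and the predict function.
kernel_explainer = shap.KernelExplainer(model.predict, X[:50])
print(kernel_explainer.shap_values(X[:1]).shape)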

3.3.2.2 LIME
LIME stands for Local Interpretable Model-agnostic Explanations (Ribeiro et al.,
2016). It is a tool that uses an interpretable model to approximate each individual
prediction made by any black box ML model. LIME uses a three-step process to
determine the specific contributions of the chosen features: perturbing the original
data points, feeding them to the black-box model, and then observing the related
predictions. LIME is available as a package12 for Python, which is used to generate
explanations for tabular, image and text data.
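
A hedged sketch of the perturbation-based workflow described above, using the lime
package's tabular explainer on an assumed regression model; the feature names and
settings are illustrative.

# Minimal sketch (illustrative model and data): LIME explaining one prediction
# of a black-box regressor by fitting a local interpretable surrogate.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=7)
model = RandomForestRegressor(n_estimators=100, random_state=7).fit(X, y)

feature_names = [f"feature_{i}" for i in range(X.shape[1])]
explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                 mode="regression")
explanation = explainer.explain_instance(X[0], model.predict, num_features=5)
for condition, weight in explanation.as_list():
    print(f"{condition}: {weight:+.3f}")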

3.3.2.3 DALEX
Model-Agnostic Language for Exploration and Explanations, in short, DALEX is
a Python library built upon the software for explainable ML proposed by Biecek
(2018). The main goal of the DALEX tool is to create a level of abstraction around
a model that makes it easier to explore and explain the model. Explanation deals
with two uncertainty levels: model level and explanation level. The underlying idea
is to capture the contribution of a feature to the model’s prediction by computing the
shift in the expected value of the prediction while fixing the values of other features.
In this study, for flight TOT delay prediction, DALEX has been used as a Python
package13 to generate an interactive Breakdown plot, which detects local interactions
of user-selected features.
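
The following is a hedged sketch of how the dalex package might be used to obtain
a break-down style attribution for one instance; the model, data, and the choice of
the break_down_interactions type are assumptions for illustration.

# Minimal sketch (illustrative model and data): a DALEX explainer producing a
# break-down attribution, with detected interactions, for one instance.
import dalex as dx
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=8)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])
model = RandomForestRegressor(n_estimators=100, random_state=8).fit(X, y)

explainer = dx.Explainer(model, X, y, label="RF regressor")
parts = explainer.predict_parts(X.iloc[[0]], type="break_down_interactions")
print(parts.result[["variable", "contribution"]])
# parts.plot() would render the interactive Breakdown plot.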

3.3.3 Evaluation Methods and Metrics


The developed methods have been evaluated based on the categories of the models.
One category contains the AI or ML models, i.e., classifiers and regressors. These
models have been evaluated with appropriate measures and metrics such as Accuracy,
F1 score, Mean Absolute Error (MAE), etc. based on classification or regression
tasks. Another category contains the XAI methods. In their quantitative evaluation,
different metrics are used, such as Normalised Discounted Cumulative Gain (nDCG)
(Järvelin & Kekäläinen, 2002), Spearman’s rank correlation (Zar, 1972), Coefficient
of Quartile Variation (CQV) (Bonett, 2006), etc.
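
The small sketch below illustrates, with toy values, how these metrics can be
computed in Python: nDCG with scikit-learn, Spearman's rank correlation with
SciPy, and the CQV from the quartiles of a set of attribution scores.

# Minimal sketch (toy values): computing nDCG, Spearman's rank correlation and
# the Coefficient of Quartile Variation (CQV) for feature-attribution scores.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import ndcg_score

reference = np.array([[3.0, 2.0, 1.0, 0.5]])   # reference feature relevance
attributed = np.array([[2.5, 2.1, 0.4, 0.9]])  # scores from an XAI method

print("nDCG:", ndcg_score(reference, attributed))
print("Spearman:", spearmanr(reference[0], attributed[0]).correlation)

q1, q3 = np.percentile(attributed, [25, 75])
print("CQV:", (q3 - q1) / (q3 + q1))            # (Bonett, 2006)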

11 https://shap.readthedocs.io
12 https://github.com/marcotcr/lime
13 https://pypi.org/project/dalex

Chapter 4

Explainable Artificial Intelligence for


Decision Support Systems

This chapter describes the developed applications of XAI and their
evaluation criteria.

A number of different explainable models are developed with a view to enhancing
the transparency of Decision Support Systems (DSS), particularly for the application
domains of Road Safety (RS) and Air Traffic Flow Management (ATFM). These
application domains were exploited in this thesis work due to the framework of
the research projects mentioned in the preamble of Chapter 3. In addition, these
domains are also among the least facilitated by current Explainable Artificial
Intelligence (XAI) research (Degas et al., 2022; Islam et al., 2022). To advance the XAI research,
several methods are developed in this thesis work. A novel approach for evaluating
XAI methods, particularly the Additive Feature Attribution (AFA) methods, is
formulated using a synthetic benchmark dataset. In this evaluation approach, an
additive form of Case-based Reasoning (CBR), namely Additive CBR (AddCBR),
is developed and used as the baseline. Another method is developed to interpret
the decisions of the Extreme Gradient Boosting (XGBoost) model using its internal
structure. The methods, implementation and evaluation results of the
developed models are presented briefly in the subsequent sections. Furthermore,
results from additional experiments that were not incorporated in the papers included
in this thesis are described in this chapter.

4.1 Applications in Road Safety

This thesis work partially contributes to the enhancement of DSS in the domain of
RS by developing two distinct applications that feature explainable models. The first
application is developed for assessing drivers’ in-vehicle Mental Workload (MWL),
while the second application focuses on monitoring drivers’ driving behaviour in
terms of risk and hurry. In the training of these models, a combination of
Electroencephalography (EEG) signals and vehicular data is employed. Novel
approaches have been devised to maximise the utility of EEG signals and extract
features that are comprehensible to humans. This includes the development of a Deep
Learning (DL) based feature extraction method for EEG signals. Additionally, a
hybrid template has been developed to effectively combine vehicular signals with EEG
data, leading to a more comprehensive and interpretable model for RS applications.
For both applications, the explainable models are developed with popular XAI
methods and comparatively evaluated using quantitative measures.

Figure 4.1: Schematic diagram of developing explainable DSS for RS.

A generalised schema of the explainable DSS development for RS is illustrated in
Figure 4.1. The first step from the schema has been performed within the framework
of the projects BrainSafeDrive and SIMUSAFE as described in Section 3.1.1. Step 2
of the illustration concerns the development of CNN-AE, which is discussed in Section
4.1.1. The development of the hybrid feature set combining EEG signal and vehicular
data corresponds to Step 3 and is briefly discussed in Section 4.1.2. The classifier
or regression model building indicated in Step 4 is discussed in both Sections 4.1.1
and 4.1.2, which also evaluated the effectiveness of the developed methods of feature
construction. Sections 4.1.3 and 4.1.4 contain descriptions on building explainable
models for both Steps 5 and 6. However, the evaluation of the XAI methods and the
generated explanations are discussed in Section 4.1.4 only. It is worth mentioning that
the models and explanations generated from Steps 4 - 6 are finalised after iterations
of quantitative evaluation. As a whole, the subsequent sections briefly explain the
steps followed to develop explainable DSS for RS and refer to the included papers
for detailed discussion.

4.1.1 Automated Method of Feature Extraction from Electroencephalography Signals
The literature contains evidence that EEG has emerged as a prominent parameter
for measuring MWL, which is extensively used in diverse research studies (Begum &
Barua, 2013; Aricò, Borghini, Di Flumeri, Colosimo, Pozzi, et al., 2016; Aricò et al.,
2017). However, the manual technique of extracting features from EEG signals is
complex, laborious and computationally expensive. To address these challenges, this
study harnessed the computational power of DL, specifically the Autoencoder (AE)
from a Convolutional Neural Network (CNN), to extract features from EEG signals
automatically.
The architecture of the CNN-AE is elaborately described in Section A1.3.3 of
Paper A1. The design of the AE has been evaluated by comparing the AE-extracted
features with manually extracted features for classifying drivers’ MWL into high and
low classes. In particular, to support the effectiveness of these features, four different
classifiers have been employed, including Support Vector Machine (SVM), Random
Forest (RF), k-Nearest Neighbours (kNN), and Multilayer Perceptron (MLP). The
performances of the classifiers have been measured using accuracy, balanced accuracy
and F1 score, which are summarised in Table 4.1. In Section A1.4 of Paper A1,
evaluation results with additional metrics are also reported. The results demonstrate
improvement in the performance of the classifier models when utilising AE-extracted
features compared to manually extracted features, underlining the potential of DL
techniques in MWL assessment. Particularly, SVM has achieved 87.00% classification
accuracy when trained with AE-extracted features, whereas the highest accuracy is
70.83% using manually extracted features and MLP classifier.

Table 4.1: Performance of different models for classifying drivers’ MWL utilising
AE-extracted features compared to manually extracted features. For all the measures,
higher values are better.

Metric              Feature Extraction    kNN      MLP      RF       SVM
Accuracy            AE                    0.7737   0.8504   0.8049   0.8700
                    Manual                0.6420   0.7083   0.6414   0.5388
Balanced Accuracy   AE                    0.7737   0.8504   0.8049   0.8700
                    Manual                0.6420   0.7083   0.6414   0.5388
F1 score            AE                    0.7912   0.8527   0.8197   0.8730
                    Manual                0.6486   0.7151   0.6442   0.5146
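
To make the idea of auto-encoded EEG features more tangible, the following is an
illustrative PyTorch sketch of a 1-D convolutional autoencoder; the channel count,
window length, and latent size are assumptions and do not reproduce the
architecture detailed in Paper A1.

# Illustrative sketch only (assumed layer sizes, not the Paper A1 architecture):
# a 1-D convolutional autoencoder whose encoder output serves as the
# automatically extracted EEG feature vector.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, n_channels=8, latent_dim=32, window_len=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * (window_len // 4), latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * (window_len // 4)),
            nn.Unflatten(1, (32, window_len // 4)),
            nn.ConvTranspose1d(32, 16, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(16, n_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):                 # x: (batch, channels, samples)
        z = self.encoder(x)               # latent features used by classifiers
        return self.decoder(z), z

model = ConvAutoencoder()
x = torch.randn(4, 8, 128)               # four dummy EEG windows
recon, features = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction objective
print(recon.shape, features.shape, float(loss))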

4.1.2 Hybrid Feature Set from Electroencephalography Signal and Vehicular Data
Automated feature extraction with CNN-AE has demonstrated its potential to
enhance drivers’ MWL classification; however, the need for a more user-friendly
alternative to EEG persists. In fact, employing EEG in real-world, naturalistic
driving environments poses significant challenges in terms of data acquisition,
processing, and real-time decision-making. To address this challenge, a novel feature
template is developed, harnessing Mutual Information (MI) theory. This template
captures shared information between vehicular parameters and EEG signals, as
depicted in Figure 4.2. In the illustration, E and V stand for EEG and vehicular data,
respectively, and their corresponding entropy values are represented by H(E) and
H(V). Concurrently, H(E, V) and I(E, V) represent the union of the entropy spaces
and the MI in the intersecting space, respectively. In practice, a single instance of
EEG and vehicular data can be represented by the vectors e and v from the
corresponding entropy spaces, and m represents a single instance of I(E, V), i.e.,
the MI shared by e and v, which provides the template for generating the feature set.
This template enables the replication of EEG features and the generation of hybrid
feature sets by introducing new vehicular data repeatedly. A detailed theoretical
description of the MI-based template generation process is presented in Section A2.3.3
of Paper A2.

Figure 4.2: Illustration of shared information between EEG and vehicular signal
spaces (Islam et al., 2020).
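
As a hedged illustration of the shared-information idea, and not the exact
template-generation procedure of Paper A2, the sketch below estimates the mutual
information between a simulated vehicular parameter and EEG-derived features
using scikit-learn.

# Minimal sketch (synthetic signals): estimating mutual information between a
# vehicular parameter and EEG-derived features, the quantity I(E, V) above.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
speed = rng.normal(70, 10, size=1000)                 # vehicular signal V
eeg_features = np.column_stack([
    0.05 * speed + rng.normal(0, 1, 1000),            # EEG feature related to V
    rng.normal(0, 1, 1000),                           # unrelated EEG feature
])

# MI of each EEG feature with the vehicular signal (in nats).
mi = mutual_info_regression(eeg_features, speed, random_state=0)
print("Estimated I(E, V) per EEG feature:", np.round(mi, 3))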

To evaluate the effectiveness of this hybrid feature set, it has been utilised
in drivers’ MWL quantification and classification. MLP, RF, and SVM have
been employed to develop classifiers and predictors, with appropriate regression
models used as needed, i.e., Linear Regression (LnR) for quantification and Logistic
Regression (LgR) for classification. The performance of predictors in MWL
quantification is found to be similar for both EEG- and MI-based features. However,
in event classification, MI-based features outperform EEG-based features, with the
exception of LgR. Sections A2.4 and A2.5 of Paper A2 present
and discuss detailed evaluation results for MWL classification and quantification,
including event classification.

4.1.3 Explainable Model for Drivers’ Mental Workload Classification


An explainable model for drivers’ MWL classification is developed leveraging the
CNN-AE extracted features from EEG signals. To enhance the transparency and
interpretability of the classification process, both global and local explanations are
generated using Shapley Additive Explanations (SHAP) for the trained classifier
and specific classification instances. While SHAP values effectively demonstrate the
contributions of different features to the classification outcome, it is essential to note
that these feature labels are often not intuitively understandable to humans. To
bridge this gap and provide more meaningful insights, MI is harnessed to assess how
the auto-encoded features relate and share information with known features of EEG
signals. This combined approach allows for the presentation of MI values between
AE-extracted features and traditional features with Chord diagrams (Tintarev et
al., 2018) alongside SHAP figures, creating a comprehensive and interpretable
explanation for drivers’ MWL classification. Section C.2 of Paper C contains a
comprehensive description of the entire methodology.


4.1.4 Explainable Model for Monitoring Driving Behaviour


In the research study presented in Paper D, explainable models for monitoring
drivers’ risky and hurried driving behaviour have been developed. A range of
classifier models, including Gradient Boosted Decision Trees (GBDT), LgR, MLP,
RF, and SVM, were trained using datasets from simulator tests, track tests, and a
combination of both to classify risky or hurried behaviour. Among these classifiers,
GBDT demonstrated strong performance across datasets and classification tasks.
The classification performances are summarised in Tables D.6 and D.7 of Paper D.
To enhance transparency and interpretability, explanation models are constructed for
GBDT using SHAP and Local Interpretable Model-agnostic Explanations (LIME).
The quantitative evaluation presented in Section D.3.3 of Paper D reveals that SHAP
outperforms LIME in providing meaningful explanations for the classifier’s decisions.

Figure 4.3: Decision Tree for detecting risky driving behaviour. The leaf nodes refer
to risk and no risk in driving behaviour, which are coloured with the darkest shade
of blue and orange, respectively. All other nodes are decision nodes containing the
conditions on corresponding features for splitting the decision paths.


In addition, the study proposed a comprehensive system with a low-fidelity
prototype for monitoring drivers’ behaviour during simulated driving, incorporating
features like Global Positioning System (GPS) plots, heatmaps, and event markers to
visualise driving behaviour. Additionally, the system offers detailed explanations for
specific risky or hurried events, enabling targeted feedback and instruction to modify
drivers’ behaviour and contribute to a safer road environment. Overall, this research
not only advances simulator technologies by identifying biases and differences in
driving behaviour but also establishes a framework for developing driver monitoring
systems capable of detecting, classifying, and explaining risky or hurried driving
behaviour and its contributing factors.

Figure 4.4: Decision Tree for detecting hurried driving behaviour. The leaf nodes
refer to hurry and no hurry in driving behaviour, which are coloured with the darkest
shade of orange and blue, respectively. All other nodes are decision nodes containing
the conditions on corresponding features for splitting the decision paths.


Apart from the studies presented in Paper D, a separate experiment has been
conducted to assess the influential features in drivers’ behaviour classification for risk
and hurriedness. In this experiment, additional features from Electrocardiography
(ECG) and Galvanic Skin Response (GSR) signals have been incorporated in
accordance with the requirements of the project SIMUSAFE. In practice, rule-based
explanations are generated to investigate the features that lead to particular
classifications of risky and hurried driving behaviours. The rule-based explanations
are extracted from separate Decision Trees (DT) trained for each of the classification
tasks. The DTs for classifying risky and hurried driving behaviours are illustrated in
Figures 4.3 and 4.4, respectively.
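
As a hedged sketch of how such rule-based explanations can be read off a trained
decision tree, the example below trains a tree with the entropy criterion on
synthetic data and exports its decision rules as text; the data is artificial and the
feature names are only indicative of those appearing in Figures 4.3 and 4.4.

# Minimal sketch (toy data, assumed feature names): training a decision tree
# with the entropy criterion and exporting its decision rules as text.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
feature_names = ["avg_acce_pedal_pos", "std_steer_angle", "gsr_phasic", "avg_speed"]
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = risky, 0 = no risk (synthetic)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
tree.fit(X, y)

# Each root-to-leaf path is a human-readable rule for one classification.
print(export_text(tree, feature_names=feature_names))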

4.2 Application in Air Traffic Flow Management

The literature indicates various research endeavours on developing Artificial
Intelligence (AI) supported applications for the domain of ATFM (Degas et al.,
2022). Regrettably, despite these efforts, it has yet to achieve full operational status
or deliver tangible benefits to end users, e.g., Air Traffic Controllers (ATCO). The
slow advancement of AI adoption in ATFM can be attributed to the critical nature
of this domain, where human lives are at risk, and safety remains the utmost priority.
Historically, safety in ATFM has been effectively ensured through human-in-the-loop
systems. However, as AI applications seek to find a place in ATFM, they must
be seamlessly integrated into human-centred systems. This integration demands
AI systems that are not only effective but also comprehensible to end users. In
contrast, domains like healthcare and criminal justice have witnessed a growing
interest in AI to aid high-stakes human decision-making, driving the emergence of
XAI (Islam et al., 2022). Simultaneously, within the scope of this thesis work, an
XAI methodology is developed specifically for ATFM. The objective is to enhance the
transparency and interpretability of the DSS tailored for flight Take-off Time (TOT)
delay prediction tasks. This initiative aligns with the broader trend in XAI, which
seeks to bridge the gap between complex AI systems and human understanding,
ensuring that AI technologies can be safely and effectively employed in critical
domains like ATFM. The development of the explainable model for flight TOT delay
prediction is illustrated using a schematic diagram in Figure 4.5 that can also be
adapted for other applications for ATFM.

4.2.1 Explainable Model for Flight Take-Off Time Delay Prediction


Unlike the RS applications discussed in the previous section, which mostly centre
on classification problems, the context of predicting flight TOT delay is a regression
problem. The literature points out an interesting pattern: many XAI methods
initially designed for classification problems are regularly
employed for various regression tasks. Letzgus et al. (2022) have clarified this
inconsistency, highlighting the inherent distinctions between providing explanations
for regression problems and the more established approaches used for classification
tasks. Recognising this, there is a clear need for dedicated XAI techniques to
effectively tackle the unique challenges presented by regression. Due to the scarcity
of methods specialised in explaining regression problems, explainable models for
predicting flight TOT delay are developed with popular XAI methods, i.e., LIME
(Ribeiro et al., 2016) and SHAP (Lundberg & Lee, 2017). Another explainability
tool, namely Model-Agnostic Language for Exploration and Explanations (DALEX)
(Biecek, 2018), is exploited alongside SHAP and LIME for comparative analysis of
the output from the XAI methods.

Figure 4.5: Schematic diagram of developing explainable DSS for ATFM.

Figure 4.6: Performances of ETFMS, GBDT, RF, and XGBoost for flight TOT delay
prediction in terms of MAE in minutes measured at different time intervals to EOBT
in minutes. For MAE, a lower value is better, and the plots for ETFMS and GBDT
are considered as a reference from experimentation performed by Dalmau et al. (2021).

In this study, for the developed explainable model for flight TOT delay prediction,
the data acquisition in Steps 1 and 2 from Figure 4.5 are described in Section 3.1.2.
Two different regression models have been built using RF and Extreme Gradient
Boosting (XGBoost) in Step 3. The built models have been quantitatively evaluated
by comparing their performances with the Enhanced Tactical Flow Management
System (ETFMS) and GBDT developed for the same task by Dalmau et al. (2021).
The developed models in this study outperform the reference models as illustrated
in Figure 4.6. However, XGBoost has been chosen over RF while developing the
XAI methods to explain the flight TOT delay prediction in Step 4. The explainable
methods using SHAP and LIME have been evaluated quantitatively, where SHAP
outperforms LIME. The result of the quantitative evaluation is summarised in Table
4.2. DALEX has not been included in the quantitative evaluation since it only
produces visualisations from its internal values. Finally, in Step 5, three different
explanations have been generated using SHAP (Figure 4.7), LIME (Figure 4.8) and
DALEX (Figure 4.9). These explanations have been evaluated through a user survey
conducted among practising and student ATCOs. The survey protocol and a detailed
description of the entire research study for developing and evaluating the explainable
flight TOT delay prediction model are disseminated in a co-authored work (Jmoona
et al., 2023).

Table 4.2: Local accuracy in terms of MAE and nDCG values for SHAP and LIME
while explaining flight TOT delay prediction. The results are presented for all the test
instances and the top 100,000 instances where the XGBoost model predicted the lowest
error. For MAE, a lower value is better, and for nDCG, a higher value is better. The
best values are highlighted with bold fonts.

                      SHAP                   LIME
No. of Instances   MAE          nDCG      MAE      nDCG
All                3.3 × 10⁻⁶   0.806     8.62     0.882
100k               1.1 × 10⁻⁶   0.722     4.75     0.847

Figure 4.7: Explanation for a single instance of flight TOT delay prediction with
feature contributions extracted from SHAP.

Figure 4.8: Explanation for a single instance of flight TOT delay prediction with
feature contributions extracted from LIME.

Figure 4.9: Explanation for a single instance of flight TOT delay prediction with
feature contributions extracted from DALEX.

4.3 Development of XAI Methods and Evaluation Approach

Advancement of XAI research covers a substantial part of the goal of this doctoral
thesis. To attain this, two different methods of generating explanations for AI or
ML models’ decisions are developed. Furthermore, a robust approach is proposed to
evaluate the AFA methods using a synthetic dataset that captures the underlying
behaviour of the data. All of these proposed methods are briefly discussed in the
following sections, with references to the corresponding papers included in this thesis.

4.3.1 Evaluation of Additive Feature Attribution Methods


Across the developed explainable models for RS and ATFM discussed in the previous
sections, it has been noted that there remains inconsistency in the explanations
produced by the models. Since all the exploited explainable methods produce
explanations based on AFA, this paves the way for investigating AFA methods
further and evaluating them with a plausible technique. Additionally, it is worth
noting that the evaluation of XAI methods has not received adequate attention
(Mainali & Weber, 2023). To address this gap, a novel approach is developed for
evaluating XAI methods that produce AFA, using synthetic datasets where AddCBR
is used as the baseline to evaluate the XAI methods comparatively.

4.3.1.1 Additive Case-based Reasoning


CBR is frequently employed to produce instance-based explanations (Nugent &
Cunningham, 2005). However, it has limitations when it comes to local explanations,
much like additive models, because it predominantly relies on global weights. Global
weights are conducive to global interpretability, akin to the concept of global CBR.
Nonetheless, an additive variant of CBR, AddCBR, is developed, which adjusts
the values for the CBR regression model post-prediction. When considering the
predictive capabilities of CBR, AddCBR emerges as a promising benchmark for local
interpretability, as explained in Section E.2.5 of Paper E.

4.3.1.2 Evaluation with Synthetic Dataset


A novel approach is developed for evaluating XAI methods using synthetic datasets
retaining the behaviour of the original data. In the process, the use of AddCBR as
the baseline is successfully demonstrated for its value in evaluating AFA methods.
Particularly, the performance of the AFA methods, SHAP and LIME, has been
compared with the performance of the baseline AddCBR in terms of feature ranking,
attribution and impact. Finally, the consistencies of these functionalities have been
investigated with respect to local accuracy. The approach is elaborately described
in Section E.3 of Paper E. For brevity, the schematic diagram of the evaluation
approach is illustrated in Figure 4.10.

Figure 4.10: Schematic diagram of evaluating XAI methods for AFA using synthetic
dataset. The diagram comprises five steps, from synthetic data generation to
sensitivity analysis, with the XAI methods evaluated on feature ranking, feature
attribution and feature impact.

The steps shown in Figure 4.10 are briefly described below, and they are presented
in their entirety in Paper E.

Step 1: Generation of a synthetic dataset by capturing the underlying behaviour
of the data, which is used as a benchmark dataset to evaluate XAI
methods. This step will ensure that the generated data will only contain
the characteristics of the data that influence the inference mechanism of the
classifier or regression models. Several datasets are to be generated based
on different numbers of distinct behaviours, where one dataset is the main
dataset for evaluation, and others are for sensitivity analysis.
Step 2: Build a data model based on the problem context through training on the
synthetic data generated in the previous step.
Step 3: Develop XAI method producing AFA for explaining the classification or
regression tasks.
Step 4: Evaluate the XAI method on the basis of three aspects, i.e., feature ranking,
feature attribution, and feature impact. The significance of each of these
aspects is discussed in Section E.4.7 of
Paper E.
Step 5: Re-evaluate the XAI methods with the additional datasets generated in
Step 1 for sensitivity analysis.

4.3.2 Interpretable XGBoost


It is commonly known that XGBoost is a popular choice for regression tasks due
to its superior accuracy compared to other tree-based ML models. However, its
interpretability has been a concern, and various XAI methods, such as LIME, have
been employed to provide explanations for XGBoost predictions. These methods
rely on perturbed samples to generate the explanations. To address this issue, a new
approach called iXGB – interpretable XGBoost has been proposed in this thesis.
iXGB utilises the internal structure of XGBoost to generate rule-based explanations
and counterfactuals from the same data that the model trains on for prediction tasks.
The proposed approach has been evaluated on three different datasets in terms of
local accuracy and rule quality, and the results demonstrate that iXGB is capable
of significantly improving the interpretability of XGBoost. The working principle of
iXGB is presented in Section F.2 of Paper F.
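
The following is a simplified, hedged sketch of the general idea of reading rule-like
conditions from the internal structure of XGBoost by tracing one tree's decision
path for a single instance; it is an illustration only and not the iXGB algorithm of
Paper F.

# Minimal sketch (illustrative, not the iXGB method): tracing the decision path
# of the first booster tree in an XGBoost model to obtain rule-like conditions.
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=9)
feature_names = [f"f{i}" for i in range(X.shape[1])]   # default booster names
model = XGBRegressor(n_estimators=50, max_depth=3, random_state=9).fit(X, y)

trees = model.get_booster().trees_to_dataframe()    # all split nodes as rows
tree0 = trees[trees.Tree == 0].set_index("ID")

def trace_rules(instance):
    """Collect the split conditions met by one instance in the first tree."""
    rules, node_id = [], "0-0"
    while True:
        node = tree0.loc[node_id]
        if node.Feature == "Leaf":                   # reached a leaf value
            return rules, node.Gain
        value = instance[feature_names.index(node.Feature)]
        if value < node.Split:
            rules.append(f"{node.Feature} < {node.Split:.3f}")
            node_id = node.Yes
        else:
            rules.append(f"{node.Feature} >= {node.Split:.3f}")
            node_id = node.No

rules, leaf_value = trace_rules(X[0])
print(" AND ".join(rules), "->", round(float(leaf_value), 3))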

Chapter 5

Summary of the Included Papers

This chapter presents the summaries of the included papers, the authors’
contributions, and the significant findings.

The papers included in this thesis comprise three journal papers and four
peer-reviewed conference papers. Five of these papers have already been published,
while Papers E and F are under review for publishing in a journal and a conference,
respectively. The subsequent sections present the summary and key findings of the
included papers with the title, authors’ contributions, and publication details. In
addition, the presented contributions in the corresponding papers are mapped with
the research contributions (RC) of this doctoral research, which are described in
Section 1.4.

5.1 Paper A1

Title. Deep Learning for Automatic EEG Feature Extraction: An Application in
Drivers’ Mental Workload Classification (Islam et al., 2019).

Authors. Islam, M. R., Barua, S., Ahmed, M. U., Begum, S. & Di Flumeri, G.

Status. Published in L. Longo & M. C. Leva (Eds.), Human Mental Workload:
Models and Applications. H-WORKLOAD 2019. Communications in Computer and
Information Science, 1107, (pp. 121–135), Springer Nature Switzerland, 2019.

Authors’ Contributions. Islam is the main author of the paper. He developed the
methodology, executed the implementation, analyzed the results, and wrote the paper.
Barua contributed to the study design and experiments. Ahmed and Begum guided
the study and manuscript preparation. Di Flumeri helped acquire the data and
provided feedback on the methodology as an expert in physiological signal processing.


Summary. The study presented in this paper was initially motivated by the
need to classify drivers’ mental workload (MWL) intended for applications in
Road Safety (RS) using physiological measures, particularly Electroencephalography
(EEG) signals that are considered a suitable measure for MWL. The study was
further influenced by the urge to automate the feature extraction techniques from
EEG signals, reducing manual methods. The paper explored the use of Deep
Learning (DL) algorithms for automatic feature extraction from the EEG signals
to classify drivers’ MWL. It presents a comparative study on DL-based feature
extraction techniques, specifically the Convolutional Neural Network Autoencoder
(CNN-AE), with traditional manual methods. The results demonstrate that
the CNN-AE approach outperforms traditional methods in terms of classification
accuracy. Particularly, four different models – Support Vector Machine (SVM),
k-Nearest Neighbours (kNN), Random Forest (RF) and Multi-layer Perceptron
(MLP) were used to classify MWL in combination with both CNN-AE and traditional
feature extraction methods. The key results of the evaluation experiments reveal
that the highest value for the Area Under the Receiver Operating Characteristic
Curve (AUC-ROC) reached 0.94 while using features extracted by CNN-AE with
an SVM classifier. In contrast, traditional feature extraction methods yielded a
maximum AUC-ROC of 0.78 with an MLP classifier. Thus, this study highlights the
potential of DL techniques in easing the EEG feature extraction techniques and their
application in real-time scenarios to classify MWL, with implications for monitoring
human participants in various safety-critical domains.

Research Contributions. Paper A1 presents the following contributions of this
doctoral thesis:

• Development of an automated feature extraction technique from EEG signals
to reduce the complexities of traditional manual tasks such as complex signal
processing tasks (part of RC2.1).
• Development of drivers’ MWL classifier models intended for the expert systems
monitoring drivers’ state in the domain of RS (part of RC2.2).

5.2 Paper A2

Title. A Novel Mutual Information Based Feature Set for Drivers’ Mental Workload
Evaluation using Machine Learning (Islam et al., 2020).

Authors. Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., Aricò, P., Borghini,
G. & Di Flumeri, G.

Status. Published in Brain Sciences, 10(8), 551-573, 2020.

Authors’ Contributions: Islam is the main author of the paper. He developed the
methodology, performed the formal analysis, executed the implementation, analyzed
the results, and prepared the original draft of the paper. Barua contributed to
conceptualising the methodology and participated in the discussion while writing
the paper. Ahmed and Begum supervised the study and provided feedback on the
manuscript. Aricò, Borghini, and Di Flumeri acquired and curated the recorded EEG
signals used in this study and reviewed the paper as experts in physiological signal
processing.

Summary. Using EEG signals in MWL assessment while driving requires frequent
use of invasive recording equipment on drivers. Moreover, the features extracted from
the EEG signals are less interpretable for general users. To mitigate these issues,
this paper is motivated to develop a novel methodology for creating a feature set by
fusing EEG and vehicular signals together and utilizing the feature set in assessing
drivers’ MWL. The findings of this study include significant changes in MWL due to
different driving environments and patterns reflected in vehicular signals. With these
vehicular signals recorded live while driving and the predefined template containing
the Mutual Information (MI) between EEG and vehicular signals, a hybrid feature set
is generated for drivers’ MWL quantification and classification. The study compared
the performance of different Machine Learning (ML) algorithms, such as Linear and
Logistic Regression, MLP, RF and SVM, in corresponding tasks of MWL assessment
and classifying events. In these tasks, both MI- and EEG-based feature sets were
used. The results of MWL assessment tasks demonstrate that the performances of
the ML models are similar while using MI- and EEG-based feature sets. However,
the result of event classification is better while using the MI-based features. On the
contrary, the outcome of a statistical analysis on the performance of classification
tasks suggests that the SVM classifier with MI-based features performed significantly
better in both tasks compared to the other classifiers. This indicates that using
MI-based features can be a viable alternative to EEG-based features for evaluating
MWL and classifying events in driving scenarios.

Research Contributions. The following contributions are disseminated through
Paper A2:

• Development of an approach to creating a comprehensive feature set using MI
from EEG and vehicular signals to reduce the reliance on the invasive EEG
recording apparatus
(part of RC2.1).
The findings of this paper regarding the use of MI motivated its use in
explaining decisions of MWL classifiers with features that are comprehensible
to end-users.
• Development of drivers’ MWL quantification and classification models. (part
of RC2.2).

5.3 Paper B

Title. A Systematic Review of Explainable Artificial Intelligence in Terms of
Different Application Domains and Tasks (Islam et al., 2022).

Authors. Islam, M. R., Ahmed, M. U., Barua, S. & Begum, S.


Status. Published in Applied Sciences, 12(3), 1353-1390, 2022.

Authors’ Contributions. Islam, being the paper’s main author, planned and
conducted the literature review, prepared the original manuscript, and wrote the
discussion section by consulting with the co-authors. Ahmed and Barua suggested and
scrutinized several articles to include in the review study. Begum provided feedback
while preparing the manuscript.

Summary. The research on XAI has emerged with various studies exploring the
philosophy and methodologies of explaining AI models. Despite this, there remains
a noticeable dearth of secondary studies focused on the application domains and
tasks, serving as an entry point for researchers from diverse fields to integrate
XAI methods. To fill this gap, this paper presents a systematic literature review
of recent developments in XAI methods and evaluation metrics across various
application domains and tasks. The analysis covers 137 articles identified from
prominent bibliographic databases, providing several key insights. The findings
reveal a predominant development of XAI methods for safety-critical domains like
healthcare, with comparatively less attention given to domains such as judiciary, road
safety, aviation, etc. Additionally, DL and ensemble models are more prevalent than
other AI or ML models. Visual explanations prove more acceptable to end-users,
while robust evaluation metrics for assessing explanation quality are still in the
developmental stage.

Research Contributions. The contributions of Paper B to this doctoral thesis are:

• An extensive summary of the XAI methods that are exploited in different
applications and domains (RC1.1).
• A review on the development of evaluation approaches for XAI methods (part
of RC1.2).

5.4 Paper C

Title. Local and Global Interpretability using Mutual Information in Explainable
Artificial Intelligence (Islam et al., 2021).

Authors. Islam, M. R., Ahmed, M. U. & Begum, S.

Status. Published in Proceedings of the 8th International Conference on Soft
Computing & Machine Intelligence (ISCMI), 191–195, 2021.

Authors’ Contributions. Islam led the study and is the paper’s main author.
He developed the methodology, conducted the experiments, and prepared the original
manuscript. Ahmed and Begum supervised the study and provided feedback on the
manuscript preparation.


Summary. The motivation of the paper is to address the need for Explainable
Artificial Intelligence (XAI) in the context of mental workload classification using
EEG data. The authors propose a hybrid approach that utilizes MI to explain the
inference mechanism and decisions of AI or ML models. This approach involves
a convolutional autoencoder for feature extraction, a classification model, and the
use of MI to provide global and local interpretability. The study demonstrates the
application of this approach in classifying drivers’ mental workload using EEG data,
showing promising performance accuracy and the ability to explain the model’s
behaviour using MI and SHAP values. The paper highlights the potential of the
proposed approach in providing interpretable explanations for the model’s decisions
and the need for further research to explore other DL architectures and improve the
quality of explanations.

Research Contribution. This paper contributes to this thesis by presenting
the development of a model for drivers’ MWL classification and generation of
explanations at local and global scope with a human-understandable feature set (part
of RC2.2).

5.5 Paper D

Title. Interpretable Machine Learning for Modelling and Explaining Car Drivers’
Behaviour: An Exploratory Analysis on Heterogeneous Data (Islam et al., 2023).

Authors. Islam, M. R., Ahmed, M. U. & Begum, S.

Status. Published in Proceedings of the 15th International Conference on Agents
and Artificial Intelligence (ICAART), 2, 392-404, 2023.

Authors’ Contributions. As the main author of the paper, Islam led the study,
developed the methodology, conducted the experiments, and prepared the original
manuscript. Ahmed and Begum provided general supervision and feedback on the
paper writing.

Summary. The paper presents a study that explores the variation of drivers’
behaviour in a simulator and track driving to enhance simulator technologies, which
are widely used in the domain of RS. The study includes a comparative analysis of
car drivers’ behaviour in a simulator and track driving for different traffic situations.
The outcome of the comparative analysis identifies biases and differences in driving
behaviours between the two driving environments. Five different ML classifier models
(i.e., Gradient Boosted Decision Trees (GBDT), Logistic Regression, MLP, RF and
SVM) are developed to classify risk and hurry in drivers’ behaviour. The results
demonstrate that among the classifiers, GBDT performed best with a classification
accuracy of 98.62%. The study also develops explanation models based on additive
feature attribution (AFA) to explain the decisions made by the classifier models.
These explanations provide insights into the factors and features contributing to risky
or hurried driving behaviour, allowing for a better understanding of the underlying
causes. Lastly, the study proposes a system for drivers’ behaviour monitoring in
simulated driving. This system includes features, e.g., Global Positioning System
(GPS) plots, heatmaps, and event markers to visualize driving behaviour. It also
provides explanations for specific risky or hurried events, allowing for targeted
feedback and instruction to modify drivers’ behaviour and create a safer road
environment. Overall, this study contributes to the enhancement of simulator
technologies by identifying biases and differences in driving behaviour. It also
provides a framework for developing driver monitoring systems that can detect and
classify risky or hurried driving behaviour, as well as explain the underlying factors
contributing to these behaviours, thus allowing for targeted interventions and training
to improve RS.

Research Contribution. This paper contributes explainable models for classifying and
interpreting drivers’ in-vehicle behaviour, i.e., risky and hurried driving behaviour (RC2.3). In
addition, a low-fidelity prototype of the application of these models is presented as
an example of transparent DSS for RS.

5.6 Paper E

Title. Investigating Additive Feature Attribution for Regression.

Authors. Islam, M. R., Weber, R. O., Ahmed, M. U. & Begum, S.

Status. Under review for publishing in Artificial Intelligence, 2023.

Authors’ Contributions. Islam led the study and is the main author of the paper.
He designed the study, performed implementation, analyzed the results, prepared the
illustrations, and wrote the whole paper. Weber guided in designing the study, result
analysis and presentation, and manuscript preparation. Ahmed and Begum provided
general supervision and feedback on the manuscript.

Summary. The literature on XAI has produced studies showing that explainable
methods for feature attribution produce inconsistent results. This inconsistency in
explanations makes the evaluation of XAI methods crucial, but the existing body
of literature on evaluation techniques is still immature, with multiple proposed
techniques and no consensus on the best approaches for each circumstance.
Moreover, there is a lack of widely accepted evaluation methods for explaining
the decisions of AI algorithms. This paper investigates an approach to creating
synthetic data that can be used to evaluate methods that explain the decisions of
AI algorithms. From a real-world dataset, the proposed approach describes how to
create synthetic data that preserves the patterns of the original data and enables
comprehensive evaluation of XAI methods. Particularly, the proposed approach is
described for the explainable methods that produce AFA to describe the contribution
of individual features in decision-making. The application of the proposed approach
is illustrated in predicting flight Take-off Time (TOT) delays. The results of primary
and sensitivity analysis show that the performances of the AFA methods align
with previous literature for regression tasks. Additionally, the additive form of
Case-based Reasoning (CBR), namely AddCBR, is derived. Evaluations in the
paper demonstrate that AddCBR serves as a suitable benchmark for evaluating AFA
methods. In the entirety, this paper contributes to the advancement of evaluation
techniques for XAI methods and provides insights into the performance of AFA
methods using the proposed synthetic data approach.

Research Contributions. Paper E presents several key contributions of this
thesis work:

• Summary of the XAI research in terms of evaluation of the explainable methods
(part of RC1.2).
• Developed an explainable model for flight TOT delay prediction intended for
the domain of Air Traffic Flow Management (ATFM) (RC2.4).
The description of the model is not elaborately presented in the paper since the
manuscript is prepared with the core contribution to XAI research. However,
Section 4.2.1 discusses the development of the flight TOT delay prediction
model.
• Developed AddCBR for generating AFA to explain an AI model’s decision and
for evaluating other AFA methods as a baseline (RC3.1).
• Proposed and demonstrated an approach for evaluating AFA methods using
the intrinsic behaviour of the data (RC3.2).

5.7 Paper F

Title. iXGB: Improving the Interpretability of XGBoost using Decision Rules and
Counterfactuals.

Authors. Islam, M. R., Ahmed, M. U., & Begum, S.

Status. Under review for publication at the 16th International Conference on Agents
and Artificial Intelligence (ICAART), 2024.

Authors’ Contributions. Islam developed the idea, implemented the proposed
approach, conducted the evaluation experiments, and prepared the manuscript.
Ahmed and Begum supervised the study, validated the concept and provided feedback
on manuscript preparation.

Summary. The paper discusses the challenges of interpretability in tree-ensemble
models like Extreme Gradient Boosting (XGBoost) and proposes an approach called
iXGB – interpretable XGBoost. While XGBoost is known for its high prediction
accuracy, it is less interpretable compared to traditional tree-based models. To add
interpretability, surrogate models are generally used to explain the model’s decision
using an arbitrary interpretable model and synthetic data generated from sampling
the original data. On the other hand, iXGB aims to improve the interpretability
of XGBoost by generating a set of rules from its internal structure and the original
data characteristics. The approach also includes generating counterfactuals to aid in
understanding the operational relevance of the rules. The paper presents experiments
on both real and benchmark datasets, demonstrating reasonable interpretability of
iXGB without using surrogate methods.
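As a rough, hedged illustration of the general direction (and not the actual iXGB implementation), the sketch below reads split conditions out of a fitted XGBoost model through its tree dump; iXGB additionally exploits the characteristics of the original data and generates counterfactuals, which are omitted here.

    # Minimal sketch: listing split conditions of the first tree of an XGBoost model.
    import xgboost as xgb
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=500, n_features=5, random_state=0)
    model = xgb.XGBRegressor(n_estimators=10, max_depth=3).fit(X, y)

    # One row per tree node: Tree, Feature, Split threshold, child node IDs, Gain, ...
    tree_df = model.get_booster().trees_to_dataframe()
    first_tree = tree_df[(tree_df.Tree == 0) & (tree_df.Feature != "Leaf")]
    for _, node in first_tree.iterrows():
        print(f"if {node.Feature} < {node.Split:.3f} go to {node.Yes}, else {node.No}")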

Research Contribution. The contributions of Paper F are listed below:

• A novel method to explain the decisions of XGBoost without using a surrogate
XAI method (part of RC3.1).
• An explainable flight TOT delay prediction model that explains the decision
with sets of rules and counterfactuals (part of RC2.4).
The paper presents the explanation generation, whereas the development of the
prediction model is reported in Section 4.2.1.

Chapter 6

Discussions, Conclusion and Future Works

This chapter discusses the findings in the context of the research questions,
summarises the study’s main findings, presents the limitations of this study,
and suggests areas for future research.

This doctoral research has been conducted with the aim of advancing the research for
Explainable Artificial Intelligence (XAI) and enhancing the transparency of Decision
Support Systems (DSS). During the study, several explainable models have been
developed and evaluated. These models are intended for different expert systems, or
DSSs in a broader sense, in the selected domains of Road Safety (RS) and Air Traffic
Flow Management (ATFM). The following sections discuss the findings of the doctoral
study, including the answers to the RQs defined in Section 1.3, present the challenges
and limitations of this study, and conclude with directions for future research.

6.1 Discussions

A DSS is generally regarded as a specialized information system designed to assist
humans in the decision-making tasks of solving complex problems and making
informed choices. There are various types of DSS, such as expert systems,
recommender systems, etc., each catering to specific decision-making needs (Arnott
& Pervan, 2014). DSS plays a crucial role in decision-making processes by providing
valuable insights, facilitating data analysis, and enhancing overall decision quality.
Artificial Intelligence (AI) models form the foundation of the DSS, leveraging
advanced computational capabilities (Negnevitsky, 2004). However, a common
challenge with AI-based DSS is their lack of transparency, making it difficult for
users to understand the rationale behind the systems’ decisions or recommendations.
The need for transparency in DSS is paramount, especially in safety-critical domains
like aviation, defence, finance, judicial, medical, transport, etc. (Gunning & Aha,
2019; Islam et al., 2022). To address this, transparency should be incorporated into
the design of DSS, allowing users to tailor the level of transparency based on their
specific requirements. Achieving transparency by design involves integrating clear
explanations of AI models, thus enhancing trust in the intelligent systems (Miller,
2019). In this thesis, transparency is achieved through a series of exploratory research
studies, ensuring that the outcomes are not only effective but also comprehensible to
users in the domains of RS and ATFM. These domains are chosen for this doctoral
research since the studies have been supported by the research projects mentioned
in Chapter 3.
Initially, within the frameworks of the projects SIMUSAFE and BrainSafeDrive
from the domain of RS, the study has revolved around assessing drivers’ in-vehicle
state and behaviours. The initial research works address the need to classify
drivers’ mental workload (MWL) with a focus on applications in RS. They leveraged
physiological measures, particularly Electroencephalography (EEG) signals, which
are deemed suitable for MWL assessment. A key motivation is to automate the
feature extraction process from EEG signals, reducing reliance on manual methods.
Given the challenges of using EEG signals for MWL assessment, including the intrusive
nature of the recording equipment and less interpretable features, the study introduces
a novel methodology. This method involves creating a feature set by combining
EEG and vehicular signals, aiming to enhance the interpretability and efficiency of
assessing drivers’ MWL from the perspective of both features and decisions. The
project ARTIMATION dealt with ATFM as a specific application domain in
aviation. The study for ATFM explores the use of XAI methods in explaining flight
take-off time (TOT) delay to Air Traffic Controllers (ATCO) predicted by ML-based
predictive models. Here, three post-hoc explanation methods are employed to explain
the models’ predictions. Quantitative and user evaluations are conducted to assess
the acceptability and usability of the XAI methods in explaining the predictions to
ATCOs as suggested in the literature (Liu et al., 2021; Troncoso-García et al., 2023).
After conducting a series of studies involving the development of explainable
and interpretable models for RS and ATFM with different XAI methods from
the literature, various inconsistencies in the existing XAI methods and evaluation
approaches have been observed. These inconsistencies are also evident in the
literature, such as the limitations of applying XAI methods designed for classification
problems to regression problems (Letzgus et al., 2022). By
addressing these inconsistencies, the research in XAI is advanced through the
later experiments in this doctoral study. Particularly, novel methods for Additive
Feature Attribution (AFA), rule and counterfactual explanations are proposed. The
performances of the proposed models are presented in the corresponding papers that
demonstrate better output than the existing methods from the literature. In addition,
a robust method for evaluating AFA methods is also put forward to address the need
for a plausible consensus on evaluation approaches in XAI (J. Zhou et al., 2021).
As a whole, the doctoral thesis produced explainable models for applications
of two different safety-critical domains. In addition, novel XAI methods and an
evaluation approach are developed, which contribute to the core body of research in
XAI. All the outcomes of this thesis are centred on enhancing transparency in DSS,
which is a key requirement that AI systems should meet in order to be trustworthy
to end-users (AI HLEG - European Commission, 2019).
To summarise the outcome of this doctoral research, the answers to the stated
RQs are discussed in this section with references to the corresponding sections and
included papers. Additionally, the issues raised while conducting the presented
research studies of this thesis are discussed, followed by the limitations of this
doctoral research.

6.1.1 Discussion on Research Question 1


How are the XAI methods implemented and evaluated across various application
domains?

Explainability techniques are essential to ensure transparency and trustworthiness in
AI applications. The models in AI applications can often be complex and intricate,
making it challenging to understand how they reach a particular decision. This lack
of transparency leads to scepticism and mistrust among users and stakeholders. The
techniques for incorporating explainability into AI applications can bridge this gap
by providing insights into the decision-making processes of AI models. They help
in revealing the internal mechanism of these models, making the decision-making
process more interpretable and understandable for both technical and non-technical
users. Moreover, explainability is not just a matter of convenience; it is a necessity
for ethical, legal, and regulatory reasons. Various laws and regulations mandate
that AI systems be transparent and provide explanations for their
decisions (Wachter et al., 2018; Gunning & Aha, 2019; Xu et al., 2019). Furthermore,
explainability techniques play a critical role in bias detection and mitigation, ensuring
fairness and accountability in AI applications (Miller, 2019; Hoffman et al., 2023).
In summary, these techniques are the key aspects of responsible and ethical AI
development with regulatory compliance and promoting user trust.
To answer the RQ1, several secondary investigations have been conducted and
presented in five research papers that are included in this thesis. Among the
papers, Paper B contains a systematic literature review on the development and
evaluation of explainability techniques for applications and tasks across different
domains. Precisely, Paper B (Section B.5.3) contains the outcome of the review
study on the development of XAI methodologies in terms of the input data, AI models
used for the primary tasks of classification or regression, methods of explainability
including their evaluation approaches, and the forms of explanation. In addition,
the evaluation approaches for the XAI methods are also explored and discussed in
Paper E (Section E.2.3). In the preambles of Papers C and D, the research works
on the use of explainable techniques in the domain of RS have been addressed. For
the applications in ATFM, the advancement of AI, let alone XAI, has been slow,
since the domain places the safety of human lives above all else (Degas et al., 2022).
Still, there are a few works on the development of XAI methods for ATFM that
are discussed in Section 2.3.1. It is worth mentioning that, in both the domains
of interest, i.e., RS and ATFM, a scarcity of research works on XAI is evident in
the literature, which is also presented and discussed in Paper B (Sections B.5.2 and
B.6.2.1). In essence, the answers to RQ1 produced literature reviews that can serve
as an entry point for researchers from diverse domains to integrate XAI methods in
their respective domains, thus complementing the entire RC1, i.e., a summary of
the current developments in XAI methods and evaluation approaches in different
application domains and tasks.


6.1.2 Discussion on Research Question 2


How can the XAI methods be exploited to enhance transparency in DSS?

By design, the RQ spans several aspects of DSS, such as data transparency and
XAI methods for generating explanations and their evaluation. In order to address
each aspect, particular sub-RQs have been formulated that collectively correspond
to the RC2 and RC3 of this thesis. These contributions are centred on the
development of explainable models that are both domain-specific and domain-independent.
The following subsections present the discussion on each of the sub-RQs of RQ2.

6.1.2.1 Discussion on Research Question 2.1


How can the feature extraction techniques be enhanced to develop features that
are comprehensible to humans?

The RQ2.1 has been established as a complementary question to introduce explainability
into DSS, especially for the domain of RS that utilises EEG signals for driver
monitoring. Automated feature extraction from EEG signals is crucial as manual
methods are time-consuming and demanding (Tzallas et al., 2009). Moreover, EEG
signals during natural driving impose complexity due to equipment and in-vehicle
systems (Solovey et al., 2014). However, in-vehicle EEG recording is unfavourable
for natural driving. Thus, an approach for monitoring drivers’ MWL and behaviour
with minimal EEG signal utilisation is needed. This approach can facilitate efficient
assessment of MWL and behaviour during driving tasks, saving time and effort while
maintaining accuracy. Furthermore, an automated and enhanced feature set would
facilitate the explainability methods to generate explanations more comprehensibly
for the intended user.
To address this research question, in Paper A1, an automated technique with
autoencoder (AE) has been proposed and demonstrated successfully for extracting
features from EEG signals. Notably, the accuracy of classifying the drivers’ MWL
increases using the AE-extracted features. In another study, a novel Mutual
Information (MI) based template for presenting the features from EEG signals is
developed using vehicular signals. The methodology is described in Section A2.3.3
of Paper A2. Collectively, these two papers present the RC2.1 and address the
particular requirement of data transparency from the guidelines for trustworthy AI
(AI HLEG - European Commission, 2019). Furthermore, the implications of MI in
devising the novel feature template motivated its use in generating explanations with
AE-extracted features in the study presented in Paper C.
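For illustration only, the following minimal sketch shows how mutual information can relate candidate vehicular and EEG-derived features to a workload target with scikit-learn; the file and column names are hypothetical, and the exact MI-based feature template is the one described in Paper A2.

    # Minimal sketch: ranking hypothetical driving features by mutual information.
    import pandas as pd
    from sklearn.feature_selection import mutual_info_regression

    df = pd.read_csv("driving_features.csv")                          # hypothetical file
    feature_cols = ["speed_mean", "steering_std", "eeg_theta_power"]  # hypothetical columns
    X, y = df[feature_cols], df["mwl_score"]                          # hypothetical target

    mi = mutual_info_regression(X, y, random_state=0)
    for name, score in sorted(zip(feature_cols, mi), key=lambda t: -t[1]):
        print(f"{name}: MI = {score:.3f}")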

6.1.2.2 Discussion on Research Question 2.2


How can the explanations be derived for the decisions of AI or ML models?

Deriving explanations for black box models to enhance transparency in DSS is a
multifaceted challenge that has gained significant attention in recent years. Various
techniques and approaches have emerged to address this issue. One common
strategy is to utilise post-hoc explainability methods, which generate explanations
after the model has made a prediction. These methods include techniques like
LIME – Local Interpretable Model-agnostic Explanations (Ribeiro et al., 2016)
and SHAP – Shapley Additive Explanations (Lundberg & Lee, 2017), which
approximate the model’s behaviour for specific instances. Another approach involves
using inherently interpretable models, such as decision trees or linear regression,
in lieu of black box models, although this might come at the cost of predictive
accuracy (Gunning, 2017). Additionally, researchers are exploring the integration
of transparency mechanisms directly into complex models, like Neural Networks
(NN), by designing attention mechanisms and saliency maps that highlight the most
influential features. Altogether, a combination of post-hoc explainability techniques,
interpretable models, and model design modifications are key strategies for deriving
explanations in black box models to enhance transparency in DSS, ensuring that
AI-driven decisions are understandable and trustworthy. Precisely, the prominent
methods of deriving explanations for DSS based on the presentations of explanations
are counterfactuals (e.g., DiCE by Mothilal et al. (2020), MACE by W. Yang et al.
(2022)), example-based (e.g., Nugent and Cunningham, 2005; Kenny and Keane,
2019), feature attribution (e.g., LIME by Ribeiro et al. (2016), SHAP by Lundberg
and Lee (2017), DeepLIFT by Shrikumar et al. (2017)), instance attribution (e.g.,
HyDRA by Y. Chen et al. (2021), influence function by Koh and Liang (2017),
RelatIF by Barshan et al. (2020)) and method specific presentations such as paths
of a decision tree (e.g., Izza et al., 2022).
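As a brief, hedged illustration of such post-hoc local explanations, the sketch below applies LIME to a generic regression model; the data and model are placeholders rather than any of the models developed in this thesis.

    # Minimal sketch: a local additive explanation for one regression prediction.
    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=1000, n_features=6, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X, y)

    explainer = LimeTabularExplainer(X, mode="regression",
                                     feature_names=[f"f{i}" for i in range(6)])
    explanation = explainer.explain_instance(X[0], model.predict, num_features=6)
    print(explanation.as_list())   # [(feature condition, signed contribution), ...]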
Particularly, this thesis work targets the application domains of RS and ATFM.
Based on the nature of the available data and the required explanation, i.e., the
contributions of the important features to the decision, AFA methods were developed
for the respective applications. Mostly, LIME and SHAP, being the most popular tools for
generating Additive Feature Attribution (AFA) (Islam et al., 2022; Mainali & Weber,
2023), were chosen to be exploited in this study. For the drivers’ Mental Workload
(MWL) assessment, in Paper C, SHAP has been used in association with MI to
generate global explanations, which are presented using Chord diagrams (Tintarev et al.,
2018). On the other hand, LIME and SHAP were both used to generate explanations
on drivers’ behaviour monitoring in Paper D. However, the default illustrations from
these methods were simplified by generating the plots with the feature contribution
values extracted from the methods. Furthermore, different studies were carried out
for the domain of ATFM, which produced methods of generating explanations for
flight TOT delay prediction. The simplification of the illustration of LIME and
SHAP facilitated the user survey for evaluating the explanation of flight TOT delay
prediction described in Section 4.2.1. Paper E presents the use of LIME, SHAP, and
Additive CBR (AddCBR) in explaining the predicted flight TOT delay. Notably,
AddCBR, a novel method for producing AFA, is developed in this research study.
Finally, in Paper F, iXGB is proposed to interpret the decisions of Extreme Gradient
Boosting (XGBoost) models with sets of rules and counterfactuals.
The outcomes of the studies performed to address RQ2.2 yielded several
contributions of this thesis, which concern the development of explainable models
for the domains of RS and ATFM. Particularly, the explainable models developed
for RS correspond to RC2.2 and RC2.3, whereas the explainable model for ATFM
corresponds to RC2.4. Most importantly, the novel methods of generating different
forms of explanation conform to the RC3.1.


6.1.2.3 Discussion on Research Question 2.3


How can the explainability methods be evaluated with plausible techniques?

The major challenge of evaluating an explainability method is to develop an approach
that can deal with the different levels of expertise and understanding of users
(Doshi-Velez & Kim, 2017; Gunning, 2017; Mueller et al., 2019; Barredo Arrieta
et al., 2020; J. Zhou et al., 2021). Generally, these two characteristics of users vary
from person to person. Establishing a proper methodology is therefore necessary
for evaluating the explainability methods based on the intended users’ expertise and
capacity. A recent secondary study of 187 articles on the development of explainable
methods categorised the methods either as evaluated or not evaluated. There were
35 papers (19%) that evaluated the developed explainable methods and 152 papers
(81%) that did not evaluate them (Mainali & Weber, 2023). Particularly, regarding
the feature attribution methods, as they are widely adopted in this doctoral research,
the literature shows that XAI methods for feature attribution produce inconsistent
results (Letzgus et al., 2022). This inconsistency in explanations makes the evaluation
of XAI methods crucial.
To address the scarcity of plausible evaluation approaches for XAI methods,
particularly the AFA methods, RQ2.3 has been established and addressed in the
Papers D, E, and F. The methods developed for explaining the drivers’ behaviour in
Paper D were evaluated using nDCG values and Spearman’s correlation coefficient
for the feature ranks based on the contributions to the prediction. In Paper E,
a hybrid approach is demonstrated to evaluate the AFA methods using a synthetic
dataset and a baseline. The baseline is the AddCBR that is developed in this doctoral
study, and arguments for selecting it as the baseline are described in detail in Section
E.4.5.1 of Paper E. Lastly, iXGB, the method for interpreting the XGBoost model’s
decisions, is evaluated with quantitative experiments that are presented in Paper F.
In summary, the experiments performed to address RQ2.3 correspond to RC3.2 of this
thesis, which is to develop robust approaches for evaluating XAI methods.
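For reference, the minimal sketch below computes the two rank-agreement measures used in Paper D, Spearman’s correlation and nDCG, for a pair of illustrative feature-importance vectors; the numbers are made up for the example.

    # Minimal sketch: rank agreement between two feature-importance vectors.
    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import ndcg_score

    shap_importance = np.array([0.40, 0.25, 0.20, 0.10, 0.05])   # hypothetical values
    lime_importance = np.array([0.35, 0.30, 0.15, 0.15, 0.05])   # hypothetical values

    rho, _ = spearmanr(shap_importance, lime_importance)
    # ndcg_score expects 2D arrays: one relevance vector and one score vector.
    ndcg = ndcg_score(shap_importance.reshape(1, -1), lime_importance.reshape(1, -1))
    print(f"Spearman rho = {rho:.2f}, nDCG = {ndcg:.2f}")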

6.1.3 Discussion on Research Related Issues


Throughout the course of this thesis work, a number of crucial research issues were
identified and resolved. In the following discussion, these issues are presented,
highlighting the significant strides made towards overcoming them.

Dataset Acquisition. The datasets exploited in this doctoral study are from
two different domains, which were acquired within the framework of three different
research projects. For each of the datasets, the acquisition procedure was different,
as described in Section 3.1. In addition, the nature and content of the raw data
were also diverse across the domains. For the data from RS, the features include
vehicular signals, physiological signals and annotations from domain experts. On
the other hand, the data for ATFM contained different features related to aviation
operations. For all the datasets, domain-specific knowledge regarding the features
was required to preprocess and exploit the acquired data for the experimental studies
presented in this thesis. Moreover, additional efforts were required to adapt to
different data repositories, such as IBM Cloud (https://www.ibm.com/cloud) and the
EUROCONTROL Aviation Data for Research Repository
(https://www.eurocontrol.int/dashboard/rnd-data-archive), and their corresponding
data formats. These challenges
have been resolved through consultation with the respective domain experts from
the collaborating institutes in the research projects. It is worth mentioning that,
during the doctoral studies, a data collection experiment was planned within the
framework of the project SIMUSAFE, the protocol was designed, and ethical approval
was received from respective authorities. However, the experiment was postponed
due to the Coronavirus pandemic. Nevertheless, the study protocol is disseminated
in a co-authored article (Ahmed et al., 2021).

Choice of Models for Classification and Regression. One notable challenge
associated with the utilization of AI or ML pertains to the selection of an appropriate
model for addressing a classification or regression problem. When addressing such
problems, two distinct approaches, namely discriminative and generative methods,
are available for training a model. Discriminative methods typically exhibit superior
predictive performance compared to generative methods. Generative methods, on
the other hand, prove more beneficial when dealing with unlabelled data (Bishop
& Lasserre, 2007). The research conducted in this thesis involved labelled datasets,
facilitating the preference for discriminative algorithms over generative ones. The
decision to opt for a discriminative model enables the direct resolution of the
classification or regression problem without the need for an intermediate step, such
as determining the joint probability p(X, Y), where X represents the inputs and Y
is the label (Ng & Jordan, 2001). In the research outlined in this thesis, supervised
Machine Learning (ML) algorithms, including Linear and Logistic Regression,
k-Nearest Neighbours, Support Vector Machines, Random Forest, Gradient Boosted
Decision Trees, Extreme Gradient Boosting (XGBoost), Multi-layer Perceptron, and
Case-based Reasoning, were implemented and subsequently compared in the initial
studies of this doctoral research. Later in the studies, only a tree-based ensemble
method (i.e., XGBoost) has been exploited, given its potential in modelling tabular
data. In summary, the models have been chosen for the whole study based on their use
in the literature from the corresponding domains and insights into the advantages and
drawbacks of the models from the works of Kotsiantis (2007), T. Chen and Guestrin
(2016), Singh et al. (2016), Choudhary and Gianey (2017), and Sagi and Rokach
(2018).
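The discriminative-versus-generative distinction referred to above can be illustrated with a small, self-contained comparison (not an experiment from this thesis): logistic regression models the conditional p(Y | X) directly, whereas Gaussian naive Bayes models the joint p(X, Y) via p(X | Y)p(Y).

    # Minimal sketch: discriminative vs. generative classifier on toy data
    # (cf. Ng & Jordan, 2001); scores are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    for name, clf in [("discriminative (LogisticRegression)", LogisticRegression(max_iter=1000)),
                      ("generative (GaussianNB)", GaussianNB())]:
        acc = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{name}: mean accuracy = {acc:.3f}")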

Choice of XAI Methods. In the research studies of this thesis work, the
explanations are generated using AFA methods, i.e., LIME and SHAP. However,
there remains no consensus on appropriate evaluation metrics to assess the quality
of feature attributions from the adopted methods. This is due to the fact that
there is a lack of ground truth or ideal attribution values to evaluate the AFA methods
(Y. Zhou et al., 2022). As a consequence, the literature contains the use of different
metrics or methods for evaluating AFA methods. The gold features used by Ribeiro
et al. (2016), i.e., the most important features used by the prediction models, were
the closest form of ground truth. To mitigate the issue of missing ideal evaluation
criteria, this thesis work devised the evaluation method presented in Paper E, which
relies on a synthetic dataset that preserves the behaviour of the data. Here, the
behaviour of the data is hypothesized as the ground truth or benchmark that should
be followed in feature attribution.

User Evaluations of Explanations. User evaluations play a crucial role in
assessing the effectiveness of explanations. Troncoso-García et al. (2023) emphasise
the significance of user evaluations in assessing assumptions and intuitions underlying
effective explanations. Such strong advocacy for user evaluations is also evident in
the literature as they provide valuable insights into evaluating explanation methods
(B. Kim et al., 2018; van der Waa et al., 2018; Nauta et al., 2023). However, in reality,
it is difficult to conduct user evaluations due to the unavailability of the practitioners
in the target domain of the application. During the development of the explainable
model for flight TOT delay prediction, a brainstorming session was arranged at the
Ecole Nationale de l’Aviation Civile (ENAC) – National School of Civil Aviation (https://www.enac.fr),
Toulouse, France, with the practising and student ATCOs to gather potential use
cases and the requirements for explanations in their existing DSS. Surprisingly, some
of the veteran ATCOs were against modification of their existing interface. They
believed that modifying the existing system would increase their workload with
additional information, thus reducing their efficiency. Nonetheless, the practising
ATCOs, when urged, and the students participated in defining the scenarios and
requirements for an explanation of the flight TOT delay. After the development,
at the time of user evaluation, the availability of practising ATCOs was limited
(Jmoona et al., 2023). Because of this limited scope for qualitative evaluation, the focus
of this thesis work shifted towards devising robust quantitative evaluation.

6.1.4 Limitations

This doctoral thesis work is a combination of several studies. These studies are not
mutually exclusive, either in their entirety or in the nature of their applications. However,
the studies have several limitations that are briefly stated below.
The study presented in Paper A1 resulted in the development of an AE to extract
features from the EEG signals. The working principle of the AE is confined to
extracting EEG features only without any provision to extract features from other
physiological signals such as electrocardiography or galvanic skin response signals.
Moreover, the dataset acquired for the study is suitable for classification tasks by the
experimental design. Thus, the dataset only facilitated the evaluation of the features
extracted from AE on classification tasks.
In Paper A2, different ML algorithms were invoked to develop models for quantifying
and classifying drivers’ MWL and for event classification. These models were
used to assess the effectiveness of the developed hybrid feature set by comparing
their performances only with those of the other models developed in the study. This
absence of comparison with baseline models remains a limitation of this study.
Explainable models are developed for drivers’ MWL assessment and driving
behaviour monitoring, which are presented in Papers C and D, respectively. The
explanations generated from the explainable models are evaluated only quantitatively
because of the unavailability of appropriate users of the developed systems. However,
it is suggested in the literature that user evaluation provides valuable insights into
evaluating explanation methods through factors such as expertise, understanding,
and personal preferences (B. Kim et al., 2018; Nauta et al., 2023). The limitation of
the studies developing explainable models for RS thus concerns the lack of qualitative
evaluation of the models.
Paper E presents the AddCBR that is created as a functionally equivalent model
to XGBoost. During the process, the feature importance values from XGBoost are
used as the weights for the CBR model before transforming it to the additive form.
In this step, XGBoost can be replaced by any other model that produces similar
importance values for the features. However, AddCBR cannot produce feature
attributions for the decisions of models that are trained on an abstract representation
of features (e.g., NN). This remains a limitation of this thesis work.

6.2 Conclusion and Future Works

The exploratory research studies presented in this thesis work advance the XAI
research by developing explainable applications for the DSS from two safety-critical
domains, i.e., RS and ATFM. Besides, novel methods of explaining AI models’
decisions are proposed, including a robust approach for evaluating the AFA methods.
In addition, for the domain of RS, reducing the resource usage for monitoring
drivers while driving and extracting human-understandable features are noteworthy
outcomes of this doctoral research. The findings of the presented studies highlight
the potential of XAI methods in creating transparent DSS for different safety-critical
domains other than RS and ATFM. Though the transparency of a DSS is subjective
to its end-users, it can be enhanced from the algorithmic point of view through
theoretically grounded evaluation approaches, as demonstrated in this thesis
work.
This doctoral thesis comprises several studies from different perspectives
with the common goal of advancing XAI research. Hence, the recommended future
works also revolve around the advancement of the developed XAI methods and
their evaluation approaches, which are outlined below.
• In Paper E, synthetic data is generated to capture the intrinsic behaviour of
the data, which is used to evaluate the AFA methods. During the process of
synthetic data generation, different behaviours in the data are identified by
the density-based clustering method. It would be interesting to investigate
other unsupervised methods (e.g., AE) or generative modelling methods (e.g.,
Generative Adversarial Networks) for identifying the different behaviours in the
data and generating the synthetic data (a minimal sketch of the current clustering-based generation step is given after this list).
• One of the limitations of this thesis concerns the initial feature weights for
creating AddCBR. In the reported study, AddCBR receives the feature weights
from a tree-based model that restricts its incorporation with data models of
different working principles. Potential future research could be to formulate
methods of extracting the initial feature weights from models other than the
tree-based models, both singular and ensembles.


• In this thesis, two different XAI methods are proposed: AddCBR for generating
AFA from the predictions of tree-based models and iXGB for generating rules
and counterfactuals from predictions of XGBoost. The potential of these
methods is demonstrated for regression tasks. Yet, these models need to be
implemented and evaluated for binary and multi-class classification tasks.
• Both the methods, AddCBR and iXGB, are functionally evaluated in this thesis.
Besides, it is established that their performances in explanation generation
are better than those of similar methods. In future works, formal analyses of
the designs and performances of the proposed methods would establish their
completeness.
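The minimal sketch below, referred to in the first item above, illustrates the clustering-based generation step under simplifying assumptions (a hypothetical input file and Gaussian resampling per cluster); the actual generation procedure is the one described in Paper E.

    # Minimal sketch: identify behaviours with DBSCAN and resample each cluster.
    import numpy as np
    import pandas as pd
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    data = pd.read_csv("flight_features.csv").to_numpy()        # hypothetical file
    labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(
        StandardScaler().fit_transform(data))

    rng = np.random.default_rng(0)
    synthetic = []
    for cluster in set(labels) - {-1}:                          # ignore noise points
        members = data[labels == cluster]
        # Resample the cluster from a Gaussian fitted to its members.
        synthetic.append(rng.multivariate_normal(members.mean(axis=0),
                                                 np.cov(members, rowvar=False),
                                                 size=len(members)))
    synthetic_data = np.vstack(synthetic)                       # assumes clusters were found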

Bibliography

Aamodt, A., & Plaza, E. (1994). Case-Based Reasoning: Foundational Issues,
Methodological Variations, and System Approaches. AI Communications,
7 (1), 39–59.
Adadi, A., & Berrada, M. (2018). Peeking Inside the Black-Box: A Survey on
Explainable Artificial Intelligence (XAI). IEEE Access, 6, 52138–52160.
Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., & Kim,
B. (2018). Sanity Checks for Saliency Maps. Proceedings of the 32nd
International Conference on Neural Information Processing Systems
(NeurIPS), 9525–9536.
Ahmad, R. F., Malik, A. S., Kamel, N., Amin, H., Zafar, R., Qayyum, A., &
Reza, F. (2014). Discriminating the Different Human Brain States with
EEG Signals using Fractal Dimension- A Nonlinear Approach. 2014 IEEE
International Conference on Smart Instrumentation, Measurement and
Applications (ICSIMA), 1–5.
Ahmed, M. U., Barua, S., Begum, S., Islam, M. R., & Weber, R. O. (2022). When a
CBR in Hand Better than Twins in the Bush. In P. Reuss & J. Schönborn
(Eds.), Proceedings of the 4th Workshop on XCBR: Case-based Reasoning
for the Explanation of Intelligent Systems (XCBR) co-located with the 30th
International Conference on Case-Based Reasoning (ICCBR) (pp. 141–152).
CEUR.
Ahmed, M. U., Islam, M. R., Barua, S., Hök, B., Jonforsen, E., & Begum, S. (2021).
Study on Human Subjects – Influence of Stress and Alcohol in Simulated
Traffic Situations. Open Research Europe, 1, 83.
AI HLEG - European Commission. (2019). Ethics Guidelines for Trustworthy AI
(tech. rep.) [High-Level Expert Group on Artificial Intelligence (AI HLEG)].
European Commission.
Allignol, C., Barnier, N., Flener, P., & Pearson, J. (2012). Constraint Programming
for Air Traffic Management: A Survey. The Knowledge Engineering Review,
27 (3), 361–392.
Aricò, P., Borghini, G., Di Flumeri, G., Colosimo, A., Pozzi, S., & Babiloni, F. (2016).
A Passive Brain–Computer Interface Application for the Mental Workload
Assessment on Professional Air Traffic Controllers during Realistic Air Traffic
Control Tasks. In D. Coyle (Ed.), Progress in Brain Research (pp. 295–328).
Elsevier.
Aricò, P., Borghini, G., Di Flumeri, G., Colosimo, A., Bonelli, S., Golfetti, A.,
Pozzi, S., Imbert, J.-P., Granger, G., Benhacene, R., & Babiloni, F. (2016).
Adaptive Automation Triggered by EEG-Based Mental Workload Index:
A Passive Brain-Computer Interface Application in Realistic Air Traffic
Control Environment. Frontiers in Human Neuroscience, 10, 539.
Aricò, P., Borghini, G., Di Flumeri, G., Sciaraffa, N., Colosimo, A., &
Babiloni, F. (2017). Passive BCI in Operational Environments: Insights,
Recent Advances, and Future Trends. IEEE Transactions on Biomedical
Engineering, 64 (7), 1431–1436.
Arnott, D., & Pervan, G. (2014). A Critical Analysis of Decision Support Systems
Research Revisited: The Rise of Design Science. Journal of Information
Technology, 29 (4), 269–293.
Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik,
S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R.,
Chatila, R., & Herrera, F. (2020). Explainable Artificial Intelligence (XAI):
Concepts, Taxonomies, Opportunities and Challenges toward Responsible
AI. Information Fusion, 58, 82–115.
Barshan, E., Brunet, M.-E., & Dziugaite, G. K. (2020). RelatIF: Identifying
Explanatory Training Examples via Relative Influence. Proceedings of
the 23rd International Conference on Artificial Intelligence and Statistics
(AISTATS), 108.
Barua, S., Ahmed, M. U., & Begum, S. (2017). Classifying Drivers’ Cognitive Load
Using EEG Signals. In B. Blobel & W. Goossen (Eds.), Proceedings of the
14th International Conference on Wearable Micro and Nano Technologies for
Personalized Health (pHealth) (pp. 99–106). IOS Press.
Basheer, I. A., & Hajmeer, M. (2000). Artificial Neural Networks: Fundamentals,
Computing, Design, and Application. Journal of Microbiological Methods,
43 (1), 3–31.
Begum, S., & Barua, S. (2013). EEG Sensor Based Classification for Assessing
Psychological Stress. In B. Blobel, P. Pharow, & L. Parv (Eds.), Proceedings
of the 10th International Conference on Wearable Micro and Nano
Technologies for Personalized Health (pHealth) (pp. 83–88). IOS Press.
Biecek, P. (2018). DALEX: Explainers for Complex Predictive Models in R. Journal
of Machine Learning Research, 19 (84), 1–5.
Bishop, C. M., & Lasserre, J. (2007). Generative or Discriminative? Getting the
Best of Both Worlds. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P.
Dawid, D. Heckerman, A. F. M. Smith, & M. West (Eds.), Bayesian Statistics
(pp. 13–34). Oxford University Press.
Bonett, D. G. (2006). Confidence Interval for a Coefficient of Quartile Variation.
Computational Statistics & Data Analysis, 50 (11), 2953–2957.
Breiman, L. (2001). Random Forests. Machine Learning, 45 (1), 5–32.
Charles, R. L., & Nixon, J. (2019). Measuring Mental Workload using Physiological
Measures: A Systematic Review. Applied Ergonomics, 74, 221–232.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System.
Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), 785–794.
Chen, Y., Li, B., Yu, H., Wu, P., & Miao, C. (2021). HyDRA: Hypergradient Data
Relevance Analysis for Interpreting Deep Neural Networks. Proceedings of
the 35th AAAI Conference on Artificial Intelligence, 35(8), 7081–7089.
Choudhary, R., & Gianey, H. K. (2017). Comprehensive Review On Supervised
Machine Learning Algorithms. 2017 International Conference on Machine
Learning and Data Science (MLDS), 37–43.
Cook, A. J., & Tanner, G. (2015). European Airline Delay Cost Reference Values
(tech. rep.). University of Westminster. London, UK.
Dalmau, R., Belkoura, S., Naessens, H., Ballerini, F., & Wagnick, S. (2019).
Improving the Predictability of Take-off Times with Machine Learning.
Proceedings of the 9th SESAR Innovation Days.
Dalmau, R., Ballerini, F., Naessens, H., Belkoura, S., & Wangnick, S. (2021).
An Explainable Machine Learning Approach to Improve Take-off Time
Predictions. Journal of Air Transport Management, 95, 102090.
Degas, A., Islam, M. R., Hurter, C., Barua, S., Rahman, H., Poudel, M., Ruscio,
D., Ahmed, M. U., Begum, S., Rahman, M. A., Bonelli, S., Cartocci, G.,
Di Flumeri, G., Borghini, G., Babiloni, F., & Aricó, P. (2022). A Survey on
Artificial Intelligence (AI) and eXplainable AI in Air Traffic Management:
Current Trends and Development with Future Research Trajectory. Applied
Sciences, 12 (3), 1295.
DeYoung, J., Jain, S., Rajani, N. F., Lehman, E., Xiong, C., Socher, R., & Wallace,
B. C. (2020). ERASER: A Benchmark to Evaluate Rationalized NLP Models.
Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics (ACL), 4443–4458.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2018). EEG-Based
Mental Workload Neurometric to Evaluate the Impact of Different Traffic
and Road Conditions in Real Driving Settings. Frontiers in Human
Neuroscience, 12, 509.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2019). EEG-Based
Mental Workload Assessment During Real Driving: A Taxonomic Tool for
Neuroergonomics in Highly Automated Environments. Neuroergonomics,
121–126.
Doshi-Velez, F., & Kim, B. (2017). Towards A Rigorous Science of Interpretable
Machine Learning. ArXiv, (arXiv:1702.08608v2 [stat.ML]).
Došilović, F. K., Brčić, M., & Hlupić, N. (2018). Explainable Artificial Intelligence:
A Survey. 2018 41st International Convention on Information and
Communication Technology, Electronics and Microelectronics (MIPRO),
0210–0215.
Erdi, P. (2008). Complex Systems: The Intellectual Landscape. In Complexity
Explained (pp. 1–23). Springer Berlin Heidelberg.
European Commission. (2019). Road Safety: Commission Welcomes Agreement on
New EU Rules to Help Save Lives.
Fastenmeier, W., & Gstalter, H. (2007). Driving Task Analysis as a Tool in Traffic
Safety Research and Practice. Safety Science, 45 (9), 952–979.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi,
D. (2019). A Survey of Methods for Explaining Black Box Models. ACM
Computing Surveys, 51 (5), 1–42.
Gunning, D. (2017). Explainable Artificial Intelligence (XAI). Defense Advanced
Research Projects Agency (DARPA), 2 (2).
Gunning, D., & Aha, D. W. (2019). DARPA’s Explainable Artificial Intelligence
Program. AI Magazine, 40 (2), 44–58.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene Selection for Cancer
Classification using Support Vector Machines. Machine Learning, 46 (1),
389–422.
Hoffman, R. R., Mueller, S. T., Klein, G., & Litman, J. (2023). Measures for
Explainable AI: Explanation Goodness, User Satisfaction, Mental Models,
Curiosity, Trust, and Human-AI Performance. Frontiers in Computer
Science, 5, 1096257.
Hurter, C., Degas, A., Guibert, A., Durand, N., Ferreira, A., Cavagnetto, N., Islam,
M. R., Barua, S., Ahmed, M. U., Begum, S., Bonelli, S., Cartocci, G., Di
Flumeri, G., Borghini, G., Babiloni, F., & Aricó, P. (2022). Usage of More
Transparent and Explainable Conflict Resolution Algorithm: Air Traffic
Controller Feedback. Transportation Research Procedia, 66, 270–278.
Islam, M. R., Ahmed, M. U., Barua, S., & Begum, S. (2022). A Systematic Review of
Explainable Artificial Intelligence in Terms of Different Application Domains
and Tasks. Applied Sciences, 12 (3), 1353.
Islam, M. R., Ahmed, M. U., & Begum, S. (2021). Local and Global Interpretability
using Mutual Information in Explainable Artificial Intelligence. Proceedings
of the 8th International Conference on Soft Computing & Machine
Intelligence (ISCMI), 191–195.
Islam, M. R., Ahmed, M. U., & Begum, S. (2023). Interpretable Machine Learning for
Modelling and Explaining Car Drivers’ Behaviour: An Exploratory Analysis
on Heterogeneous Data. Proceedings of the 15th International Conference on
Agents and Artificial Intelligence (ICAART), 392–404.
Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., Aricò, P., Borghini, G., &
Di Flumeri, G. (2020). A Novel Mutual Information Based Feature Set
for Drivers’ Mental Workload Evaluation Using Machine Learning. Brain
Sciences, 10 (8), 551.
Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., & Di Flumeri, G. (2019).
Deep Learning for Automatic EEG Feature Extraction: An Application
in Drivers’ Mental Workload Classification. In L. Longo & M. C. Leva
(Eds.), Human Mental Workload: Models and Applications. H-WORKLOAD
2019. Communications in Computer and Information Science (pp. 121–135).
Springer Nature Switzerland.
Izza, Y., Ignatiev, A., & Marques-Silva, J. (2022). On Tackling Explanation
Redundancy in Decision Trees. Journal of Artificial Intelligence Research,
75, 261–321.
Jain, A., Duin, P., & Mao, J. (2000). Statistical Pattern Recognition: A
Review. IEEE Transactions on Pattern Analysis and Machine Intelligence,
22 (1), 4–37.
Järvelin, K., & Kekäläinen, J. (2002). Cumulated Gain-based Evaluation of IR
Techniques. ACM Transactions on Information Systems, 20 (4), 422–446.
Jmoona, W., Ahmed, M. U., Islam, M. R., Barua, S., Begum, S., Ferreira, A., &
Cavagnetto, N. (2023). Explaining the Unexplainable: Role of XAI for Flight
Take-Off Time Delay Prediction. In I. Maglogiannis, L. Iliadis, J. MacIntyre,
& M. Dominguez (Eds.), Artificial Intelligence Applications and Innovations.
AIAI 2023. IFIP Advances in Information and Communication Technology
(pp. 81–93). Springer Nature Switzerland.
Kenny, E. M., & Keane, M. T. (2019). Twin-Systems to Explain Artificial
Neural Networks using Case-Based Reasoning: Comparative Tests of
Feature-Weighting Methods in ANN-CBR Twins for XAI. Proceedings of
the Twenty-Eighth International Joint Conference on Artificial Intelligence
(IJCAI), 2708–2715.
Kim, B., Wattenberg, M., Gilmer, J., Cai, C. J., Wexler, J., Viégas, F., &
Sayres, R. (2018). Interpretability Beyond Feature Attribution: Quantitative
Testing with Concept Activation Vectors (TCAV). Proceedings of the 35th
International Conference on Machine Learning (ICML).
Kim, H., Yoon, D., Lee, S.-J., Kim, W., & Park, C. H. (2018). A Study on the
Cognitive Workload Characteristics according to the Driving Behavior in
the Urban Road. 2018 International Conference on Electronics, Information,
and Communication (ICEIC), 1–4.
Kim, Y. J., Choi, S., Briceno, S., & Mavris, D. (2016). A Deep Learning Approach
to Flight Delay Prediction. 2016 IEEE/AIAA 35th Digital Avionics Systems
Conference (DASC), 1–6.
Koh, P. W., & Liang, P. (2017). Understanding Black-box Predictions via Influence
Functions. Proceedings of the 34th International Conference on Machine
Learning (ICML), 1885–1894.
Kolodner, J. L. (1992). An Introduction to Case-Based Reasoning. Artificial
Intelligence Review, 6 (1), 3–34.
Kotsiantis, S. B. (2007). Supervised Machine Learning: A Review of Classification
Techniques. Informatica, 31 (3).
Kovarik, S., Doherty, L., Korah, K., Mulligan, B., Rasool, G., Mehta, Y., Bhavsar,
P., & Paglione, M. (2020). Comparative Analysis of Machine Learning and
Statistical Methods for Aircraft Phase of Flight Prediction. Proceedings of
International Conference for Research in Air Transportation (ICART).
Lacave, C., & Díez, F. J. (2002). A Review of Explanation Methods for Bayesian
Networks. The Knowledge Engineering Review, 17 (2), 107–127.
Lacave, C., & Díez, F. J. (2004). A Review of Explanation Methods for Heuristic
Expert Systems. The Knowledge Engineering Review, 19 (2), 133–146.
Larose, D. T. (2004). K-Nearest Neighbor Algorithm. In Discovering Knowledge in
Data: An Introduction to Data Mining (1st ed., pp. 90–106). Wiley.
Letzgus, S., Wagner, P., Lederer, J., Samek, W., Muller, K.-R., & Montavon, G.
(2022). Toward Explainable Artificial Intelligence for Regression Models: A
Methodological Perspective. IEEE Signal Processing Magazine, 39 (4), 40–58.
Leyli abadi, M., & Boubezoul, A. (2021). Deep Neural Networks for Classification of
Riding Patterns: With a Focus on Explainability. Proceedings of the European
Symposium on Artificial Neural Networks, Computational Intelligence and
Machine Learning (ESANN), 481–486.
Li, J. (2019). Regression and Classification in Supervised Learning. Proceedings of
the 2nd International Conference on Computing and Big Data (ICCBD),
99–104.
Lipton, Z. C. (2018). The Mythos of Model Interpretability. Communications of the
ACM, 61 (10), 36–43.
Liu, Y., Khandagale, S., White, C., & Neiswanger, W. (2021).
Synthetic Benchmarks for Scientific Research in Explainable Machine
Learning. In J. Vanschoren & S. Yeung (Eds.), Proceedings of the Neural
Information Processing Systems - Track on Datasets and Benchmarks
(NeurIPS Datasets and Benchmarks).
Loyola-Gonzalez, O. (2019). Black-Box vs. White-Box: Understanding Their
Advantages and Weaknesses From a Practical Point of View. IEEE Access,
7, 154096–154113.
Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model
Predictions. Proceedings of the 31st International Conference on Neural
Information Processing Systems (NeurIPS), 4768–4777.
Luxburg, U. V., & Schölkopf, B. (2011). Statistical Learning Theory: Models,
Concepts, and Results. In Handbook of the History of Logic (pp. 651–706).
Elsevier.
Mainali, M., & Weber, R. O. (2023). What’s meant by Explainable Model: A
Scoping Review. Proceedings of the Workshop on XAI co-located with the
32nd International Joint Conference on Artificial Intelligence (IJCAI).
Man, X., & Chan, E. P. (2021). The Best Way to Select Features? Comparing MDA,
LIME, and SHAP. The Journal of Financial Data Science, 3 (1), 127–139.
Martens, D., & Provost, F. (2014). Explaining Data-Driven Document Classifications.
MIS Quarterly, 38 (1), 73–99.
Mase, J. M., Agrawal, U., Pekaslan, D., Mesgarpour, M., Chapman, P., Torres, M. T.,
& Figueredo, G. P. (2020). Capturing Uncertainty in Heavy Goods Vehicles
Driving Behaviour. 2020 IEEE 23rd International Conference on Intelligent
Transportation Systems (ITSC), 1–7.
Mercado, J. E., Rupp, M. A., Chen, J. Y. C., Barnes, M. J., Barber, D., & Procci,
K. (2016). Intelligent Agent Transparency in Human–Agent Teaming for
Multi-UxV Management. Human Factors, 58 (3), 401–415.
Miller, T. (2019). Explanation in Artificial Intelligence: Insights from the Social
Sciences. Artificial Intelligence, 267, 1–38.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
Mothilal, R. K., Sharma, A., & Tan, C. (2020). Explaining Machine Learning
Classifiers through Diverse Counterfactual Explanations. Proceedings of the
2020 Conference on Fairness, Accountability, and Transparency (FAT*),
607–617.
Mueller, S. T., Hoffman, R. R., Clancey, W. J., Emery, A. K., & Klein, G. (2019).
Explanation in Human-AI Systems: A Literature Meta-Review Synopsis of
Key Ideas and Publications and Bibliography for Explainable AI (tech. rep.).
Defense Advanced Research Projects Agency (DARPA). Arlington, VA,
USA.
Nauta, M., Trienes, J., Pathak, S., Nguyen, E., Peters, M., Schmitt, Y., Schlötterer,
J., Van Keulen, M., & Seifert, C. (2023). From Anecdotal Evidence to
Quantitative Evaluation Methods: A Systematic Review on Evaluating
Explainable AI. ACM Computing Surveys, 55 (13s), 1–42.
Negnevitsky, M. (2004). Artificial Intelligence: A Guide to Intelligent Systems
(2nd ed.). Addison-Wesley.
Ng, A., & Jordan, M. (2001). On Discriminative vs. Generative Classifiers: A
Comparison of Logistic Regression and Naive Bayes. Advances in Neural
Information Processing Systems, 14.
Nugent, C., & Cunningham, P. (2005). A Case-Based Explanation System for
Black-Box Systems. Artificial Intelligence Review, 24 (2), 163–178.
Rai, A. (2020). Explainable AI: From Black Box to Glass Box. Journal of the Academy
of Marketing Science, 48 (1), 137–141.
Ramchoun, H., Amine, M., Idrissi, J., Ghanou, Y., & Ettaouil, M. (2016). Multilayer
Perceptron: Architecture Optimization and Training. International Journal
of Interactive Multimedia and Artificial Intelligence, 4 (1), 26.
Rebollo, J. J., & Balakrishnan, H. (2014). Characterization and Prediction of Air
Traffic Delays. Transportation Research Part C: Emerging Technologies, 44,
231–241.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”:
Explaining the Predictions of Any Classifier. Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), 1135–1144.
Richter, M. M., & Weber, R. O. (2013). Case-Based Reasoning: A Textbook . Springer.
Saccá, V., Campolo, M., Mirarchi, D., Gambardella, A., Veltri, P., & Morabito,
F. C. (2018). On the Classification of EEG Signal by Using an SVM
Based Algorithm. In A. Esposito, M. Faudez-Zanuy, F. C. Morabito, & E.
Pasero (Eds.), Multidisciplinary Approaches to Neural Computing. Smart
Innovation, Systems and Technologies (pp. 271–278). Springer International
Publishing.
Sagi, O., & Rokach, L. (2018). Ensemble Learning: A Survey. WIREs Data Mining
and Knowledge Discovery, 8 (4), e1249.
Saha, A., Minz, V., Bonela, S., Sreeja, S. R., Chowdhury, R., & Samanta, D. (2018).
Classification of EEG Signals for Cognitive Load Estimation Using Deep
Learning Architectures. In U. S. Tiwary (Ed.), Intelligent Human Computer
Interaction. IHCI 2018. Lecture Notes in Computer Science (pp. 59–68).
Springer International Publishing.
Sam, D., Velanganni, C., & Evangelin, T. E. (2016). A Vehicle Control System using
a Time Synchronized Hybrid VANET to Reduce Road Accidents caused by
Human Error. Vehicular Communications, 6, 17–28.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2020).
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based
Localization. International Journal of Computer Vision, 128 (2), 336–359.
Shapley, L. S. (1953). A Value for n-Person Games. In H. W. Kuhn & A. W. Tucker
(Eds.), Contributions to the Theory of Games (pp. 307–318). Princeton
University Press.
Shrikumar, A., Greenside, P., & Kundaje, A. (2017). Learning Important Features
through Propagating Activation Differences. Proceedings of the 34th
International Conference on Machine Learning (ICML), 3145–3153.
Singh, A., Thakur, N., & Sharma, A. (2016). A Review of Supervised Machine
Learning Algorithms. 2016 3rd International Conference on Computing for
Sustainable Global Development, 1310–1315.
Smilkov, D., Thorat, N., Kim, B., Viégas, F., & Wattenberg, M. (2017). SmoothGrad:
Removing Noise by Adding Noise. ArXiv, (arXiv:1706.03825v1 [cs.LG]).
Solovey, E. T., Zec, M., Garcia Perez, E. A., Reimer, B., & Mehler, B. (2014).
Classifying Driver Workload using Physiological and Driving Performance
Data: Two Field Studies. Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems (CHI), 4057–4066.
Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic Attribution for Deep
Networks. Proceedings of the 34th International Conference on Machine
Learning (ICML), 70, 3319–3328.
Tintarev, N., Rostami, S., & Smyth, B. (2018). Knowing the Unknown: Visualising
Consumption Blind-spots in Recommender Systems. Proceedings of the 33rd
Annual ACM Symposium on Applied Computing (SAC), 1396–1399.
Todd, P., & Benbasat, I. (1992). The Use of Information in Decision Making: An
Experimental Investigation of the Impact of Computer-Based Decision Aids.
MIS Quarterly, 16 (3), 373.
Tran, T.-N., Pham, D.-T., Alam, S., & Duong, V. (2020). Taxi-speed Prediction
by Spatio-temporal Graph-based Trajectory Representation and Its
Application. Proceedings of International Conference for Research in Air
Transportation (ICART).
Troncoso-García, A. R., Martínez-Ballesteros, M., Martínez-Álvarez, F., & Troncoso,
A. (2023). A New Approach based on Association Rules to Add
Explainability to Time Series Forecasting Models. Information Fusion, 94,
169–180.
Tzallas, A., Tsipouras, M., & Fotiadis, D. (2009). Epileptic Seizure Detection in
EEGs Using Time-Frequency Analysis. IEEE Transactions on Information
Technology in Biomedicine, 13 (5), 703–710.
van der Waa, J., Robeer, M., van Diggelen, J., Brinkhuis, M., & Neerincx, M. (2018).
Contrastive Explanations with Local Foil Trees. Proceedings of the Workshop
on Human Interpretability in Machine Learning (WHI) co-located with the
35th International Conference on Machine Learning (ICML).
Vapnik, V. (1991). Principles of Risk Minimization for Learning Theory. In J. Moody,
S. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information
Processing Systems. Morgan-Kaufmann.
Vilone, G., & Longo, L. (2020). Explainable Artificial Intelligence: A Systematic
Review. ArXiv, (arXiv:2006.00093v4 [cs.AI]).
Wachter, S., Mittelstadt, B., & Russell, C. (2018). Counterfactual Explanations
without Opening the Black Box: Automated Decisions and the GDPR.
Harvard Journal of Law & Technology, 31 (2), 841–887.
Wei, Z., Wu, C., Wang, X., Supratak, A., Wang, P., & Guo, Y. (2018). Using Support
Vector Machine on EEG for Advertisement Impact Assessment. Frontiers in
Neuroscience, 12.
World Medical Association. (2001). World Medical Association Declaration of
Helsinki: Ethical Principles for Medical Research Involving Human Subjects.
Bulletin of the World Health Organization, 79 (4), 373.
Xu, F., Uszkoreit, H., Du, Y., Fan, W., Zhao, D., & Zhu, J. (2019). Explainable AI:
A Brief Survey on History, Research Areas, Approaches and Challenges. In
J. Tang, M.-Y. Kan, D. Zhao, S. Li, & H. Zan (Eds.), Natural Language
Processing and Chinese Computing (pp. 563–574). Springer International
Publishing.
Yang, F., Du, M., & Hu, X. (2019). Evaluating Explanation Without Ground Truth
in Interpretable Machine Learning. ArXiv, (arXiv:1907.06831v2 [cs.LG]).
Yang, M., & Kim, B. (2019). Benchmarking Attribution Methods with Relative
Feature Importance. ArXiv, (arXiv:1907.09701 [cs.LG]).
Yang, W., Li, J., Xiong, C., & Hoi, S. C. H. (2022). MACE: An
Efficient Model-Agnostic Framework for Counterfactual Explanation. ArXiv,
(arXiv:2205.15540v1 [cs.AI]).
Yeaton, W. H., Langenbrunner, J. C., Smyth, J. M., & Wortman, P. M.
(1995). Exploratory Research Synthesis: Methodological Considerations
for Addressing Limitations in Data Quality. Evaluation & the Health
Professions, 18 (3), 283–303.
Young, M., Varpio, L., Uijtdehaage, S., & Paradis, E. (2020). The Spectrum
of Inductive and Deductive Research Approaches Using Quantitative and
Qualitative Data. Academic Medicine, 95 (7), 1122–1122.
Yu, B., Guo, Z., Asian, S., Wang, H., & Chen, G. (2019). Flight Delay Prediction
for Commercial Air Transport: A Deep Learning Approach. Transportation
Research Part E: Logistics and Transportation Review, 125, 203–221.
Zar, J. H. (1972). Significance Testing of the Spearman Rank Correlation Coefficient.
Journal of the American Statistical Association, 67 (339), 578–580.
Zhang, Z., & Jung, C. (2021). GBDT-MO: Gradient-Boosted Decision Trees for
Multiple Outputs. IEEE Transactions on Neural Networks and Learning
Systems, 32 (7), 3156–3167.
Zhou, F., Alsaid, A., Blommer, M., Curry, R., Swaminathan, R., Kochhar,
D., Talamonti, W., & Tijerina, L. (2022). Predicting Driver Fatigue in
Monotonous Automated Driving with Explanation using GPBoost and
SHAP. International Journal of Human–Computer Interaction, 38 (8),
719–729.
Zhou, J., Gandomi, A. H., Chen, F., & Holzinger, A. (2021). Evaluating the Quality
of Machine Learning Explanations: A Survey on Methods and Metrics.
Electronics, 10 (5), 593.
Zhou, Y., Booth, S., Ribeiro, M. T., & Shah, J. (2022). Do Feature Attribution
Methods Correctly Attribute Features? Proceedings of the 36th AAAI
Conference on Artificial Intelligence, 36(9), 9623–9633.

Part II

Included Papers


Paper A1

Deep Learning for Automatic EEG Feature Extraction: An Application in
Drivers' Mental Workload Classification

Islam, M. R., Barua, S., Ahmed, M. U., Begum, S. & Di Flumeri, G.


Paper A1

Deep Learning for Automatic EEG Feature Extraction: An Application in
Drivers' Mental Workload Classification†

Abstract
In the pursuit of reducing traffic accidents, drivers' mental workload (MWL)
is considered one of the vital aspects. To measure MWL in different driving
situations, the electroencephalography (EEG) of drivers has been studied
intensely. However, in the literature, mostly manual analytic methods are
applied to extract and select features from the EEG signals to quantify
drivers' MWL. The amount of time and effort required by the prevailing
feature extraction techniques motivates the need for automated alternatives.
This work investigates a deep learning (DL) algorithm to extract and select
features from EEG signals recorded during naturalistic driving situations.
To compare the DL-based and traditional feature extraction techniques, a
number of classifiers have been deployed. Results show that the highest
value of the area under the receiver operating characteristic curve
(AUC-ROC) is 0.94, achieved using the features extracted by a convolutional
neural network autoencoder (CNN-AE) together with a support vector machine,
whereas, using the features extracted by the traditional method, the highest
AUC-ROC is 0.78, obtained with a multi-layer perceptron. Thus, the outcome of
this study shows that automatic feature extraction based on a CNN-AE can
outperform manual techniques in terms of classification accuracy.
† © Springer Nature Switzerland AG 2019. Reprinted, with permission, from Islam, M. R.,

Barua, S., Ahmed, M. U., Begum, S., & Di Flumeri, G. (2019). Deep Learning for Automatic
EEG Feature Extraction: An Application in Drivers’ Mental Workload Classification. In L. Longo
& M. C. Leva (Eds.), Human Mental Workload: Models and Applications. H-WORKLOAD
2019. Communications in Computer and Information Science (pp. 121–135). Springer Nature
Switzerland.

Keywords: Autoencoder · Convolutional Neural Networks · Electroencephalography · Feature Extraction · Mental Workload.

A1.1 Introduction

Drivers' mental workload (MWL) plays a crucial role in driving performance.
Due to excessive MWL, drivers undergo a complex state of fatigue which manifests
as a lack of alertness and reduced performance (Kar et al., 2010). Consequently,
drivers are prone to committing more mistakes under increased MWL. It has been
revealed that human error is the prime cause of around 72% of road accidents per
year (Thomas et al., 2013). Hence, increased MWL during driving can produce errors
leading to fatal accidents. Driving is a complex and dynamic activity involving
secondary tasks, i.e., simultaneous cognitive, visual and spatial tasks. Diverse
secondary tasks performed alongside natural driving, in addition to different road
environments, increase the MWL of drivers, which leads to errors in traffic
situations (Kim et al., 2018). The alarming number of traffic accidents due to
increased MWL underscores the need to determine drivers' MWL efficiently. Several
research works have identified mechanisms to measure drivers' MWL while driving,
both in simulated and real environments (Brookhuis & de Waard, 2010; Kar et al.,
2010; Almahasneh et al., 2015). Methods of measuring MWL can be clustered into
three main classes: i) subjective measures, e.g., the NASA Task Load Index
(NASA-TLX), workload profile (WP), etc.; ii) task performance measures, e.g., time
to complete a task, reaction time to a secondary task, etc.; and iii) physiological
measures, e.g., electroencephalography (EEG), heart rate measures, etc. (Moustafa
et al., 2017). The latter, with respect to traditional subjective measures, are
intrinsically objective and can be gathered along with the task without requiring
any additional action from the user. Moreover, unlike performance measures,
physiological measures do not require secondary tasks and are generally able to
predict mental impairment, whereas performance typically degrades only when the
user is already overloaded (Begum & Barua, 2013; Aricò et al., 2016; Aricò et al.,
2017). Due to the vast availability of measuring technology, its portability and
its capability of indicating neural activation clearly, the major concern of this
work is physiological measures, specifically EEG. With the increase of data storage
and computation power, data-driven machine learning (ML) techniques have become a
popular means of quantifying MWL from EEG signals.
Relevant features extracted from the EEG signals are a prerequisite for
quantifying MWL. Currently, feature extraction is performed using theory-driven
manual analytic methods that demand substantial time and effort (Tzallas et al.,
2009; Ahmad et al., 2014). The proposed work aims at exploring a novel deep
learning model for automated feature extraction from EEG signals to reduce this
time, effort and complexity. From the literature study, it has been found that
several ML techniques have been applied to extract features from EEG automatically,
but a proper comparative study of traditional and automatic feature extraction
methods has not been put forward. In this paper, a deep learning model, the
convolutional neural network autoencoder (CNN-AE), is proposed for automatic
feature extraction. The automatically extracted features are evaluated with several
classification algorithms and compared with the manual feature extraction technique
for comparative analysis and feature optimisation.
The rest of the paper is organised as follows: the background of the research
domain and several related works are described in Section A1.2. Section A1.3
contains a detailed description of the experimental setup, data collection,
analysis, feature extraction and classification techniques. Results and discussions
are provided in Sections A1.4 and A1.5, respectively. Finally, the limitations and
future directions of this work are discussed along with the conclusion in Section
A1.6.

A1.2 Background and Related Works

Literature indicates MWL as an important aspect of assessing human performance
(Charles & Nixon, 2019), whereas driving is a complex task performed by humans in
association with several subsidiary tasks. Assessment of drivers' performance by
quantifying MWL has been carried out for decades. There are several means of
measuring mental workload, but physiological measures are often chosen owing to
inexpensive and compact technologies (Guzik & Malik, 2016). Physiological measures
include respiration, electrocardiac activities, skin conductance, blood pressure,
ocular measures, brain measures, etc. Recently, Charles and Nixon noted that brain
measures in the form of EEG have been used for measuring MWL in most research
works (Barua et al., 2017; Charles & Nixon, 2019). Moreover, several studies have
demonstrated a strong correlation between MWL and EEG features in both the time and
frequency domains. Features like the theta and alpha wave rhythms of the EEG signal
over the frontal and parietal sites, respectively, reflect MWL variations
significantly (Aricò et al., 2017; Di Flumeri et al., 2018, 2019).
Since the exploration of EEG signals as a tool for measuring MWL, conventional
feature extraction techniques, including statistical analysis and signal
processing, have been in practice. Ahmad et al. (2014) proposed a non-linear
approach to feature extraction using fractal dimensions to determine different
brain conditions of participants. In classifying motor imagery signals, Sherwani
et al. (2016) used discrete wavelet transform analysis to extract features from
EEG signals, whereas Sakai (2013) used non-negative matrix factorisation. Several
techniques based on time and frequency domain analysis have been proposed for
feature extraction (Begum et al., 2017; Barua, 2019). Tzallas et al. (2009)
proposed a method of extracting features from the power spectral density (PSD) of
EEG segments using Fourier transformation for epileptic seizure detection.
Individual alpha frequency (IAF) analysis has been adopted in several studies to
adjust features of EEG signals (Corcoran et al., 2018). Recently, Wen and Zhang
(2017) proposed a genetic algorithm based frequency-domain feature search technique
for multi-class epilepsy classification.
A considerable number of works have been presented on classifying MWL from EEG
signal analysis, where different ML algorithms were deployed after extracting
features analytically. Several ML algorithms are found in the literature for
classifying MWL, such as the support vector machine (SVM) (Zarjam et al., 2015;
Saha et al., 2018), k-nearest neighbours (k-NN) (Saha et al., 2018), fuzzy c-means
clustering (Das et al., 2013) and the multi-layer perceptron (MLP) (Zarjam et al.,
2011; Saha et al., 2018).
Extracting features automatically from EEG signals is a relatively new field
of research. Researchers have deployed a diverse range of deep learning (DL)
algorithms, commonly termed autoencoders (AE), to extract features from EEG signals
both with and without preprocessing. Recently, Wen and Zhang (2018) used a deep
convolutional neural network (CNN) for unsupervised feature learning from EEG
signals after applying data normalisation for preprocessing; to assess the
performance of their proposed model, several classification algorithms were used
to classify epilepsy patients. In several works, authors used the stacked denoising
autoencoder (SDAE) (Yin & Zhang, 2016), long short-term memory (LSTM) (Manawadu
et al., 2018) and the deep belief network (DBN) (Li et al., 2015) for feature
extraction after applying PSD for preprocessing. Guo et al. (2011) extracted
features by deploying genetic programming for classifying epilepsy with a k-NN
classifier; in this approach, discrete wavelet transformation (DWT) was used for
preprocessing of the raw EEG signals. Saha et al. (2018) investigated two different
DL models, SDAE and LSTM, for extracting features from EEG signals without any
preprocessing; afterwards, an MLP was used to classify the cognitive load of
participants who were asked to perform a learning task. Both Ayata et al. (2017)
and Almogbel et al. (2018) used a CNN autoencoder (CNN-AE) to extract features from
EEG signals for classifying arousal and MWL among participants.
Evidently, feature extraction from EEG signals using a CNN-AE has been a
popular technique among researchers for classification tasks in the epilepsy and
MWL domains. Moreover, several classification algorithms were further used to
measure the effectiveness of the automatically extracted features. However, to our
knowledge, none of these works presented a comparative study of manual feature
extraction and automatic feature extraction using DL techniques to compare
workload classification performance, particularly for driving situations.

A1.3 Materials and Methods

A1.3.1 Experimental Setup


The experiment was performed on a route going through urban areas at the periphery
of Bologna, Italy. There were 20 participants in this experiment. All the participants
were students of the University of Bologna, Italy, with a mean age of 24 (±1.8) years,
and they had been licensed for about 5.9 (±1) years on average. The participants were
recruited for the study on a voluntary basis. Only male participants were selected in
order to conduct the study with a homogeneous experimental group. The experiment was
conducted following the principles defined in the Declaration of Helsinki of 1975
(revised in 2000). Informed consent and authorisation to use the recorded data were
signed after a proper description of the experiment was provided to the participants.
During the experiment, a participant had to drive a car, a Fiat 500L 1.3 Mjt with a
diesel engine and manual transmission, along the route illustrated in Figure A1.1.
In particular, the route consisted of three laps of a circuit about 2.5 kilometres
long, covered in daylight. The circuit was designed on the basis of evidence put
forward in the scientific literature (Verwey, 2000; Paxion et al., 2014). In the
designed circuit, there were two segments of interest in terms of road complexity
and cognitive demand: i) Easy, a straight secondary road serving a residential area,
with an intersection halfway with the right-of-way; and ii) Hard, a major road with
two roundabouts, three lanes and high traffic capacity, serving a commercial area.
This factor is termed "ROAD" in the following sections. Furthermore, a participant
had to drive the circuit twice in a day, once during rush-hour traffic and once
during off-peak hours. This factor is termed "HOUR", with two conditions, Normal and
Rush, and was designed following the General Plan of Urban Traffic of Bologna, Italy.
Table A1.1 reports the traffic flow intensity considered to design the two
experimental conditions in this study.

Figure A1.1: The experimental circuit is about 2.5 kilometres long along Bologna
roads. The red and yellow lines along the route indicate the Hard and Easy segments
of the road, respectively. The green arrow in the bottom-right corner shows the
direction of driving from the starting and finishing point. © Springer Nature
Switzerland AG 2019.

Table A1.1: Traffic flow intensity in the experimental area during a day retrieved
from General Plan of Urban Traffic of Bologna, Italy. © Springer Nature Switzerland
AG 2019.

Transits      Total 14h (6÷20)    Rush Hour, Morning (12:30–13:30)    Rush Hour, Afternoon (16:30–17:30)    Normal Hour, 12h
Total         19385               2024                                2066                                  15295
Frequency     –                   2024                                2066                                  12746

At the end of every experimental procedure, consisting of a driving task of three
laps performed twice, during rush and normal hours, each participant was properly
debriefed. The order of the rush and normal hour conditions was randomised among
the participants to avoid any order effect (Kirk, 2012). There were two segments in
each lap, easy and hard, referring to road complexity and task difficulty. During
the whole experimental protocol, physiological data in terms of brain activity
through EEG were recorded. A detailed description of the recording of the EEG
signals is given in the following sections. Notably, two recent studies by
Di Flumeri et al. (2018, 2019) followed the same experimental procedure.

A1.3.2 Data Collection and Processing


EEG signals were recorded using the digital monitoring BEmicro system provided
by EBNeuro, Italy. Twelve EEG channels (FPz, AF3, AF4, F3, Fz, F4, P3, P7, Pz, P4,
P8 and POz) were used to collect the EEG signals. The channels were placed on the
scalp according to the 10–20 International System. The sampling frequency for
recording the EEG signals was 256 Hz. All the electrodes were referenced to both
earlobes and grounded to the Cz site. Impedance was kept below 20 kΩ. During the
experiment, no signal conditioning was done; all the EEG signal processing was
performed offline. Events were recorded along with the EEG signals to associate
specific signals with the different road and hour conditions.
Raw EEG signals were cropped with reference to the recorded events: three laps
for both Normal and Rush hours, including the Easy and Hard conditions. Furthermore,
two ROAD-HOUR driving situations, Easy-Normal and Hard-Rush, were selected for the
classification of MWL, since the literature suggests that these conditions demand
low and high MWL, respectively (Di Flumeri et al., 2018). Data from all the laps
driven by the participants in the Easy-Normal and Hard-Rush conditions were used
for further analysis. The EEG signals were sliced into 2 s segments (epoch length)
using a sliding window with a stride of 0.125 s, keeping an overlap of 1.875 s
between two consecutive epochs. The windowing was performed to obtain a higher
number of observations in comparison with the number of variables while respecting
the condition of stationarity of the EEG signals (Elul, 1969). Specific procedures
of the EEGLAB toolbox (Delorme & Makeig, 2004) were used for slicing the recorded
EEG signals. To remove different artefacts, i.e., ocular and muscle movements,
etc., from the raw EEG signals, the ARTE algorithm by Barua et al. (2018) was used.
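To make the windowing step concrete, the following is a minimal NumPy sketch of
slicing a continuous multi-channel recording into 2 s epochs with a 0.125 s stride
(i.e., 1.875 s overlap). The 256 Hz sampling rate and the twelve channels follow the
setup above, but the function is only illustrative and is not the EEGLAB procedure
actually used.

import numpy as np

def slice_epochs(signal, fs=256, epoch_len=2.0, stride=0.125):
    """Slice a (channels, samples) array into overlapping epochs."""
    win = int(epoch_len * fs)   # 512 samples per epoch at 256 Hz
    hop = int(stride * fs)      # 32-sample hop between epoch starts
    n_epochs = 1 + (signal.shape[1] - win) // hop
    return np.stack([signal[:, i * hop:i * hop + win] for i in range(n_epochs)])

# Example: 60 s of synthetic 12-channel data sliced into 2 s epochs.
eeg = np.random.default_rng(0).normal(size=(12, 60 * 256))
epochs = slice_epochs(eeg)
print(epochs.shape)  # (465, 12, 512)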

A1.3.3 Feature Extraction


Two different types of feature extraction techniques, i.e., manual and automatic,
were investigated in this study. In both methods, the artefact-handled EEG signals
were used. Firstly, a technique following traditional practice, with filtering and
signal processing methods, was used; here, 25 relevant features were retrieved. For
the other approach, DL was used to extract features from the EEG signals; here, 284
features were initially extracted by the CNN-AE, and after analysing the feature
importance based on a random forest (RF) classifier, 124 features were retained for
further tasks. Table A1.2 reports the number of relevant features obtained with the
different techniques, followed by a description of the two feature extraction
approaches.

Table A1.2: Number of features selected from different techniques. © Springer
Nature Switzerland AG 2019.

Feature Extraction    Number of Features
Traditional           25
Deep Learning         124

Traditional Approach. The feature extraction process performed in this work
is mostly motivated by the work of Di Flumeri et al. (2018). Firstly, the PSD was
calculated for each channel of each windowed epoch of the ARTE-cleaned EEG signals
mentioned in Section A1.3.2. To calculate the PSD from the EEG signals, Welch's
method (Solomon, 1991) with the Blackman-Harris window function was applied on
epochs of the same length (2 s, 0.5 Hz frequency resolution). In particular, only
the theta band (5–8 Hz) over the EEG frontal channels and the alpha band (8–11 Hz)
over the EEG parietal channels were considered as variables for the mental workload
evaluation (Aricò et al., 2017). Then, to define the EEG frequency bands of
interest, IAF values were estimated with the algorithm developed by Corcoran et al.
(2018). Figure A1.2 illustrates the final feature vector generation for each
observation following the aforementioned sequence of steps.
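As an illustration of this step, the sketch below estimates band power with Welch's
method and a Blackman-Harris window using SciPy. The fixed 5–8 Hz theta and 8–11 Hz
alpha limits are used here for simplicity, whereas the actual analysis adjusted the
bands to each participant's IAF; the synthetic epoch is a stand-in for a cleaned EEG
segment.

import numpy as np
from scipy.signal import welch
from scipy.integrate import trapezoid

def band_power(epoch, fs=256, band=(5.0, 8.0)):
    """Integrated PSD power of a 1-D epoch in the given frequency band."""
    # nperseg equal to the epoch length yields the 0.5 Hz resolution mentioned above.
    freqs, psd = welch(epoch, fs=fs, window="blackmanharris", nperseg=len(epoch))
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return trapezoid(psd[mask], freqs[mask])

# Example: theta power of a synthetic 2 s frontal-channel epoch.
t = np.arange(0, 2, 1 / 256)
epoch = np.sin(2 * np.pi * 6 * t) + 0.2 * np.random.default_rng(0).normal(size=t.size)
print(band_power(epoch, band=(5.0, 8.0)))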

Figure A1.2: Steps in the traditional feature extraction technique. © Springer Nature
Switzerland AG 2019.

Figure A1.3: Network architecture of the CNN-AE for feature extraction. © Springer
Nature Switzerland AG 2019.

Deep Learning Approach. The CNN-AE architecture used for automatic feature
extraction is shown in Figure A1.3. The whole network is divided into two parts:
i) an encoder and ii) a decoder. The encoder, comprised of a number of convolutional
layers associated with pooling layers, finds deep hidden features in the original
signal. The decoder, on the other hand, uses several deconvolutional layers to
reconstruct the signal from these features. The quality of the signal reconstructed
by the decoder is used to assess the performance of the encoder, and the whole model
is trained on the basis of this compression and reconstruction. The encoder developed
in this study consists of four convolutional layers and four max-pooling layers. The
decoder is designed in the inverse order of the encoder; it contains five
convolutional layers and four upsampling layers facilitating the depooling. Zero
padding, batch normalisation and the ReLU activation function were used in each of
the layers. The developed CNN-AE utilised RMSprop optimisation with a learning rate
of 0.002 and binary cross-entropy as the loss function. After a successful learning
procedure, the CNN-AE extracted 284 features from the experimental EEG signals.
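A minimal Keras sketch of such a 1D convolutional autoencoder is given below,
loosely following the description above (four convolution-pooling blocks in the
encoder, five convolutional and four upsampling layers in the decoder, RMSprop with
a learning rate of 0.002 and binary cross-entropy loss). The layer widths, kernel
sizes and input length are illustrative assumptions rather than the exact
architecture used in the paper.

from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_ae(n_samples=512, n_channels=12):
    inp = keras.Input(shape=(n_samples, n_channels))
    x = inp
    # Encoder: four Conv1D + MaxPooling1D blocks with batch normalisation and ReLU.
    for filters in (32, 16, 8, 4):
        x = layers.Conv1D(filters, 5, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling1D(2, padding="same")(x)
    encoded = x  # latent feature maps, flattened below to form the feature vector
    # Decoder: mirrored Conv1D + UpSampling1D blocks plus a final reconstruction layer.
    for filters in (4, 8, 16, 32):
        x = layers.Conv1D(filters, 5, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.UpSampling1D(2)(x)
    out = layers.Conv1D(n_channels, 5, padding="same", activation="sigmoid")(x)
    autoencoder = keras.Model(inp, out)
    encoder = keras.Model(inp, layers.Flatten()(encoded))
    autoencoder.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.002),
                        loss="binary_crossentropy")
    return autoencoder, encoder

# Usage: fit the autoencoder on min-max scaled epochs of shape
# (n_epochs, n_samples, n_channels) and use encoder.predict(...) as the
# automatically extracted feature set.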

A1.3.4 Classification of MWL


After extracting features using the two different methods, several classifiers were
deployed to classify MWL. Table A1.3 lists the classifiers and the values of their
prominent parameters.

Table A1.3: Parameters used in different classifiers. © Springer Nature Switzerland
AG 2019.

Classifier                        Parameter Details
Support Vector Machine (SVM)      kernel = 'rbf'
k-Nearest Neighbour (kNN)         k = 5
Random Forest (RF)                max_depth = 5, n_estimators = 100
Multi-layer Perceptron (MLP)      hidden_layer = 100, activation = relu

Figure A1.4: Variation in classification accuracy with respect to the change of the
threshold on feature importance values. The highest average accuracy, 87.30%, was
found for a threshold of 0.003 (point marked with a red dot). © Springer Nature
Switzerland AG 2019.

Before classifying MWL, feature importance was calculated using the RF classifier in
order to reduce the dimension of the feature set further. Different numbers of
features were selected from the 284 features depending on different threshold values
and used for classifying MWL with the SVM classifier on the training data set, and a
clear variation in accuracy was observed. Finally, by imposing a threshold of 0.003
on feature importance, 124 relevant features were finalised, which reduced the
feature set by more than half while increasing the accuracy. For both classifiers,
the parameters given in Table A1.3 were used. Figure A1.4 illustrates the change in
accuracy for the different threshold values of feature importance used to select
features for classification.
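The selection step can be sketched as follows: rank the CNN-AE features by random
forest importance, keep those above a chosen threshold and evaluate an RBF SVM on
the reduced set. The classifier parameters follow Table A1.3, while the synthetic X
and y are stand-ins for the real 284-dimensional feature set, and cross-validated
accuracy is used here merely as a convenient proxy for the training-set evaluation
described above.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 284))      # placeholder CNN-AE features
y = rng.integers(0, 2, size=400)     # placeholder low/high MWL labels

def accuracy_for_threshold(threshold):
    rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0).fit(X, y)
    keep = rf.feature_importances_ >= threshold   # boolean mask over the features
    if not keep.any():
        return 0, float("nan")
    return keep.sum(), cross_val_score(SVC(kernel="rbf"), X[:, keep], y, cv=5).mean()

# Sweep thresholds as in Figure A1.4 and keep the best-performing one.
for t in (0.001, 0.002, 0.003, 0.004):
    n_kept, acc = accuracy_for_threshold(t)
    print(f"threshold={t}: {n_kept} features, accuracy={acc:.3f}")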

A1.4 Result and Evaluation

All the observations, with the relevant features from the EEG signals, were divided
into training and testing sets containing 80% and 20% of the data, respectively. The
training set was used to train the models, and the testing set was used to validate
the accuracy of MWL classification. The common classifiers stated in Table A1.3 were
deployed to verify the effectiveness of the features obtained by the traditional
method and by the CNN-AE. To measure classification performance, the average overall
accuracy, the balanced classification rate (BCR, or balanced accuracy) and the F1
score were calculated for each classifier and each feature extraction method. Tables
A1.4 and A1.5 contain the performance measures of classification with the
traditionally extracted features and the CNN-AE extracted features, respectively.
It was observed that the features extracted by the CNN-AE produced better
performance measures for all the classifiers; in particular, SVM classified MWL with
the highest overall accuracy of 87%.
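For reference, the reported measures can be computed from a confusion matrix as in
the sketch below, where balanced accuracy corresponds to the BCR in the tables; the
label vectors are placeholders.

import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])   # placeholder labels (0 = low, 1 = high MWL)
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])   # placeholder classifier output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

print("accuracy:", accuracy_score(y_true, y_pred))
print("BCR (balanced accuracy):", balanced_accuracy_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("sensitivity:", sensitivity, "specificity:", specificity)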
Table A1.4: Average performance measures of classifiers applied on traditionally
extracted features. © Springer Nature Switzerland AG 2019.

Classifiers    Accuracy    BCR       F1 score
SVM            0.5388      0.5388    0.5146
kNN            0.6420      0.6420    0.6486
RF             0.6414      0.6414    0.6442
MLP            0.7083      0.7083    0.7151

Table A1.5: Average performance measures of classifiers applied on features extracted
by CNN-AE. © Springer Nature Switzerland AG 2019.

Classifiers    Accuracy    BCR       F1 score
SVM            0.8700      0.8700    0.8730
kNN            0.7737      0.7737    0.7912
RF             0.8049      0.8049    0.8197
MLP            0.8504      0.8504    0.8527

To investigate the performance of the classifiers further, specificity and
sensitivity were calculated and are illustrated in Figure A1.5. It is clearly visible
from the figure that both scores for the CNN-AE features are higher than those for
the traditionally extracted features.

Figure A1.5: MWL classification results in terms of Sensitivity and Specificity.
© Springer Nature Switzerland AG 2019.

To establish the validity of the proposed model, 10-fold and
leave-one-participant-out cross validations were performed. The average AUC curves
for the cross validations are illustrated in Figures A1.6 and A1.7, where the SVM
classifier has the highest AUC in both. For the 10-fold cross validation, all the
observations were divided into 10 segments; in each iteration, one segment was used
for testing a model built on the other segments as the training set. In the
leave-one-participant-out cross validation, for each participant of the experiment,
the observations from that participant were used for testing the model built on the
observations from all other participants, which were considered as training data.
For both cross validations, the AUC values for the CNN-AE extracted features in
classification are notably higher than the values for the traditionally extracted
features.
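The two validation schemes can be sketched with scikit-learn as follows; the feature
matrix, labels and per-epoch participant identifiers are synthetic placeholders, and
the AUC scorer relies on the SVM decision function.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 124))          # placeholder selected CNN-AE features
y = rng.integers(0, 2, size=400)         # placeholder low/high MWL labels
groups = np.repeat(np.arange(20), 20)    # placeholder participant id per epoch

clf = SVC(kernel="rbf")

# 10-fold cross validation over all epochs.
auc_10fold = cross_val_score(clf, X, y, scoring="roc_auc",
                             cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))

# Leave-one-participant-out: each fold tests on all epochs of one held-out participant.
auc_lopo = cross_val_score(clf, X, y, groups=groups, scoring="roc_auc",
                           cv=LeaveOneGroupOut())

print(auc_10fold.mean(), auc_lopo.mean())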

Figure A1.6: AUC-ROC curves for different classifiers with features extracted by
traditional methods and CNN-AE where models were trained using 10-fold cross
validation. © Springer Nature Switzerland AG 2019.


Figure A1.7: AUC-ROC curves for different classifiers with features extracted by
traditional methods and CNN-AE where models were trained using leave-one-out
(participant) cross validation. © Springer Nature Switzerland AG 2019.

A1.5 Discussion

In this study, traditional and CNN-AE based EEG feature extraction methods were
comparatively investigated using four well-established classifiers: SVM, kNN, RF and
MLP. Between the two feature extraction techniques, the CNN-AE enabled the
classifiers to achieve higher classification accuracy and better values of the other
performance measures. Initially, the number of features extracted by the CNN-AE was
substantially higher than the number of features extracted through the traditional
method, but with the feature selection mechanism the feature set was reduced to
approximately half, resulting in improved accuracy measures for all classifiers.
The different performance measures demonstrated in Section A1.4 show that SVM
achieves the highest accuracy in classifying MWL from EEG signals irrespective of
the feature extraction technique.
Regarding the classifier models used for MWL classification in related works,
many factors affect the performance of the model. Generally, if there is a clear
correlation between the characteristics of the data and the class labels, the
deployed classifier achieves higher prediction accuracy. However, in the case of MWL
classification of drivers driving in real life or in a simulator, the probability of
noise being recorded with the EEG signals is quite high due to eye movements,
power-line signals, miscellaneous interference, etc.; in practice, these noises are
termed artefacts. In traditional feature extraction methods, removing these
artefacts from the data, along with handling inter- and intra-individual
variability, requires substantial manual effort and processing. Owing to the
characteristics of deep learning, its layers can find hidden features in the data
that are responsible for the assigned labels. From the results of this study, it can
be established that a CNN-AE, or a similar deep learning mechanism, can produce a
feature set from EEG signals that is equivalent to or better than a manually
extracted feature set, with less effort and while setting aside the preprocessing
and artefact handling tasks. Initially, the proposed CNN-AE produced an extensive
set of features; an intuitive investigation of feature selection with the RF
classifier and a threshold imposed on feature importance produced a considerably
shorter feature vector with higher classification accuracy. Further investigation of
feature selection in this domain can produce a more robust set of relevant features.
The data recorded from the experimental protocol were balanced in terms of class
labels: each of the participants drove once for each ROAD and HOUR condition, and
the recorded EEG signals formed the initial labelled, balanced data set. For further
investigation, the raw EEG signals were segmented into overlapping epochs to
increase the number of observations while keeping the core characteristics of the
data. This operation facilitated this data-driven study by increasing the amount of
data, with a trade-off in class balance: due to the uneven driving durations among
the participants, the number of windowed epochs varied from participant to
participant as well as across the study factors, rendering the data imbalanced. The
performance measures reported in Section A1.4 were therefore chosen from the
measures prescribed for imbalanced data by Tharwat (2021).

A1.6 Conclusion

This paper presents a new hybrid approach for automatic feature extraction from EEG
signals, demonstrated on MWL classification. The main contribution of this paper is
threefold: i) a CNN-based method is used to extract features automatically from
artefact-handled EEG signals, ii) RF is used for feature selection and iii) several
machine learning algorithms are used to classify drivers' mental workload on the
CNN-based feature sets. This new hybrid approach is compared with the traditional
feature extraction approach considering four machine learning classifiers, i.e.,
SVM, kNN, RF and MLP. According to the outcomes of both the 10-fold and the
leave-one-participant-out cross validations, SVM outperforms the other classifiers
with the CNN-AE extracted features. One advantage of the CNN-AE for feature
extraction is that it works directly on the artefact-handled data sets, i.e.,
additional signal processing, individual feature extraction, etc. are not needed,
thus reducing the time spent on manual work. More experimental work with a large and
heterogeneous data set is planned as future work to increase the performance of the
proposed method and to extract features directly from raw EEG signals. Moreover,
classifying MWL in real time using the proposed approach and suggesting external
actions to mitigate road casualties is the final goal of the planned research.

Acknowledgement. This article is based on work performed in the project
BrainSafeDrive. The authors would like to acknowledge the Vetenskapsrådet (The
Swedish Research Council) for supporting the BrainSafeDrive project. The authors are
also very thankful to Prof. Fabio Babiloni of BrainSigns. They would also like to
acknowledge the extraordinary support of Dr. Gianluca Borghini and Dr. Pietro Aricò
in the experimental design and data collection. Further, the authors would like to
acknowledge the project students, Casper Adlerteg, Dalibor Colic and Joel Öhrling,
for their contribution to testing the concept.


Bibliography

Ahmad, R. F., Malik, A. S., Kamel, N., Amin, H., Zafar, R., Qayyum, A., &
Reza, F. (2014). Discriminating the Different Human Brain States with
EEG Signals using Fractal Dimension- A Nonlinear Approach. 2014 IEEE
International Conference on Smart Instrumentation, Measurement and
Applications (ICSIMA), 1–5.
Almahasneh, H., Kamel, N., Walter, N., & Malik, A. S. (2015). EEG-based Brain
Functional Connectivity during Distracted Driving. 2015 IEEE International
Conference on Signal and Image Processing Applications (ICSIPA), 274–277.
Almogbel, M. A., Dang, A. H., & Kameyama, W. (2018). EEG-Signals Based
Cognitive Workload Detection of Vehicle Driver using Deep Learning.
2018 20th International Conference on Advanced Communication Technology
(ICACT), 256–259.
Aricò, P., Borghini, G., Di Flumeri, G., Colosimo, A., Pozzi, S., & Babiloni, F. (2016).
A Passive Brain–Computer Interface Application for the Mental Workload
Assessment on Professional Air Traffic Controllers during Realistic Air Traffic
Control Tasks. In D. Coyle (Ed.), Progress in Brain Research (pp. 295–328).
Elsevier.
Aricò, P., Borghini, G., Di Flumeri, G., Sciaraffa, N., Colosimo, A., &
Babiloni, F. (2017). Passive BCI in Operational Environments: Insights,
Recent Advances, and Future Trends. IEEE Transactions on Biomedical
Engineering, 64 (7), 1431–1436.
Ayata, D., Yaslan, Y., & Kamasak, M. (2017). Multi Channel Brain EEG Signals
based Emotional Arousal Classification with Unsupervised Feature Learning
using Autoencoders. 2017 25th Signal Processing and Communications
Applications Conference (SIU), 1–4.
Barua, S. (2019). Multivariate Data Analytics to Identify Driver’s Sleepiness,
Cognitive Load, and Stress (PhD Thesis). Mälardalen University.
Barua, S., Ahmed, M. U., Ahlstrom, C., Begum, S., & Funk, P. (2018). Automated
EEG Artifact Handling With Application in Driver Monitoring. IEEE
Journal of Biomedical and Health Informatics, 22 (5), 1350–1361.
Barua, S., Ahmed, M. U., & Begum, S. (2017). Classifying Drivers’ Cognitive Load
Using EEG Signals. In B. Blobel & W. Goossen (Eds.), Proceedings of the
14th International Conference on Wearable Micro and Nano Technologies for
Personalized Health (pHealth) (pp. 99–106). IOS Press.
Begum, S., & Barua, S. (2013). EEG Sensor Based Classification for Assessing
Psychological Stress. In B. Blobel, P. Pharow, & L. Parv (Eds.), Proceedings
of the 10th International Conference on Wearable Micro and Nano
Technologies for Personalized Health (pHealth) (pp. 83–88). IOS Press.
Begum, S., Barua, S., & Ahmed, M. U. (2017). In-Vehicle Stress Monitoring
Based on EEG Signal. International Journal of Engineering Research and
Applications, 07 (07), 55–71.
Brookhuis, K. A., & de Waard, D. (2010). Monitoring Drivers’ Mental Workload
in Driving Simulators using Physiological Measures. Accident Analysis &
Prevention, 42 (3), 898–903.
Charles, R. L., & Nixon, J. (2019). Measuring Mental Workload using Physiological
Measures: A Systematic Review. Applied Ergonomics, 74, 221–232.

Corcoran, A. W., Alday, P. M., Schlesewsky, M., & Bornkessel-Schlesewsky, I. (2018).
Toward a Reliable, Automated Method of Individual Alpha Frequency (IAF)
Quantification. Psychophysiology, 55 (7), e13064.
Das, D., Chatterjee, D., & Sinha, A. (2013). Unsupervised Approach for Measurement
of Cognitive Load using EEG Signals. 13th IEEE International Conference
on BioInformatics and BioEngineering (BIBE), 1–6.
Delorme, A., & Makeig, S. (2004). EEGLAB: An Open Source Toolbox for Analysis
of Single-trial EEG Dynamics including Independent Component Analysis.
Journal of Neuroscience Methods, 134 (1), 9–21.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2018). EEG-Based
Mental Workload Neurometric to Evaluate the Impact of Different Traffic
and Road Conditions in Real Driving Settings. Frontiers in Human
Neuroscience, 12, 509.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2019). EEG-Based
Mental Workload Assessment During Real Driving: A Taxonomic Tool for
Neuroergonomics in Highly Automated Environments. Neuroergonomics,
121–126.
Elul, R. (1969). Gaussian Behavior of the Electroencephalogram: Changes during
Performance of Mental Task. Science, 164 (3877), 328–331.
Guo, L., Rivero, D., Dorado, J., Munteanu, C. R., & Pazos, A. (2011). Automatic
Feature Extraction using Genetic Programming: An Application to Epileptic
EEG Classification. Expert Systems with Applications, 38 (8), 10425–10436.
Guzik, P., & Malik, M. (2016). ECG by Mobile Technologies. Journal of
Electrocardiology, 49 (6), 894–901.
Kar, S., Bhagat, M., & Routray, A. (2010). EEG Signal Analysis for the Assessment
and Quantification of Driver’s Fatigue. Transportation Research Part F:
Traffic Psychology and Behaviour, 13 (5), 297–306.
Kim, H., Yoon, D., Lee, S.-J., Kim, W., & Park, C. H. (2018). A Study on the
Cognitive Workload Characteristics according to the Driving Behavior in
the Urban Road. 2018 International Conference on Electronics, Information,
and Communication (ICEIC), 1–4.
Kirk, R. E. (2012). Experimental Design. In I. B. Weiner, J. Schinka, & W. F. Velicer
(Eds.), Handbook of Psychology (2nd ed.). Wiley.
Li, X., Zhang, P., Song, D., Yu, G., Hou, Y., & Hu, B. (2015). EEG Based Emotion
Identification Using Unsupervised Deep Feature Learning. Proceedings of the
SIGIR2015 Workshop on Neuro-Physiological Methods in IR Research.
Manawadu, U. E., Kawano, T., Murata, S., Kamezaki, M., Muramatsu, J., & Sugano,
S. (2018). Multiclass Classification of Driver Perceived Workload Using Long
Short-Term Memory based Recurrent Neural Network. 2018 IEEE Intelligent
Vehicles Symposium (IV), 1–6.
Moustafa, K., Luz, S., & Longo, L. (2017). Assessment of Mental Workload:
A Comparison of Machine Learning Methods and Subjective Assessment
Techniques. In L. Longo & M. C. Leva (Eds.), Human Mental Workload:
Models and Applications. H-WORKLOAD 2017. Communications in
Computer and Information Science (pp. 30–50). Springer International
Publishing.


Paxion, J., Galy, E., & Berthelon, C. (2014). Mental Workload and Driving. Frontiers
in Psychology, 5.
Saha, A., Minz, V., Bonela, S., Sreeja, S. R., Chowdhury, R., & Samanta, D. (2018).
Classification of EEG Signals for Cognitive Load Estimation Using Deep
Learning Architectures. In U. S. Tiwary (Ed.), Intelligent Human Computer
Interaction. IHCI 2018. Lecture Notes in Computer Science (pp. 59–68).
Springer International Publishing.
Sakai, M. (2013). Kernel Nonnegative Matrix Factorization with Constraint
Increasing the Discriminability of Two Classes for the EEG Feature
Extraction. 2013 International Conference on Signal-Image Technology &
Internet-Based Systems (SITIS), 966–970.
Sherwani, F., Shanta, S., Ibrahim, B. S. K. K., & Huq, M. S. (2016). Wavelet based
Feature Extraction for Classification of Motor Imagery Signals. 2016 IEEE
EMBS Conference on Biomedical Engineering and Sciences (IECBES),
360–364.
Solomon, O. M. (1991). PSD Computations using Welch’s Method (tech. rep.). Sandia
National Laboratories. Washington, DC, USA.
Tharwat, A. (2021). Classification Assessment Methods. Applied Computing and
Informatics, 17 (1), 168–192.
Thomas, P., Morris, A., Talbot, R., & Fagerlind, H. (2013). Identifying the Causes of
Road Crashes in Europe. Annals of Advances in Automotive Medicine, 57,
13–22.
Tzallas, A., Tsipouras, M., & Fotiadis, D. (2009). Epileptic Seizure Detection in
EEGs Using Time-Frequency Analysis. IEEE Transactions on Information
Technology in Biomedicine, 13 (5), 703–710.
Verwey, W. B. (2000). On-line Driver Workload Estimation. Effects of Road Situation
and Age on Secondary Task Measures. Ergonomics, 43 (2), 187–209.
Wen, T., & Zhang, Z. (2017). Effective and Extensible Feature Extraction
Method using Genetic Algorithm-based Frequency-domain Feature Search
for Epileptic EEG Multiclassification. Medicine, 96 (19), e6879.
Wen, T., & Zhang, Z. (2018). Deep Convolution Neural Network and
Autoencoders-Based Unsupervised Feature Learning of EEG Signals. IEEE
Access, 6, 25399–25410.
Yin, Z., & Zhang, J. (2016). Recognition of Cognitive Task Load levels using single
channel EEG and Stacked Denoising Autoencoder. Proceedings of the 35th
Chinese Control Conference (CCC), 3907–3912.
Zarjam, P., Epps, J., & Chen, F. (2011). Spectral EEG Features for Evaluating
Cognitive Load. 2011 Annual International Conference of the IEEE
Engineering in Medicine and Biology Society (EMBC), 3841–3844.
Zarjam, P., Epps, J., & Lovell, N. H. (2015). Beyond Subjective Self-Rating:
EEG Signal Classification of Cognitive Workload. IEEE Transactions on
Autonomous Mental Development, 7 (4), 301–310.

Paper A2

A Novel Mutual Information based Feature Set for Drivers' Mental Workload
Evaluation using Machine Learning

Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., Aricò, P., Borghini, G.
& Di Flumeri, G.
Paper A2

A Novel Mutual Information Based Feature Set for Drivers' Mental Workload
Evaluation using Machine Learning†

Abstract
Analysis of physiological signals, electroencephalography more specifically,
is considered a very promising technique for obtaining objective measures for
mental workload evaluation; however, it requires a complex apparatus to record
and therefore offers poor usability for monitoring in-vehicle drivers' mental
workload. This study proposes a methodology for constructing a novel mutual
information-based feature set from the fusion of electroencephalography and
vehicular signals acquired through a real driving experiment, and deploys it
in evaluating drivers' mental workload. The mutual information between
electroencephalography and vehicular signals was used as the prime factor for
the fusion of features. In order to assess the reliability of the developed
feature set, mental workload score prediction, classification and event
classification tasks were performed using different machine learning models.
Moreover, features extracted from electroencephalography alone were used to
compare the performance. In the prediction of the mental workload score,
expert-defined scores were used as the target values; for the classification
tasks, true labels were set from the contextual information of the experiment.
An extensive evaluation of every prediction task was carried out using
different validation methods. In predicting the mental workload score from the
proposed feature set, the lowest mean absolute error was 0.09, and in
classifying mental workload the highest accuracy was 94%. According to the
outcome of the study, it can be stated that the novel mutual information based
features developed through the proposed approach can be employed to classify
and monitor in-vehicle drivers' mental workload.
† © 2020 by the Authors (CC BY 4.0). Reprinted from Islam, M. R., Barua, S., Ahmed, M. U.,

Begum, S., Aricò, P., Borghini, G., & Di Flumeri, G. (2020). A Novel Mutual Information Based
Feature Set for Drivers’ Mental Workload Evaluation Using Machine Learning. Brain Sciences,
10 (8), 551.

Keywords: Electroencephalography · Feature Extraction · Machine Learning · Mental Workload · Mutual Information · Vehicular Signal.

A2.1 Introduction

Driving is a dynamic and complex set of synchronous actions including various
secondary tasks, i.e., simultaneous cognitive, spatial and visual tasks. The rapid
increase of in-vehicle systems like telematics and infotainment increases the number
of secondary tasks accompanying the primary task of driving. Along with the workload
of natural driving, secondary tasks and different road environments increase the
Mental Workload (MWL) of drivers. An excessive in-vehicle MWL, eventually causing
mental fatigue if prolonged over time, can lead to significantly deteriorated
driving performance and makes the driver more vulnerable to making mistakes (Kar
et al., 2010; Kim et al., 2018). A study revealed that 72% of all road accidents
each year happen due to driver errors (Thomas et al., 2013). The overwhelming
increase in traffic fatalities due to elevated MWL forces the need for determining
in-vehicle drivers' MWL efficiently. Researchers from diverse domains have
identified drivers' MWL assessment mechanisms both in simulated and real
environments (Brookhuis & de Waard, 2010; Kar et al., 2010; Almahasneh et al.,
2015). Physiological measures, particularly Electroencephalography (EEG), have
been shown to be a suitable measure of MWL (Begum & Barua, 2013; Aricò,
Borghini, Di Flumeri, Colosimo, Pozzi, et al., 2016; Aricò, Borghini, Di Flumeri,
Sciaraffa, et al., 2017). On the other hand, the process of acquiring EEG signals
during natural driving requires complex equipment in addition to the in-vehicle
systems; as a result, in-vehicle recording of EEG is not favorable to natural
driving. At this point, an approach to drivers' MWL monitoring that requires
minimal utilization of EEG signals is essential.
Several studies have exploited vehicular parameters such as lateral speed,
steering wheel angle, lane changes, etc., as a complementary measure to EEG to
obtain insight into the driver's psycho-physiological state (Solovey et al., 2014;
Rahman et al., 2020). Moreover, vehicular parameters are not obstructive during
driving, in contrast to EEG recording. Therefore, it would be possible to (i)
utilize the association between vehicular parameters and EEG signals, in terms of
Mutual Information (MI) (Cover & Thomas, 2006), to develop a feature template
establishing their combined effect on MWL, and then (ii) use this feature template
to evaluate the in-vehicle driver's MWL from the vehicular parameters alone, which
can be easily extracted from the built-in systems of a vehicle. More specifically,
the conceived application is to record EEG signals once for a specific driver and a
specific vehicle, along with different vehicular features, while driving, taking
advantage of the added value of the neurophysiological data (i.e., EEG). A feature
template is created combining the underlying characteristics of the EEG and
vehicular signals, thus enhancing the statistical power of the prediction models.
This feature template is then fed with only vehicular data to generate an online
assessment of the in-vehicle MWL of the driver, thereby avoiding both the repeated
use of invasive devices for recording EEG signals in the vehicle and the associated
complex computations.
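As a simplified illustration of this idea, the sketch below estimates the mutual
information between hypothetical vehicular and EEG features with scikit-learn; it is
a stand-in for the fusion procedure developed in this paper, not the exact method,
and all array contents are synthetic.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_epochs = 500
vehicular = rng.normal(size=(n_epochs, 4))   # e.g., speed, acceleration, steering angle, lane offset
eeg = rng.normal(size=(n_epochs, 6))         # e.g., frontal theta and parietal alpha band powers

# MI of each vehicular feature with each EEG feature: rows = vehicular, columns = EEG.
mi = np.array([mutual_info_regression(vehicular, eeg[:, j])
               for j in range(eeg.shape[1])]).T

# Vehicular features most associated with the EEG activity could then be weighted or
# retained to build the feature template used for MWL evaluation.
print(mi.round(3))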


In this context, this study further investigated the possible association between
vehicular and EEG signals and their relationship with the MWL of drivers while
driving. In particular, the present work validates the fusion of the mentioned
signals with the aim of developing a feature set that can be used for in-vehicle
drivers' MWL evaluation, with a provision for reducing the complexity of recording
EEG signals repeatedly in the concerned tasks. The aims of this study can be
outlined as follows:

• Develop a new feature fusion methodology for producing a "feature template"
from vehicular and EEG signals. This template can be used to generate a
feature set utilizing only vehicular signals for evaluating in-vehicle drivers' MWL.
• Assess the reliability of the feature set developed with the proposed
methodology.
• Validate the performance of machine learning (ML) models in quantifying
and classifying drivers' MWL using the features extracted with the proposed
methodology.
The remaining sections of this article are organized as follows. The background
of the research domain and several related works are described in Section A2.2.
Section A2.3 contains a detailed description of the experimental setup, data
collection, analysis, feature set generation and validation of the feature set using
regression and classification. The outcomes of the performed methodologies and
discussions of these outcomes are provided in Sections A2.4 and A2.5, respectively.
In conclusion, a summary and the possible future of this work are discussed in
Section A2.6.

A2.2 Background and Related Works

The task of driving is a combination of several dynamic and complex activities
that include simultaneous visual, cognitive and spatial tasks (Kim et al., 2018).
Fastenmeier and Gstalter defined driving as a human-machine system that
continuously changes with the environment. The components of the environment are
traffic flow (high or low), road layout (straight, junctions, roundabouts or
curves), road design (motorways, city or rural), weather (rainy, snowy or windy),
time of day (morning, midday or evening), etc. These components define the overall
complexity of the driving task (Fastenmeier & Gstalter, 2007). Furthermore, various
studies outline driving as a hierarchy of tasks at three levels. Strategic tasks
like decision making constitute the first level. Above the strategic tasks, the
second level comprises tasks like maneuvering or reacting in response to changes in
the environment, which is termed the tactical level. The third level is called the
operational level, which includes controlling the vehicle. The first two levels
demand voluntary processing and observation of various elements of the environment
by the drivers. On the other hand, tasks at the third level are performed
automatically, depending on the driver's experience, and involve less processing of
the surrounding information. Miscellaneous tasks associated with the primary task,
i.e., controlling the vehicle, tend to increase the MWL of drivers, which results
in errors (Paxion et al., 2014; Kim et al., 2018).


In the twenty-first century, driving a vehicle causes extensive irregularities in
the MWL of drivers (da Silva, 2014). With the increasing number of vehicles on the
road and of in-vehicle technologies, the task of driving is getting more complex,
resulting in high MWL. The term workload can relate to both physical and/or mental
assets and task demands; in the case of driving, MWL is the more appropriate notion,
and it varies considerably depending on the driver's capabilities and the required
task demands (Wickens et al., 2008). It is observed that both high and low MWL can
impede driving performance (Fisher et al., 2011). Higher MWL than normal can lead
to diverted attention, distraction, and inadequate time and capacity for
information processing. On the other hand, low MWL can result in slower reactions
to events, reduced attention and reduced alertness. Thus, as a complex task, driving
demands both a psychological and a physiological undertaking in which MWL is an
ineluctable aspect (Galante et al., 2018). A study dedicated to finding the causes
of road accidents demonstrates that human error contributes directly or indirectly
to 90% of the accidents (Sam et al., 2016). Because of the association of drivers'
MWL with committing errors while driving, and since these errors have been
demonstrated to be a principal contributing factor in road accidents, research on
determining the in-drive MWL of drivers is extremely urgent and important.

A2.2.1 Assessment of Drivers’ Mental Workload


A substantial amount of research work has been performed on assessing the MWL of
humans dealing with operational activities, but most of it concerns the aviation
sector rather than automobiles (da Silva, 2014). However, aviation involves only a
small selection of pilots, which is easier to exploit, whereas the automobile domain
comprises a comparatively higher number of drivers with diverse backgrounds,
experience, skills and age groups, which results in more complex research work.
Generally, irrespective of the domain, MWL is assessed in different ways. The
methods can be assembled into three classes (Moustafa et al., 2017):
1. Subjective measures, e.g., NASA Task Load Index (NASA-TLX), workload
profile (WP), etc.
2. Task performance measures, e.g., time to complete a task, reaction time to a
secondary task, etc.
3. Physiological measures, e.g., EEG, heart rate (HR), etc.
In comparison with the subjective measures, the physiological measures are
primarily objective in nature and can be accumulated without imposing additional
tasks on the participant. Contrarily, gathering task performance measures requires
additional secondary tasks while driving, whereas the primary task is already
overloaded with diverse secondary tasks. Nevertheless, physiological measures can
assess the mental impairment of the participant without imposing additional tasks
or degrading the performance of the primary task (Begum & Barua, 2013; Aricò,
Borghini, Di Flumeri, Colosimo, Pozzi, et al., 2016). According to Guzik and Malik
(2016), physiological measures are often selected over other measures as a means of
assessing MWL because of cheap and compact technologies. Respiration, blood
pressure, skin conductance, cardiac activities, brain measures, ocular measures,
etc., are noteworthy instances of physiological measures. Owing to the abundant
accessibility of the technology, its portability and its capability of indicating
physiological activity, more specifically neural activation, EEG signals have been
widely chosen by researchers to assess the MWL of drivers while driving. In a recent
review of works on drivers' MWL, Charles and Nixon mention that most research works
are carried out using EEG signals as a tool to measure MWL (Barua et al., 2017;
Charles & Nixon, 2019). In addition, it has been established through research that
a significant association exists between MWL and EEG features extracted in the time
and frequency domains. Features such as waveform length, zero crossings, mean
absolute value and slope sign changes are extracted from EEG in the time domain and
further utilized in classification tasks in the domain of brain-computer interfacing
(Geethanjali et al., 2012). On the other hand, the Alpha and Theta wave rhythms of
EEG signals, over the parietal and frontal regions of the brain respectively,
significantly illustrate the MWL variation of participants (Di Flumeri et al., 2018,
2019).
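For completeness, the time-domain descriptors named above can be computed for a
single-channel epoch as in the following sketch; this is a common formulation and
not necessarily the exact one used in the cited works.

import numpy as np

def time_domain_features(x):
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)
    mav = np.mean(np.abs(x))                                   # mean absolute value
    wl = np.sum(np.abs(dx))                                    # waveform length
    zc = np.sum(np.diff(np.signbit(x).astype(int)) != 0)       # zero crossings
    ssc = np.sum(np.diff(np.signbit(dx).astype(int)) != 0)     # slope sign changes
    return {"MAV": mav, "WL": wl, "ZC": zc, "SSC": ssc}

# Example on a synthetic 2 s epoch sampled at 256 Hz.
t = np.arange(0, 2, 1 / 256)
epoch = np.sin(2 * np.pi * 10 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(time_domain_features(epoch))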
Computationally expensive methods like statistical analysis and signal processing
are largely deployed to transform the EEG signals into features that can be directly
used for measuring MWL. The literature indicates a variety of approaches to extract
features from EEG signals, for example, a non-linear approach using fractal
dimensions, discrete wavelet transform, non-negative matrix factorization, time and
frequency domain analysis, etc. (Sakai, 2013; Ahmad et al., 2014; Sherwani et al.,
2016; Begum et al., 2017; Barua, 2019). Recently, the use of Deep Learning (DL)
techniques has increased in this domain to reduce the complexity of adopting the
mentioned methods. A Convolutional Neural Network (CNN) was used by Wen and Zhang
(2018) for unsupervised feature learning from EEG signals in classifying epilepsy
patients. In addition to CNNs, the use of Long Short-Term Memory (LSTM) (Manawadu
et al., 2018), Deep Belief Networks (DBN) (Li et al., 2015), Stacked Denoising
Autoencoders (SDAE) (Yin & Zhang, 2016), etc., is also observed in the literature.
After extracting features from the EEG signal, different ML algorithms are widely
used, namely, Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), Fuzzy
c-Means Clustering, Multi-Layer Perceptron (MLP), etc. (Saha et al., 2018).
In summary, the prevailing methods of assessing the in-vehicle MWL of drivers
require an extensive setup to collect physiological signals. On top of that, complex
analysis and computation are required to extract the expected outcome, let alone to
deploy it further. However, almost all modern vehicles provide the means to record
different parameters of vehicle maneuvering, e.g., velocity and acceleration.
Solovey et al. (2014) utilized such vehicular data, aligned with physiological data,
to evaluate automotive user interfaces. To our knowledge, no work has yet considered
only the vehicular data in assessing drivers' MWL. This prior work builds the
foundation of the present study: to employ vehicular data with a pre-compiled hybrid
template of vehicular and physiological data for assessing the in-vehicle MWL of
drivers, which may reduce the complexity of the in-vehicle setup and the extensive
analysis of physiological measures.


A2.3 Materials and Methods

A2.3.1 Experimental Protocol


This study is part of a larger study performed in real driving conditions (Di Flumeri
et al., 2018, 2019; Islam et al., 2019). Twenty male participants (24.9 ± 1.8 years old,
licensed from 5.9 ± 1 years, with a mean annual mileage of 10,350 km/year) were
recruited. They were selected in order to have a homogeneous experimental group in
terms of age, sex and driving expertise. The experiment was conducted following the
principles outlined in the Declaration of Helsinki of 1975, as revised in 2000 (World
Medical Association, 2001). Informed consent and authorization to use the video
graphical material were obtained from each subject on paper, after the explanation
of the study.
Two identical cars were used for the experiments, i.e., Fiat 500 L 1.3 Mjt, with a
diesel engine and manual transmission. The subjects had to drive the car along a
route going through urban roads at the periphery of Bologna (Italy). In particular,
the route consisted of three laps of a "circuit" about 2500 m long, driven during
the day with no significant darkness.
According to previous evidence in the scientific literature (Section A2.2), the
difficulty of the driving task was modulated through two variables: road complexity
and traffic intensity. In terms of road complexity, the circuit was designed with the
aim to include two segments of interest, both about 1000 m long, but different in
terms of typology and thus cognitive demand, so named hereafter “Easy” and “Hard”:
(i) Easy was a secondary road, mainly straight, with an intersection halfway with
the right-of-way, one lane and low traffic capacity, serving a residential area; (ii)
Hard was a main road, mainly straight, with two roundabouts halfway, three lanes
and high traffic capacity, serving a commercial area. This factor is hereafter named
“ROAD”. This assumption was made on the basis of several studies in the scientific
literature about road safety and behavior (Harms, 1991; Verwey, 2000; Paxion et al.,
2014).
In terms of traffic intensity, each participant had to repeat the task two times
within the same day, one time during rush and one during normal hours: this factor is
hereafter named “HOUR”, while the two conditions are named “Rush” and “Normal”.
The rush hours of that specific area have been determined according to the General
Plan of Urban Traffic of Bologna (PGTU): the two “Rush hour” time-windows were
from 12:30 to 13:30 (lunch time) and from 16:30 to 17:30 (work closing time), with the experiments performed from 9:30 to 17:30, in order to ensure a homogeneous daylight
condition. This experimental hypothesis was statistically validated by analyzing, per
each subject per each “HOUR”, the number of vehicles encountered along the track
as well as the driving speed. The analysis is reported in a previous work obtained
from the same experiment (Di Flumeri et al., 2018).
To summarise, each subject, after a proper experimental briefing, performed a
driving task of three laps along a circuit through urban roads two times, during
Rush and Normal hours. Each lap consisted of a Hard and Easy segment, where
hard and easy refer to the road complexity, respectively a main and a secondary
road, as depicted before. The order of Rush and Normal conditions was randomized
among the subjects, in order to avoid any order effect (Kirk, 2012). Also, despite
the initial briefing, the first lap of both the tasks was considered an “adaptation lap”,


while the data recorded during the second and third laps were taken into account for
the analysis. Figure A2.1 illustrates the overview of the experimental protocol.

Figure A2.1: Summary of the experimental protocol. The experiment was carried
out with two driving tasks, which were different in terms of traffic (Normal and Rush
hour), and performed in a randomized order. Each of the driving tasks was comprised
of three laps: The 1st lap was intended to make the driver habituated to the circuit, and
the other (2nd and 3rd) laps were used for analysis. Moreover, events were introduced in the 3rd lap so that the effect of different road scenarios could be assessed in their absence and presence, respectively. © 2020 by Islam et al. (CC BY 4.0).

During the whole protocol, physiological data, in terms of brain activity through the Electroencephalographic (EEG) technique and eye gazes through
Eye-Tracking (ET) devices, and data about driving behavior (vehicular data),
through a professional device mounted on the car (i.e., a VBOX Pro), were recorded
by guaranteeing time-synchronization among the different devices. In addition,
subjective measures of perceived MWL were collected from the subjects after both
the tasks through the NASA Task Load indeX (NASA-TLX) questionnaire (Hart &
Staveland, 1988). For the purposes of the present study, only EEG and vehicular data
were considered, while the eye-tracker data and subjective measures were used in previous works to validate the experimental design (Di Flumeri et al., 2018, 2019).

A2.3.2 Data Collection


A2.3.2.1 EEG Data Recording and Processing
The EEG signals were recorded using the digital monitoring BEmicro system
(EBNeuro, Italy). Fifteen EEG channels (Fpz, Fz, Pz, POz, Oz, AF3, AF4, F3, F4, P3, P4, P5, P6, O1 and O2), placed according to the 10-20 International
System, were collected with a sampling frequency of 256 Hz, all referenced to both
the earlobes, grounded to the Cz site and with the impedance kept below 20 kΩ.
During the experiments, raw EEG data were recorded and the whole processing
chain was applied offline. In particular, EEG signal was firstly band-pass filtered
with a fourth-order Butterworth infinite impulse response (IIR) filter (high-pass filter
cut-off frequency: 1 Hz, low-pass filter cut-off frequency: 30 Hz). The Fpz channel was used to remove eyes-blink contributions from each channel of the EEG signal by using the Regressive Eye BLINk Correction Algorithm (REBLINCA)
(Di Flumeri et al., 2016). This step is necessary because the eyes-blink contribution
could affect the frequency bands correlated to the MWL, in particular, the theta EEG
band. This method allows us to correct the EEG signal, even online (Aricò, Borghini, Di Flumeri, Colosimo, Bonelli, et al., 2016; Di Flumeri, De Crescenzio, et al., 2019), without losing data and without requiring additional sensors, such as electro-oculographic ones.
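As a rough illustration of the filtering step described above, the minimal Python sketch below applies a fourth-order Butterworth band-pass filter (1–30 Hz) to a multi-channel EEG array sampled at 256 Hz. The use of SciPy and of zero-phase forward-backward filtering is an assumption for illustration only, not the exact implementation used in the study.

```python
# Minimal sketch of the band-pass filtering step (assumed SciPy-based implementation).
import numpy as np
from scipy.signal import butter, filtfilt

FS = 256.0  # EEG sampling frequency in Hz

def bandpass_eeg(eeg, low_hz=1.0, high_hz=30.0, order=4, fs=FS):
    """Apply a fourth-order Butterworth band-pass filter to each EEG channel.

    eeg: array of shape (n_channels, n_samples).
    """
    nyq = fs / 2.0
    b, a = butter(order, [low_hz / nyq, high_hz / nyq], btype="bandpass")
    # filtfilt applies the IIR filter forward and backward (zero phase, an assumption here).
    return filtfilt(b, a, eeg, axis=-1)

# Example with random data standing in for a 15-channel recording.
eeg_raw = np.random.randn(15, 10 * int(FS))
eeg_filtered = bandpass_eeg(eeg_raw)
```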
For other sources of artefacts (e.g., environmental noise and drivers’ movements),
specific procedures of the EEGLAB toolbox (Delorme & Makeig, 2004) were
employed. Firstly, the EEG signal was segmented into epochs of 2 s (Epoch
length), through moving windows shifted of 0.125 s (Shift), thus with an overlap
of 0.875 s between two contiguous epochs. This windowing was chosen with the
compromise to have both a high number of observations, in comparison with the
number of variables, and to respect the condition of stationarity of the EEG signal
(Elul, 1969). In fact, this is a necessary assumption in order to proceed with the
spectral analysis of the signal. Then, three automatic methods were applied in
order to recognize, and therefore eliminate, artefact epochs (Aricò, Borghini, Di
Flumeri, Colosimo, Pozzi, et al., 2016): (i) threshold criterion, recognizing EEG epochs with the signal amplitude exceeding ±100 µV; (ii) trend estimation, where, once the EEG epoch has been interpolated, the epoch is marked as "artefact" if its slope is higher than 10 µV/s; (iii) sample-to-sample criterion, recognizing EEG epochs with sample-to-sample differences, in terms of absolute amplitude, higher than 25 µV (i.e., an abrupt, non-physiological variation). The percentage of the
rejected data, averaged among the subjects, was 9.3% ± 11% (standard deviation).
At the end, the resulting EEG signal was considered “clean”.
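The epoching and the three automatic artefact criteria can be summarised with a short sketch such as the one below. It assumes the filtered signal is a NumPy array and reimplements only the thresholds stated above (±100 µV amplitude, 10 µV/s slope of the interpolated trend, 25 µV sample-to-sample jump) for illustration, without the EEGLAB-specific machinery actually used in the study.

```python
import numpy as np

FS = 256                      # sampling frequency (Hz)
EPOCH = int(2.0 * FS)         # 2 s epochs
SHIFT = int(0.125 * FS)       # 0.125 s shift -> 0.875 s overlap

def epochs(signal):
    """Yield (start, epoch) pairs of shape (n_channels, EPOCH) from a (n_channels, n_samples) array."""
    for start in range(0, signal.shape[1] - EPOCH + 1, SHIFT):
        yield start, signal[:, start:start + EPOCH]

def is_artefact(epoch, fs=FS):
    """Apply the three rejection criteria described in the text (illustrative reimplementation)."""
    t = np.arange(epoch.shape[1]) / fs
    # (i) threshold criterion: amplitude exceeding +/- 100 uV on any channel
    if np.any(np.abs(epoch) > 100.0):
        return True
    # (ii) trend estimation: slope of the fitted linear trend above 10 uV/s on any channel
    slopes = np.polyfit(t, epoch.T, 1)[0]
    if np.any(np.abs(slopes) > 10.0):
        return True
    # (iii) sample-to-sample criterion: absolute jumps above 25 uV
    if np.any(np.abs(np.diff(epoch, axis=1)) > 25.0):
        return True
    return False
```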
From the clean EEG dataset, the Power Spectral Density (PSD) was calculated for
each EEG channel for each epoch using the Fast Fourier Transformation (FFT) and a
Hanning window of the same length of the considered epoch (2 s length, that means
0.5 Hz of frequency resolution). Then, the EEG frequency bands of interest was
defined for each subject by the estimation of the Individual Alpha Frequency (IAF)
value (Corcoran et al., 2018). In order to have a precise estimation of the alpha
peak and, hence of the IAF, a “Closed Eyes” resting condition, one minute long,
was recorded for each participant before starting the experimental tasks. Finally,
a spectral features matrix (EEG channels × Frequency bins) was obtained in the
frequency bands directly correlated to the MWL. In particular, only the theta band
[IAF − 6, IAF − 2], over the EEG frontal channels, and the alpha band [IAF − 2, IAF + 2], over the EEG parietal channels, were considered as variables for the
MWL evaluation, as demonstrated in previous scientific literature (Gevins & Smith,
2003; Borghini et al., 2014; Di Flumeri et al., 2015). In fact, the ratio between
Frontal Theta and Parietal Alpha spectral content is considered as one of the most
sensitive biomarkers of human mental workload (Gevins et al., 1998; Smith et al.,
1999; Antonenko, 2007; Lei & Roetting, 2011; Borghini et al., 2014). In particular, in terms of the feature domain, it consisted of a matrix, for each subject and each epoch, of 187 PSD values (11 EEG channels × 17 frequency bins, from IAF − 6 Hz to IAF + 2 Hz with a resolution of 0.5 Hz). In practice, only 99 of these features can be selected by the algorithm, because the Regions of Interest are defined a priori: 45 features related to frontal Theta and 54 related to parietal Alpha.
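As a simplified stand-in for the spectral step, the sketch below computes the PSD of one 2 s epoch with a Hanning window (0.5 Hz resolution) and extracts theta and alpha band power relative to a given IAF. The use of SciPy's periodogram, the placeholder IAF value and the averaging over bins are assumptions for illustration.

```python
import numpy as np
from scipy.signal import periodogram

FS = 256  # Hz; a 2 s epoch gives 0.5 Hz frequency resolution

def epoch_psd(epoch):
    """PSD of a (n_channels, 512) epoch using a Hanning window over the full epoch length."""
    freqs, psd = periodogram(epoch, fs=FS, window="hann", nfft=epoch.shape[1], axis=-1)
    return freqs, psd  # psd has shape (n_channels, n_freqs)

def band_power(freqs, psd, f_lo, f_hi):
    """Average PSD in [f_lo, f_hi] for each channel."""
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return psd[:, mask].mean(axis=1)

# Example: theta and alpha bands defined relative to the Individual Alpha Frequency (IAF).
iaf = 10.0                                   # assumed IAF in Hz (placeholder)
epoch = np.random.randn(15, 2 * FS)          # placeholder 2 s epoch of 15 channels
freqs, psd = epoch_psd(epoch)
theta = band_power(freqs, psd, iaf - 6, iaf - 2)   # frontal theta band in the study
alpha = band_power(freqs, psd, iaf - 2, iaf + 2)   # parietal alpha band in the study
```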

A2.3.2.2 EEG-Based Mental Workload Index Computation


At this point, the automatic-stop-StepWise Linear Discriminant Analysis (asSWLDA), a specific Machine-Learning algorithm (basically an upgraded version of the well-known StepWise Linear Discriminant Analysis) previously developed (Aricò, Borghini, Di Flumeri, Colosimo, Pozzi, et al., 2016), patented (Aricò, Borghini, Di
Flumeri, & Babiloni, 2017) and applied in different applications (Borghini, Aricò, Di
Flumeri, Cartocci, et al., 2017; Borghini, Aricò, Di Flumeri, Sciaraffa, et al., 2017;
Di Flumeri, De Crescenzio, et al., 2019; Di Flumeri, Borghini, Aricò, Sciaraffa, et al.,
2019) by some of the authors was employed. On the basis of the calibration dataset,
the asSWLDA was able to find the most relevant spectral features to discriminate
the MWL of the subjects during the different experimental conditions (i.e., EASY =
0 and HARD = 1). Once it identified such spectral features, the asSWLDA assigns to each feature specific weights ($w_{i,train}$), plus a bias ($b_{train}$), such that an eventual discriminant function computed on the training dataset ($y_{train}(t)$) would take the value 1 in the hardest condition and 0 in the easiest one. This step represents the calibration, or “Training phase”, of the classifier. Later on, the weights and the bias determined during the training phase were used to calculate the Linear Discriminant function ($y_{test}(t)$) over the testing dataset (Testing phase), that should be comprised between 0 (if the condition is Easy) and 1 (if the condition is Hard). Finally, a moving average of 8 s (8MA) was applied to the $y_{test}(t)$ function in order to smooth it out by reducing the variance of the measure: its output was defined as the EEG-based MWL index ($MWL_{SCORE}$). For the present work, the training data consisted of the
Easy segment of the 2nd lap during the “Normal” condition and the “Hard” segment
of the 2nd lap during the “Rush” condition (the two conditions were hypothesized
to be characterized by the lowest and highest MWL demand, respectively), while
the testing data consisted of the data of the 3rd lap of both the conditions. This
hypothesis was validated by previous analysis performed on the same experiment
(Di Flumeri et al., 2018).
The training asSWLDA discriminant function (Equation (A2.1), where $f_{i,train}(t)$ represents the PSD matrix of the training dataset for the data window of the time sample $t$ and of the $i$th feature), the testing one (Equation (A2.2), where $f_{i,test}(t)$ is as $f_{i,train}(t)$ but related to the testing dataset) and the equation of the EEG-based MWL index computed with a time resolution of 8 s ($MWL_{SCORE}$, Equation (A2.3)) are reported below.
$$y_{train}(t) = \sum_{i} w_{i,train} \cdot f_{i,train}(t) + b_{train} \qquad (A2.1)$$

$$y_{test}(t) = \sum_{i} w_{i,train} \cdot f_{i,test}(t) + b_{train} \qquad (A2.2)$$

$$MWL_{SCORE} = 8MA(y_{test}(t)) \qquad (A2.3)$$
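Since the asSWLDA itself is patented and not reproduced here, the following sketch only illustrates how Equations (A2.1)–(A2.3) would combine already-selected features, learned weights, a bias and an 8 s moving average into the MWL index; the weight values, the sample rate of the discriminant function and the window handling are placeholders.

```python
import numpy as np

def discriminant(features, weights, bias):
    """Equations (A2.1)/(A2.2): y(t) = sum_i w_i * f_i(t) + b for every time sample.

    features: array of shape (n_samples, n_selected_features).
    """
    return features @ weights + bias

def mwl_score(y_test, samples_per_second=8, window_s=8):
    """Equation (A2.3): 8 s moving average of the testing discriminant function.

    samples_per_second is the number of discriminant samples per second (an assumption here).
    """
    win = int(window_s * samples_per_second)
    kernel = np.ones(win) / win
    return np.convolve(y_test, kernel, mode="same")

# Placeholder usage with random weights standing in for the asSWLDA training phase.
w, b = np.random.randn(99) * 0.01, 0.0
f_test = np.random.rand(1000, 99)          # PSD features of the testing dataset
score = mwl_score(discriminant(f_test, w, b))
```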

A2.3.2.3 Vehicular Data


Each car was equipped with a Video VBOX Pro (Racelogic Ltd, Buckingham, UK), a
system able to continuously monitor the cinematic parameters of the car, integrated
with GPS data and videos coming from up to four high-resolution cameras. The
system was fixed within the car, at the center of the floor of the back seats, in order
to put it as close as possible to the car barycenter, while two cameras were fixed over
the top of the car. The system recorded car parameters (i.e., velocity, acceleration,
lateral and longitudinal acceleration) with a sampling rate of 10 Hz.


With the availability of the vehicular data, its nature was investigated at the
group level with respect to different traffic situations, road conditions, presence of
events and type of events. Moreover, the changes in drivers' MWL were also studied alongside, and prominent trends of change were observed. In the exploratory analysis,
comparison of mean values and two-sided Wilcoxon signed-rank tests (Wilcoxon,
1992) were performed considering the null hypothesis, H0 : there is no difference
between the observations of the two measurements and the alternate hypothesis, H1 :
the observations of the two measurements are not equal, with level of significance of
0.05. Figure A2.2 illustrates the change in drivers’ average MWL score and velocity
in different traffic hour and road conditions along with the standard deviations. A
two-sided Wilcoxon signed-rank test was used to analyze the MWL of drivers on Easy
and Hard segments of the track to test if the change in segment had a significant effect
on the MWL. Drivers’ MWL while driving on the Easy segment was lower (0.42±0.32)
compared to the Hard segment (0.51 ± 0.27); this was a statistically significant increase in MWL (t = 0.0, p = 0.012). Conversely, on the Easy segment of
the track, participating drivers maintained average velocity 44.69 ± 14.21 kilometers
per hour (km/h) whereas the average velocity dropped to 37.81 ± 11.83 km/h on the
Hard segment. A two-sided Wilcoxon signed-rank test on the driving velocities of all
the participants for the Easy and Hard segments produced t = 0.0, p = 0.12, which does not indicate a statistically significant difference in velocity between the road segments. A similar trend of
increasing MWL was observed while drivers drove during Normal (0.40 ± 0.26) and
Rush (0.45 ± 0.34) hours. A two-sided Wilcoxon signed-rank test on drivers’ MWL
for driving during different hours produced t = 3.0, p = 0.036, signifying the change
in MWL. On the other hand, average driving velocity during Normal hour was 42.39
± 13.70 km/h, which reduced to 40.98 ± 13.57 km/h in Rush hour. According to
the result of a two-sided Wilcoxon signed-rank test (t = 14.0, p = 0.575), there were
no significant difference between driving velocity during Normal and Rush hour.
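The paired comparisons reported above can be reproduced in outline with SciPy's Wilcoxon signed-rank test on per-driver averages; the arrays below are placeholders standing in for the nine drivers' mean values in two paired conditions.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder per-driver averages (nine drivers) for two paired conditions,
# e.g., mean MWL score on the Easy vs. the Hard road segment.
mwl_easy = np.array([0.30, 0.45, 0.55, 0.20, 0.60, 0.40, 0.35, 0.50, 0.42])
mwl_hard = np.array([0.41, 0.52, 0.63, 0.33, 0.68, 0.49, 0.47, 0.58, 0.50])

# Two-sided test of H0: no difference between the two paired measurements.
stat, p_value = wilcoxon(mwl_easy, mwl_hard, alternative="two-sided")
print(f"t = {stat}, p = {p_value:.3f}")  # compare against the 0.05 significance level
```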


Figure A2.2: Average MWL score and velocity of nine participating drivers in different
(a) road segments and (b) traffic hours. The standard deviations are indicated, the
p-values obtained from the two-sided Wilcoxon signed-rank tests are presented and
significant values at the 5% significance level are marked with asterisks (*). © 2020 by
Islam et al. (CC BY 4.0).

Two different events, a car and a pedestrian, were introduced during the 3rd lap with a view to mimicking general road users and observing their effect on drivers' MWL and vehicle handling. A comparative investigation, considering the third lap with respect to the second one (without any event), revealed that the MWL of drivers increased by about 30%. Drivers' average MWL with no additional
event was 0.38 ± 0.22 whereas average MWL increased to 0.48 ± 0.29 in presence
of simultaneously participating road users. A two-sided Wilcoxon signed-rank test
indicated that drivers' MWL with no event was statistically significantly lower than the MWL while driving with events (t = 0.0, p = 0.012). On the other hand, the
average driving velocity without events was 44.33 ± 14.52 km/h and in presence of
events the average velocity was 42.77 ± 13.80 km/h which came out as statistically
insignificant (t = 9.0, p = 0.208). The difference in the type of event did not affect the average MWL of drivers, 0.44 ± 0.25 in the presence of a car and 0.46 ± 0.30 in the presence of a pedestrian. From the outcome of the Wilcoxon test (t = 8.0, p = 0.161), the change is also statistically insignificant. Again, the average velocity was lower in the presence of a car (40.98 ± 15.06 km/h) than of a pedestrian (48.67 ± 10.53 km/h).
A two-sided Wilcoxon signed-rank test indicated that the change in average velocity
was statistically significant (t = 0.0, p = 0.012). Illustrations of the change in average
MWL and driving velocity with standard deviation due to presence and type of events
are presented in Figure A2.3.


Figure A2.3: Average MWL score and velocity with standard deviation calculated
from the data of nine participating drivers with respect to events. Sub-figure (a)
illustrates the variation of MWL score and velocity with/without the presence of
events and Sub-figure (b) illustrates the effect of car and pedestrian on MWL score
and velocity. The p-values obtained from the two-sided Wilcoxon signed-rank tests are
presented and significant values at the 5% significance level are marked with asterisks
(*). © 2020 by Islam et al. (CC BY 4.0).

The aforementioned group-level analysis of drivers' MWL and driving velocity demonstrated significant changes in MWL due to changes in the driving environment. For the driving velocity, the two-sided Wilcoxon signed-rank tests did not produce p values that signify the differences, except in the case of different types of events. However, the performed analysis only partially addresses the need to formalize the relationship between MWL and vehicular data. As a step towards further formalization, physiological data were used collectively with vehicular data to accumulate more objective knowledge, since vehicular data do not represent direct measures for the MWL estimation of drivers.


A2.3.3 Mutual Information Based Feature Extraction


To support the primary assumption on assessing MWL using mostly vehicular data,
the first part of the analyses contains extraction of MI (Cover & Thomas, 2006)
between EEG and vehicular data. In the process of extracting MI, entropy and
conditional entropy of the variables were calculated using Corollary A2.3.1.1 of Theorem A2.3.1 (Cover & Thomas, 2006), which are presented below.
Theorem A2.3.1. Given a continuous random variable X ∈ R^d representing the available variables or observations and a continuous-valued random variable Y representing class labels, the uncertainty or entropy in drawing one sample of Y at random is, according to Shannon's definition:

$$H(Y) = E_y\left[\log_2 \frac{1}{p(y)}\right] = -\int_y p(y)\,\log_2(p(y))\,dy \qquad (A2.4)$$

After having made an observation of a variable vector x, the uncertainty of the class identity is defined in terms of the conditional density p(y|x):

$$H(Y|X) = \int_x p(x)\left[-\int_y p(y|x)\,\log_2(p(y|x))\,dy\right]dx \qquad (A2.5)$$

Reduction in class uncertainty after having observed the variable vector x is called
the mutual information between X and Y , same as the Kullback-Leibler divergence
between the joint density p(y, x) and its factored form p(y)p(x).

$$I(X, Y) = H(Y) - H(Y|X) \qquad (A2.6)$$

$$I(X, Y) = \int_y \int_x p(y, x)\,\log_2 \frac{p(y, x)}{p(y)\,p(x)}\,dx\,dy \qquad (A2.7)$$

We derived a template for producing a feature set solely from the vehicular signal using Corollary A2.3.1.1, which follows from Theorem A2.3.1.
Corollary A2.3.1.1. Given a continuous random variable E representing EEG observations and a continuous random variable V representing vehicular signals, drawn from a specific population distribution and representing the objective and the indirect measure of MWL, respectively, the mutual information I(E, V) between the variables E and V represents the mutual dependency between them by quantifying the amount of information they share collectively for estimating MWL, which can be derived using the corresponding variable vectors e and v.
In association with Corollary A2.3.1.1 and for better visualization, Figure A2.4 illustrates the concept of MI with respect to the variables used in this study. E for EEG and V for vehicular data correspond to X and Y as described in Theorem A2.3.1. The entropy values for vehicular data and EEG are represented by H(V) and
H(E). Joint entropy H(E, V ) consists of the union of the entropy spaces and mutual
information I(E, V ) in the intersecting space. Thus, H(E, V ) = H(E) + H(V ) −
I(E, V ) is derived using Set Theory. e and v represent a single instance of EEG and
vehicular signal, respectively, and m represents a single instance of I(E, V ), which
is the mutual information shared by single instances e and v. Formally, I(E, V ) is a


matrix of order p × q, where p and q are the numbers of vehicular and EEG features
respectively. Each row of the matrix represents the shared information between a
single vehicular feature and every EEG feature. Furthermore, ||I(E, V)||, the norm of each row of I(E, V), was calculated, which is a vector containing the collective magnitude of the shared information between each vehicular feature and all EEG features. ||I(E, V)|| was further used to calculate a new MI-based feature vector m′ entirely from vehicular features with the following Equation A2.8, where v′ is a new instance vector of vehicular features.

m′ = v ′ · ||I(E, V )|| (A2.8)
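A minimal sketch of this construction is given below: it estimates the MI matrix between each vehicular feature and each EEG feature, takes the row norms, and weights a new vehicular instance with them. Two assumptions are made for illustration: scikit-learn's mutual_info_regression is used as the MI estimator (the paper does not prescribe one), and Equation A2.8 is read as an element-wise weighting so that four MI-based features per instance are retained, consistent with Table A2.2.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_matrix(vehicular, eeg):
    """Estimate the p x q MI matrix between p vehicular and q EEG features.

    vehicular: (n_samples, p), eeg: (n_samples, q).
    mutual_info_regression is used here only as an assumed MI estimator.
    """
    p, q = vehicular.shape[1], eeg.shape[1]
    mi = np.zeros((p, q))
    for j in range(q):
        mi[:, j] = mutual_info_regression(vehicular, eeg[:, j])
    return mi

def mi_template(vehicular, eeg):
    """Row-wise norm ||I(E, V)||: one weight per vehicular feature."""
    return np.linalg.norm(mi_matrix(vehicular, eeg), axis=1)

def mi_features(new_vehicular, template):
    """Equation A2.8, read here as element-wise weighting of a new vehicular instance v'."""
    return new_vehicular * template

# Placeholder arrays: 4 vehicular features, 45 EEG features.
V = np.random.rand(500, 4)
E = np.random.rand(500, 45)
template = mi_template(V, E)
m_prime = mi_features(np.random.rand(4), template)   # 4 MI-based features
```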

Figure A2.4: Illustration of shared information between EEG and vehicular signal
spaces. © 2020 by Islam et al. (CC BY 4.0).

In the extraction of MI-based features from the data of this particular study, the data were represented in vector form, i.e., e for EEG and v for vehicular data, which belong to the domains E and V, respectively. Formally, E, V ∈ R^d, where d is 45 and 4, respectively, for this study. For this specific analysis, the EEG signal
was analyzed again. In fact, in the previous section of the study we employed a
well-established approach, even patented (Aricò, Borghini, Di Flumeri, & Babiloni,
2017), to obtain the EEG-based MWL reference measurements (Di Flumeri et
al., 2015; Aricò, Borghini, Di Flumeri, Colosimo, Pozzi, et al., 2016; Di Flumeri
et al., 2018). In that case, a specific a priori hypothesis (only frontal Theta
and parietal Alpha features) and processing procedures (e.g., automatic artifacts
correction/removal) were necessary for the classification algorithm reliability and
the possibility of employing it even online (Aricò, Borghini, Di Flumeri, Colosimo,
Bonelli, et al., 2016). In this second analysis, because of the absence of these
restrictions, we preferred to employ more complex artefact rejection algorithms and to enlarge the feature domain: all the EEG channels throughout the scalp were considered while extracting the features. At first, the raw EEG data were cleaned,
i.e., the artefacts were removed using ARTE (Automated aRTifacts handling in
EEG) (Barua et al., 2018) and subsequently, 45 features were extracted from power
spectral density values. The IAF value was determined as the peak of the general
alpha rhythm frequency (8–12 Hz). Subsequently, the average frequency of the theta
band [IAF − 6, IAF − 2], the alpha band [IAF − 2, IAF + 2] and the beta band
[IAF + 2, IAF + 18], over all the EEG channels were calculated. Table A2.1 shows the mapping between the features and the frequency rhythms. On the other hand, the vehicular signal was resampled to the sampling frequency of the EEG signals in order to synchronize and generate an equal number of data points to analyze. The steps of the
process are as follows: the vehicular signal at 10 Hz was at first upsampled by 256.
After that, a zero-phase low-pass finite impulse response (FIR) filter was applied and
then the signal was downsampled by 10. As a result, the resulting sample rate became
256 Hz, i.e., 256/10 times the original sample rate 10 Hz. The vehicular feature set
contains the values for velocity, acceleration, lateral and longitudinal acceleration
signals. Finally, values of all the features gathered from vehicular and EEG signals
were normalized with min–max feature scaling within the range 0 to 1, in order to prevent the ML algorithms from picking up unimportant characteristics of the data due to differences in the value ranges of different features.
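The resampling and scaling steps can be expressed compactly with SciPy's polyphase resampler, which performs an up-by-256, FIR low-pass, down-by-10 chain in one call, followed by min–max scaling. This is a sketch under the assumption that resample_poly's default FIR design is an acceptable stand-in for the filter actually used.

```python
import numpy as np
from scipy.signal import resample_poly

def resample_vehicular(signal_10hz):
    """Resample a 10 Hz vehicular signal to 256 Hz (upsample by 256, FIR low-pass, downsample by 10)."""
    return resample_poly(signal_10hz, up=256, down=10)

def min_max_scale(x):
    """Scale values to the range [0, 1] column-wise."""
    x = np.asarray(x, dtype=float)
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Placeholder: one minute of velocity data recorded at 10 Hz.
velocity_10hz = np.random.rand(600)
velocity_256hz = resample_vehicular(velocity_10hz)   # now 256/10 times as many samples
velocity_scaled = min_max_scale(velocity_256hz)
```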
Table A2.1: Mapping among different EEG channels, three significant frequency
rhythms and identifications (ID) of features. Each row represents the IDs of the features
extracted from specific frequency rhythms from the EEG channels mentioned in the
table head. © 2020 by Islam et al. (CC BY 4.0).

Rhythms     FPz  Fz  Pz  POz  Oz  AF3  AF4  F3  F4  P3  P4  P5  P6  O1  O2
theta (θ)     1   4   7   10  13   16   19  22  25  28  31  34  37  40  43
alpha (α)     2   5   8   11  14   17   20  23  26  29  32  35  38  41  44
beta (β)      3   6   9   12  15   18   21  24  27  30  33  36  39  42  45

Considering all of the available vehicular features and the calculated features
from EEG signal, MI values were calculated using Equation A2.7. The associated
MI values, illustrated in Figure A2.5, demonstrate the shared knowledge between vehicular data and EEG data. Although the MI values are not large, the two signal types do share some information, which motivates the use of MI values in the further classification and quantification of MWL in this work. Finally, an MI-based
feature set was constructed using Equation A2.8. Table A2.2 represents the number
of features from different feature sets which were considered in further stages of this
study. Here, the prime concern of the study is to investigate the performance of
MI-based features in MWL assessment and EEG features are used as an established
objective measure reference.

Figure A2.5: Calculated MI values between EEG and vehicular signal. The columns
of the matrix correspond to 45 features extracted from EEG signals, and the rows
correspond to four vehicular features: Velocity (Velo), Acceleration (Acce), Lateral
Acceleration (LatA) and Longitudinal Acceleration (LonA). The colour bar below
illustrates the range of values for each pair of EEG and vehicular features, where dark
blue on the left corresponds to low mutual information and gradually higher mutual
information values towards the right are represented by yellow. © 2020 by Islam et al.
(CC BY 4.0).


Table A2.2: List of different feature sets and corresponding number of features used
for validating the proposed methodology. © 2020 by Islam et al. (CC BY 4.0).

Feature Set   Number of Features
EEG-based     45
MI-based      4

A2.3.4 Prediction and Classification Models


To evaluate the MWL of drivers from the features developed with the MI between EEG and vehicular features, and to compare their performance with that of a solely EEG-based feature set, ML algorithms of different natures from a functional point of view were trained. During the prediction task, expert-defined MWL scores (Section A2.3.2.2)
were used as true predictions to train the regression models. On the other hand,
for the classification tasks, two sets of binary classes were considered. In terms
of MWL classification, data instances were labeled as High and Low following the
factors, “ROAD” and “HOUR” described in Section A2.3.1. To examine the use of the
extracted MI-based features in classification tasks other than MWL, another binary
classification task was performed assuming the two events; Car and Pedestrian, which
were introduced in Lap 3 during the experiment as true labels. The ML algorithms,
which were used in different prediction and classification tasks, are described briefly
below.
Regression is the simplest supervised ML model that estimates the relationship
between an independent and a dependent variable with statistical analyses
(Freedman, 2009). Generally, Linear Regression (LnR) and Logistic Regression (LgR)
are deployed for predicting continuous and binary categorical values, respectively,
which aligns perfectly with this study. For both regression and classification tasks,
normalized data were used. Moreover, for classification, LgR was performed with
balanced class weights and L2 regularization.
Multilayer Perceptron (MLP) (Hastie et al., 2009) is a subclass of Artificial Neural
Network (ANN) with at least three layers of nodes – an input layer, hidden layer and
output layer. Here, MLPs were trained for both classification and regression tasks
with three hidden layers of 32, 16 and 4 nodes, respectively, Rectified Linear Unit
(ReLU) activation, Adam optimizer and batch size 128.
Random Forest (RF) is an ensemble method, which builds a collection of
randomized decision-trees developed from bootstrapped data points and predicts
on the basis of majority voting from all the trees for classification tasks (Breiman,
2001) whereas for regression tasks, it takes the average of prediction. In addition
to that, RF operates with an underlying feature selection method which removes
non-important features for prediction tasks automatically. RF was implemented
using bootstrapping as the ensemble method.
The working principle of Support Vector Machine (SVM) concentrates mostly on
finding the hyper-plane, which simultaneously minimizes the empirical classification
error and maximizes the geometric margins in the classification tasks (Guyon et
al., 2002). SVM transforms the true data points from the input space to high
dimensional space that facilitates the classification task by determining a decision
boundary. For prediction or regression tasks, the decision boundary is used to predict the continuous value or target value. SVM-based regression and classification models
have a very good generalization capability on multidimensional data and dynamic
classification/prediction scheme, which makes them appropriate for the concerned
tasks. Moreover, literature shows deliberate use of SVM in the domain of EEG
signal analysis and MWL assessment (Saccá et al., 2018; Saha et al., 2018; Wei et
al., 2018). In this study, for all tasks, the SVM was configured with Radial Basis
Function (RBF) kernel with degree 3. By trial and error, the final regularization
parameter C was set to 1.0 and epsilon to 0.2 as the model parameters.
The trained ML models were further deployed in performing different tasks to
evaluate the MI-based features. The model parameters used to train different models
for respective tasks are summarised in Table A2.3.

Table A2.3: Parameters used in building different models for prediction and
classification tasks. © 2020 by Islam et al. (CC BY 4.0).

ML Models                       Parameter Details            Task
Linear Regression (LnR)         Intercept fit: True          Prediction
                                Normalize: True
Logistic Regression (LgR)       Intercept fit: True          Classification
                                Normalize: True
                                Class weight: Balanced
                                Regularization: L2
Multilayer Perceptron (MLP)     Hidden layers: 32, 16, 4     Prediction & Classification
                                Activation: ReLU
                                Optimizer: Adam
                                Batch size: 128
Random Forest (RF)              Estimators: 100              Prediction & Classification
                                Bootstrap: True
                                Maximum depth: 5
Support Vector Machine (SVM)    Kernel: RBF                  Prediction & Classification
                                Degree: 3
                                C: 1.0
                                Epsilon: 0.2
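For reference, the parameters in Table A2.3 map roughly onto scikit-learn estimators as sketched below. Exact arguments such as solver defaults are assumptions, and the "Normalize: True" option in the table is handled here by assuming the inputs are min–max scaled beforehand, since recent scikit-learn versions no longer accept a normalize argument.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import SVR, SVC

# Regressors for MWL score prediction (inputs assumed already min-max normalized).
regressors = {
    "LnR": LinearRegression(fit_intercept=True),
    "MLP": MLPRegressor(hidden_layer_sizes=(32, 16, 4), activation="relu",
                        solver="adam", batch_size=128),
    "RF": RandomForestRegressor(n_estimators=100, bootstrap=True, max_depth=5),
    "SVM": SVR(kernel="rbf", degree=3, C=1.0, epsilon=0.2),
}

# Classifiers for MWL and event classification.
classifiers = {
    "LgR": LogisticRegression(fit_intercept=True, class_weight="balanced", penalty="l2"),
    "MLP": MLPClassifier(hidden_layer_sizes=(32, 16, 4), activation="relu",
                         solver="adam", batch_size=128),
    "RF": RandomForestClassifier(n_estimators=100, bootstrap=True, max_depth=5),
    "SVM": SVC(kernel="rbf", degree=3, C=1.0, probability=True),
}
```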

The evaluation of the features extracted through the proposed approach was conducted in several steps: (1) predicting the MWL score, (2) classifying MWL and (3) classifying events. For the prediction task, the evaluation was performed using 10-fold Cross Validation (CV) and Leave-One-Out (LOO) subject validation. In the process of 10-fold CV, the whole dataset was divided into 10 equal sets. After that, 10 iterations of training and testing of the aforementioned ML models were performed, considering each of the divided sets as the test set and the remaining nine sets as the training set. So, the ratio between training and validation was 90% to 10%. The experiment was repeated 10 times and the average results are presented in the manuscript. On the other hand, in LOO-subject validation, there were 9 (the number of subjects) iterations. In each iteration, the training set consisted of 8 participants and the left-out subject's data was taken as the test set. In both validation approaches, the average split between training and testing was approximately 90:10.
In the case of the classification task, 10% of all data points were selected through stratified sampling as a holdout test set. The rest of the data were further used for
training and validating the models using the two described validation methods with
a view to flag problems like overfitting or selection bias.
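A sketch of the validation protocol follows: a stratified 10% holdout for the classification tasks, then 10-fold CV and leave-one-subject-out splits on the remainder. The per-sample subject identifier used for the group-wise split, and the use of accuracy as the CV score, are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, LeaveOneGroupOut, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: MI-based features, binary labels and subject ids for 9 drivers.
X = np.random.rand(1800, 4)
y = np.random.randint(0, 2, size=1800)
subjects = np.random.randint(0, 9, size=1800)

# 10% stratified holdout test set for the classification tasks.
X_train, X_test, y_train, y_test, subj_train, _ = train_test_split(
    X, y, subjects, test_size=0.10, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, max_depth=5)

# 10-fold cross validation on the remaining data.
cv_scores = cross_val_score(model, X_train, y_train,
                            cv=KFold(n_splits=10, shuffle=True, random_state=0))

# Leave-one-subject-out validation: each driver becomes the test set once.
loo_scores = cross_val_score(model, X_train, y_train,
                             groups=subj_train, cv=LeaveOneGroupOut())
print(cv_scores.mean(), loo_scores.mean())
```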
The tasks of implementing the proposed methodology and presenting the results were done using Python (van Rossum, 1995) and R (R Core Team, 2013)
environments. Python libraries NumPy (Travis, 2015) and Pandas (McKinney,
2010) were invoked for preparing the data. ML models were trained, validated and
tested using the Scikit Learn (Pedregosa et al., 2011) library for Python. The plots
and graphs were drawn utilizing different methods of Matplotlib (Hunter, 2007).
Statistical tests were conducted mostly using methods from SciPy (Virtanen et al.,
2020) library for Python and pROC (Robin et al., 2011) package for R.

A2.4 Results

The outcome of the performed study is presented from the viewpoint of two different
tasks: prediction and classification. In the process, the developed prediction models
were evaluated using Mean Absolute Error (MAE) and Mean Squared Error (MSE). The evaluation of the developed MWL and event classifiers was done in terms of confusion matrices, Receiver Operating Characteristic (ROC) curves, accuracy, sensitivity and specificity. In addition to the mentioned performance measures, balanced accuracy was also measured, since both classification tasks in this study were binary and, due to the division of epochs from the signal recordings and the duration of driving, the number of instances representing each class varied to some extent.
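The reported metrics can all be derived from a confusion matrix, as the short sketch below shows with scikit-learn and placeholder predictions; balanced accuracy and ROC/AUC are available directly, while sensitivity, specificity and precision follow from the matrix entries.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, balanced_accuracy_score,
                             roc_auc_score, accuracy_score)

# Placeholder labels and predictions for a binary task (1 = positive class).
y_true = np.random.randint(0, 2, size=200)
y_pred = np.random.randint(0, 2, size=200)
y_score = np.random.rand(200)             # e.g., predicted probabilities for ROC/AUC

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)              # recall of the positive class
specificity = tn / (tn + fp)
precision = tp / (tp + fp)

print(accuracy_score(y_true, y_pred),
      balanced_accuracy_score(y_true, y_pred),
      roc_auc_score(y_true, y_score),
      sensitivity, specificity, precision)
```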

A2.4.1 Quantification of Drivers’ Mental Workload


Four different prediction models, LnR, MLP, RF and SVM, were trained with expert-defined MWL scores against EEG-based and MI-based features. The performance of the models was validated with the 10-fold CV approach. Figures A2.6 and A2.7 illustrate the MAE values for the 10 folds of validation sets in predicting MWL scores from the EEG-based and MI-based feature sets. An overview of the prediction scores of each model on the two different feature sets from the performed CV is provided in Table A2.4.

A2.4.2 Drivers’ Mental Workload and Event Classification


Primarily, the MI-based feature set was tested in MWL classification against the
EEG-based feature set with the respective models described in Table A2.3. For MWL
classification, the Low MWL was considered a positive class and High MWL was
considered a negative class. In addition to MWL classification, the event classification
tasks were performed to establish the use of MI-based features in other classification
tasks, which was inspired by the result obtained in MWL classification. In event classification, Car and Pedestrian events were defined as the positive and negative classes, respectively, to measure the performance. For both classification tasks, 10-fold CV and LOO-subject CV were used to train the models on the different feature sets. The models used in MWL classification were also trained with the event labels, keeping the model parameters unchanged, with a view to conducting a comparative assessment.



Figure A2.6: The 10-fold Cross Validation (CV) score in terms of Mean Absolute
Error (MAE) for regression models: (a) Linear Regression (LnR) and (b) Multilayer
Perceptron (MLP), where the expert derived MWL scores were considered as true
values. For each of the models, two different sets of features were used (Table A2.2).
© 2020 by Islam et al. (CC BY 4.0).


Figure A2.7: The 10-fold CV score in terms of MAE for regression models: (a)
Random Forest (RF) and (b) Support Vector Machine (SVM), where the expert derived
MWL scores were considered as true values. For each of the models, two different sets
of features were used (Table A2.2). © 2020 by Islam et al. (CC BY 4.0).

In order to evaluate the classification performance of the aforementioned classifiers, a one-sided Wilcoxon signed-rank test (Wilcoxon, 1992) was performed. For a single classifier, the two sets of classification performance measures obtained with the MI-based and the EEG-based features were considered, and the test was conducted with the null hypothesis, H0: there is no difference in the average performance measures of a classifier when trained with MI-based and EEG-based features, and the alternate hypothesis, H1: the average performance measures of the classifier trained with MI-based features are higher than those trained with EEG-based features. The test
hypotheses are mathematically outlined in the expressions below.

$$H_0 : \mu_{MI} = \mu_{EEG} \qquad (A2.9)$$

$$H_1 : \mu_{MI} > \mu_{EEG} \qquad (A2.10)$$
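The one-sided test of Equations (A2.9)–(A2.10) can be run with SciPy by pairing, fold by fold, the performance measures obtained with the two feature sets; the arrays below are placeholders for a single classifier's 10-fold scores.

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder 10-fold performance measures for one classifier.
scores_mi = np.array([0.83, 0.81, 0.85, 0.84, 0.82, 0.86, 0.80, 0.83, 0.84, 0.85])
scores_eeg = np.array([0.80, 0.82, 0.83, 0.81, 0.80, 0.84, 0.79, 0.82, 0.81, 0.83])

# One-sided test: H1 states that the MI-based scores are higher on average.
stat, p_value = wilcoxon(scores_mi, scores_eeg, alternative="greater")
print(f"t = {stat}, p = {p_value:.3f}")
```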


Table A2.4: The 10-fold CV summary in terms of Mean Absolute Error and
Mean Squared Error for predicting MWL score using EEG and Mutual Information
(MI)-based features. © 2020 by Islam et al. (CC BY 4.0).

                            MAE                               MSE
Model   Features            Minimum   Maximum   Average       Minimum   Maximum   Average
LnR     EEG-based           0.11      0.22      0.16          0.02      0.07      0.04
        MI-based            0.09      0.23      0.16          0.02      0.07      0.04
MLP     EEG-based           0.09      0.22      0.17          0.02      0.07      0.04
        MI-based            0.10      0.22      0.16          0.02      0.06      0.04
RF      EEG-based           0.11      0.22      0.16          0.02      0.07      0.04
        MI-based            0.10      0.22      0.16          0.02      0.07      0.04
SVM     EEG-based           0.12      0.23      0.17          0.03      0.07      0.04
        MI-based            0.11      0.21      0.17          0.02      0.06      0.04

The result is summarised in Table A2.5, where it can be observed that, while
classifying MWL, only SVM achieved significantly higher performance while trained
with the MI-based features. On the other hand, all the classifiers performed better
while trained with MI-based features than EEG-based features in classifying events.

Table A2.5: Summary of one-sided Wilcoxon signed-rank tests (Wilcoxon, 1992) on the average performance in 10-fold CV of classification tasks by different classifiers trained with MI and EEG based features. The significant values, i.e., p < 0.05, are marked with asterisks (*). © 2020 by Islam et al. (CC BY 4.0).

                          Classifiers
Tasks                     LgR             MLP             RF              SVM
                          t      p        t      p        t      p        t      p
MWL Classification        0.0    0.994    9.0    0.896    0.0    0.994    36.0   0.006*
Event Classification      32.0   0.025*   36.0   0.006*   33.0   0.018*   36.0   0.006*

ROC curves, associated with Area Under the Curve (AUC) values, for both classification tasks are illustrated in Figure A2.8. The ROC curves were drawn for the holdout test set. In both tasks, from an overall perspective, the RF classifier outperformed the other classifiers with both feature sets in terms of AUC values. Specifically, in MWL classification the AUC was higher when using the EEG-based feature set, whereas in event classification the MI-based feature set produced the higher AUC value.
In addition to the calculated AUC values from the different performance metrics, the 95% Confidence Interval (CI) of the true AUC, Z and p values were extracted from DeLong's test (DeLong et al., 1988) for comparing AUC values. To conduct the test, the null hypothesis was set as H0 above (no difference between the AUC values obtained with the two feature sets) and the alternative hypothesis, H1: the values of AUC for classifiers trained on MI-based features are higher than the values of AUC for classifiers trained on EEG-based features. Table A2.6 presents the results of DeLong's test, which are similar to the results obtained from the one-sided Wilcoxon signed-rank test outlined in Table A2.5 in terms of rejecting the null hypothesis H0 at significance level 0.05.

109
XAI for Enhancing Transparency in DSS

(a) (b)

Figure A2.8: Receiver Operating Characteristic (ROC) curves for the best two
classifier models among Logistic Regression (LgR), MLP, SVM and RF. The classifiers
were deployed in two different binary classification tasks: (a) Low or High MWL and
(b) type of events – Car or Pedestrian. For each of the tasks, all the classifier models
were trained using 10-fold cross validation approach. © 2020 by Islam et al. (CC BY
4.0).

Table A2.6: Summary of DeLong’s test (DeLong et al., 1988) to compare Area
Under the Curve (AUC) values at significance level 0.05 (5.00 × 10−2 ). The values
were summarised for LgR, MLP, RF and SVM classifiers in different classification tasks
on the holdout test set. The significant values, i.e., p < 0.05, are marked with (*). ©
2020 by Islam et al. (CC BY 4.0).

                        MWL Classification                               Event Classification
Model   Features        AUC    95% CI      Z        p                    AUC    95% CI      Z       p
LgR     MI-based        0.70   0.68–0.73   -1.727   9.58 × 10⁻¹          0.76   0.73–0.80   4.005   3.10 × 10⁻⁵ *
        EEG-based       0.73   0.71–0.75                                 0.65   0.61–0.69
MLP     MI-based        0.86   0.84–0.87   -0.212   5.84 × 10⁻¹          0.90   0.88–0.92   7.606   1.42 × 10⁻¹⁴ *
        EEG-based       0.86   0.84–0.88                                 0.73   0.69–0.77
RF      MI-based        0.92   0.90–0.93   -5.540   1.00 × 10⁰           0.98   0.98–0.99   4.060   2.44 × 10⁻⁵ *
        EEG-based       0.96   0.95–0.97                                 0.94   0.93–0.96
SVM     MI-based        0.82   0.80–0.84   8.715    2.2 × 10⁻¹⁶ *        0.87   0.85–0.90   3.096   9.84 × 10⁻⁴ *
        EEG-based       0.69   0.67–0.72                                 0.80   0.77–0.84

The test classification report for MWL classification is presented in Table A2.7. In addition to that, Table A2.8 provides the classification report for event classification on the holdout test set, which demonstrates improvements in classification accuracy when using MI-based features. To assess the solitary performance of the classifiers trained with MI-based features, the maximum accuracy achieved in the different CV approaches over all the data splits was investigated. Figure A2.9 illustrates bar charts developed from the maximum accuracy achieved by the different classifiers in classifying MWL and events with MI-based features. It can be observed that, in 10-fold CV, the highest accuracy was 92.15%, from the RF classifier, whereas in event classification SVM achieved 91.14%, which is the highest among all classifiers when considering LOO-subject CV.

110
Paper A2

Table A2.7: Performance summary of classifying Low and High MWL with LgR,
MLP, SVM and RF classifier models using EEG and MI-based feature on the holdout
test set. In this task, the total number of observations was 1710, where low MWL was
considered as the positive class. The number of observations with positive and negative
class were 917 and 793, respectively. The highest accuracies obtained by using different
feature sets are marked with (*). © 2020 by Islam et al. (CC BY 4.0).

                      Using EEG-Based Features          Using MI-Based Features
Criteria              LgR    MLP    SVM    RF           LgR    MLP    SVM    RF
True Positive         736    776    342    864          688    715    576    783
False Negative        181    141    575    53           229    202    341    134
False Positive        362    293    138    157          410    230    146    175
True Negative         431    500    655    636          383    563    647    618
Sensitivity           0.80   0.85   0.37   0.94         0.75   0.78   0.63   0.85
Specificity           0.54   0.63   0.83   0.80         0.48   0.71   0.81   0.78
Precision             0.67   0.73   0.71   0.85         0.63   0.76   0.80   0.82
Recall                0.80   0.85   0.37   0.94         0.75   0.78   0.63   0.85
F1 score              0.73   0.78   0.50   0.89         0.68   0.77   0.70   0.84
Accuracy              0.68   0.75   0.58   0.88*        0.63   0.75   0.72   0.82*
Balanced Accuracy     0.67   0.74   0.60   0.87*        0.62   0.74   0.72   0.82*


Figure A2.9: Maximum balanced accuracy in different CV method for MWL and
event classification using MI-based features by different classifier models: (a) 10-fold
CV and (b) Leave-One-Out (LOO)-subject CV. © 2020 by Islam et al. (CC BY 4.0).

A2.5 Discussion

An increase in secondary tasks, e.g., reaching for the mobile phone, interacting with the mobile phone (touching the screen, dialing and texting), talking, reading the screen, glancing at the phone momentarily and talking or listening to a hands-free device, together with the primary task of driving, causes increased MWL. According to state-of-the-art (SotA) approaches to measuring MWL, Electroencephalography (EEG) has been proven to be a good parameter and is widely used in research (Begum & Barua, 2013; Aricò, Borghini, Di Flumeri, Colosimo, Pozzi, et al., 2016; Aricò, Borghini, Di Flumeri, Sciaraffa, et al., 2017), although it is not feasible enough in terms of data acquisition, processing and decision making while


Table A2.8: Performance summary of classifying Car and Pedestrian events with LgR,
MLP, SVM and RF classifier models using EEG and MI-based feature on the holdout
test set among 738 observations where events due to pedestrian were considered as
positive class. The number of observations with positive and negative class were 241
and 497 respectively. The highest accuracies obtained by using different feature sets
are marked with (*). © 2020 by Islam et al. (CC BY 4.0).

                      Using EEG-Based Features          Using MI-Based Features
Criteria              LgR    MLP    SVM    RF           LgR    MLP    SVM    RF
True Positive         17     83     186    147          75     140    207    209
False Negative        224    158    55     94           166    101    34     32
False Positive        14     62     164    5            55     27     128    12
True Negative         483    435    333    492          442    470    369    485
Sensitivity           0.07   0.34   0.77   0.61         0.31   0.58   0.86   0.87
Specificity           0.97   0.88   0.67   0.99         0.89   0.95   0.74   0.98
Precision             0.55   0.57   0.53   0.97         0.58   0.84   0.62   0.95
Recall                0.07   0.34   0.77   0.61         0.31   0.58   0.86   0.87
F1 score              0.13   0.43   0.63   0.75         0.40   0.69   0.72   0.90
Accuracy              0.68   0.70   0.70   0.87*        0.70   0.83   0.78   0.94*
Balanced Accuracy     0.52   0.61   0.72   0.80*        0.60   0.76   0.80   0.92*

driving a car in a naturalistic environment. So, the aim of this study was to perform research and development to identify a methodology for constructing a novel mutual information-based feature set from the fusion of electroencephalography and vehicular signals, and to deploy it in evaluating drivers' mental workload. In this study, EEG and vehicular signals were recorded through a driving experiment in real scenarios that varied in two factors, "HOUR" and "ROAD" (Di Flumeri et al., 2018). Here, two different events were also introduced to investigate their effects on drivers' MWL. Since the experiment was conducted in a real environment, other road users might or might not be present. The introduced events made it possible to analyze, uniformly for all participants, the effect of specific road users other than the regular traffic on the road. According to the initial data analysis at the group level, it
was observed that different situations and road users affect the MWL of drivers and
their vehicle handling. The results from the observation (Section A2.3.2) confirmed
the experimental hypothesis, i.e., “the driving task in terms of road complexity as well
as events induced differences in driving behaviours and drivers’ experienced MWL”.
Statistical hypothesis tests were conducted on average driving velocity and drivers’
MWL and significant (p < 0.05) differences were observed. The tests are described in
details in Section A2.3.2.3. In addition to that, several comparative plots were drawn
to assess the effects visually, which are illustrated in Figures A2.2 and A2.3. In short,
the comparisons pointed out that MWL and vehicle handling both change when the road condition or the events on the road are altered. However, the effects of changes in events on MWL and driving behaviours are stronger than those of changes in road condition. These findings, together with the prior literature review on the advantages and disadvantages of EEG features as a measure of MWL, formed the basis of the further analysis and strengthened the motivation to utilize mostly vehicular features, in association with EEG, for evaluating the MWL of drivers.


To combine EEG features and vehicular features, the correlation between them was calculated, and the assessed values of the correlation coefficients were negligible. On the contrary, prior investigations on the average driving velocity and MWL (Section A2.3.2) showed changes when the driving environments were varied (Section A2.3.2.3). Thus, the motivation for exploiting the MI between the EEG and vehicular signals developed entirely from the low correlation coefficients and, conversely, the noticeable similarity in the changes of MWL and the vehicular signals. Furthermore, the novel concept of utilizing MI was proposed. Here, the reference values of MI between two continuous variables lie in the range [0, ∞) (Cover & Thomas, 2006). The MI was calculated based on the relation between EEG and vehicular features, where the average value was found to be approximately 8.5, which is very low but not null. The data for
this study were recorded from a specific experiment from some specific participants,
which represented their brain activity and vehicle handling together for the respective
population distribution. However, the low MI values could be due to the small number of vehicular features. Despite the fact that the MI values were low, in MWL evaluation the proposed features in some cases outperformed established objective measures. If there were more vehicular features, there could be a wider variety of ways to capture how the participants handled the vehicle. As a result, systems might attain higher performance in MWL evaluation. Experiments are underway to
increase the number of vehicular features by adding other parameters from inertial
measurement unit (IMU) devices.
One of the objectives of this study was to quantify MWL of drivers from the
proposed feature set. To test the performance of using the proposed feature set,
four different ML regression methods were investigated: LnR, MLP, RF and SVM,
considering the MWL score extracted by expert-defined methods as true values. For
the regression, the true values of the MWL score fall in the range [0, 1], where 0 represents no MWL and 1 represents the highest MWL from an individual point of view (Di Flumeri et al., 2018). For each of the regression models, the average MAE and MSE were
around 0.16 and 0.04 (Table A2.4). Again, these errors were compared with the
results of regression models trained using EEG-based features. In comparison, using
different features produced approximately similar errors while predicting MWL scores
of drivers and the comparison of MAE in 10-fold CV is illustrated in Figures A2.6
and A2.7. From the visualizations it was observed that the difference in average error
from RF regression model was lowest among the considered models, which might be
an effect of functional differences in terms of ensemble technique (Breiman, 2001), as
described in Section A2.3.4.
In addition to MWL quantification, the performances of MWL and event
classification using MI-based features were also examined against EEG-based
features. Classifier-wise average performance on MWL and event classification was
tested using a one-sided Wilcoxon signed-rank test (Wilcoxon, 1992). Unlike MWL
quantification, the average performance of SVM classifier with MI-based feature set
was significantly higher in both classification tasks (Table A2.5). According to Saha et al. (2018), SVM is the most widely used algorithm for classification tasks on the basis of features extracted from EEG signals. The initial finding of this study aligns with that statement. On the other hand, the other three classifiers, LgR, MLP and RF, performed better in event classification with MI-based features. To assess the correct binary classification capacity, AUC-ROC curves were plotted, where RF


outperformed all other classifiers in terms of AUC values. For simplicity, Figure A2.8 illustrates the AUC-ROC curves only for the RF and MLP classifiers, which achieved the highest AUC values when tested on the holdout set. In addition to that, DeLong's
test (DeLong et al., 1988) of comparing AUC values demonstrated similar significant
differences as the one-sided Wilcoxon signed-rank test (Wilcoxon, 1992) showed. It
can be observed from Table A2.6 that all the calculated AUC values are within the
95% confidence interval for true AUC values. Moreover, the values of Z and p are
consistent, i.e., in case of significant values of p, we accept the alternate hypothesis
that the values of AUC for classifiers trained on MI-based features are higher than
the values of AUC for classifiers trained on EEG-based features, and the signs of the test statistic Z express the same relation between the AUC values. However, according
to the performance metrics, in MWL classification, RF achieved the highest AUC
value of 0.92 with accuracy 82% with MI-based features and the AUC value was 0.96
(Figure A2.8a) with accuracy 88% (Table A2.7) with EEG-based features. Again,
the performance on event classification (Car or Pedestrian) was evaluated with the
same ML algorithms considering both the feature sets. In event classification result,
RF with MI-based features with AUC value 0.98 outperformed EEG-based features
with AUC value 0.95 (Figure A2.8b). The accuracy on the test set in the classifying
event was found to be 94% by the RF classifier by using MI-based features, which is
the best performance achieved in this whole study (Table A2.8).

A2.6 Conclusion

In conclusion, the present study was carried out through a driving experiment in
a real environment, which was aimed at investigating the utilization of vehicular
signals in the evaluation of drivers' MWL, with a view to reducing the effort of using EEG signals and eliminating the task of managing redundant EEG signal recording apparatuses. This paper presents an MI-based feature set construction methodology
with the combination of EEG and vehicular signals. The feature set was deployed to
evaluate drivers’ MWL in terms of score and labels. Several ML models were trained
to perform the evaluation tasks. The values of MAE in MWL score prediction showed that there was approximately no difference between the scores predicted using MI-based features and those predicted using EEG features. On the other hand, in the classification tasks, it was observed that the RF classifier performed better than the other classifiers in labeling MWL and events in terms of the performance metrics of the ML models, but through statistical tests it was observed that only SVM performed significantly better with MI-based features than with EEG-based features. While classifying MWL, the highest accuracy observed was 88% with EEG-based features and 82% with MI-based features. Furthermore, MI-based features outperformed EEG-based features in classifying two specific events (a pedestrian crossing the road and a car entering the traffic flow), with an accuracy of 94%. Though the accuracy in MWL classification from the developed feature set was not equivalent to that of EEG features, the accuracy in event classification motivates a re-evaluation of the proposed fusion-based feature extraction methodology with a higher number of vehicular features in future studies.

Author Contributions. Conceptualization, M.R.I. and S.B. (Shaibal Barua); data curation, M.R.I., P.A., G.B. and G.D.F.; formal analysis, M.R.I.; investigation, M.R.I.; methodology, M.R.I. and G.D.F.; resources, M.R.I., S.B. (Shaibal Barua) and
G.D.F; software, M.R.I.; supervision, M.U.A. and S.B. (Shahina Begum); validation,
M.R.I., S.B. (Shaibal Barua) and G.D.F.; visualization, M.R.I.; writing – original
draft preparation, M.R.I. and G.D.F.; writing – review and editing, M.R.I., S.B.
(Shaibal Barua), M.U.A., S.B. (Shahina Begum) and G.D.F. All authors have read
and agreed to the published version of the manuscript.

Funding. This research was performed as a part of the project BrainSafeDrive, co-funded by the Vetenskapsrådet - The Swedish Research Council and the Ministero
dell’Istruzione dell’Università e della Ricerca della Repubblica Italiana under
Italy-Sweden Cooperation Program.

Conflicts of Interest. The authors declare no conflict of interest.

Abbreviations. The following abbreviations are used frequently in this manuscript:


EEG Electroencephalography
IAF Individual Alpha Frequency
LgR Logistic Regression
LnR Linear Regression
MI Mutual Information
ML Machine Learning
MLP Multilayer Perceptron
MWL Mental Workload
RF Random Forest
SVM Support Vector Machine

Bibliography

Ahmad, R. F., Malik, A. S., Kamel, N., Amin, H., Zafar, R., Qayyum, A., &
Reza, F. (2014). Discriminating the Different Human Brain States with
EEG Signals using Fractal Dimension- A Nonlinear Approach. 2014 IEEE
International Conference on Smart Instrumentation, Measurement and
Applications (ICSIMA), 1–5.
Almahasneh, H., Kamel, N., Walter, N., & Malik, A. S. (2015). EEG-based Brain
Functional Connectivity during Distracted Driving. 2015 IEEE International
Conference on Signal and Image Processing Applications (ICSIPA), 274–277.
Antonenko, P. D. (2007). The Effect of Leads on Cognitive Load and Learning
in a Conceptually Rich Hypertext Environment (PhD Thesis). Iowa State
University.
Aricò, P., Borghini, G., Di Flumeri, G., Colosimo, A., Pozzi, S., & Babiloni, F. (2016).
A Passive Brain–Computer Interface Application for the Mental Workload
Assessment on Professional Air Traffic Controllers during Realistic Air Traffic
Control Tasks. In D. Coyle (Ed.), Progress in Brain Research (pp. 295–328).
Elsevier.
Aricò, P., Borghini, G., Di Flumeri, G., & Babiloni, F. (2017). Method for
Estimating a Mental State, In particular a Workload, and Related Apparatus
(EP3143933A1).


Aricò, P., Borghini, G., Di Flumeri, G., Colosimo, A., Bonelli, S., Golfetti, A.,
Pozzi, S., Imbert, J.-P., Granger, G., Benhacene, R., & Babiloni, F. (2016).
Adaptive Automation Triggered by EEG-Based Mental Workload Index:
A Passive Brain-Computer Interface Application in Realistic Air Traffic
Control Environment. Frontiers in Human Neuroscience, 10, 539.
Aricò, P., Borghini, G., Di Flumeri, G., Sciaraffa, N., Colosimo, A., &
Babiloni, F. (2017). Passive BCI in Operational Environments: Insights,
Recent Advances, and Future Trends. IEEE Transactions on Biomedical
Engineering, 64 (7), 1431–1436.
Barua, S. (2019). Multivariate Data Analytics to Identify Driver’s Sleepiness,
Cognitive Load, and Stress (PhD Thesis). Mälardalen University.
Barua, S., Ahmed, M. U., Ahlstrom, C., Begum, S., & Funk, P. (2018). Automated
EEG Artifact Handling With Application in Driver Monitoring. IEEE
Journal of Biomedical and Health Informatics, 22 (5), 1350–1361.
Barua, S., Ahmed, M. U., & Begum, S. (2017). Classifying Drivers’ Cognitive Load
Using EEG Signals. In B. Blobel & W. Goossen (Eds.), Proceedings of the
14th International Conference on Wearable Micro and Nano Technologies for
Personalized Health (pHealth) (pp. 99–106). IOS Press.
Begum, S., & Barua, S. (2013). EEG Sensor Based Classification for Assessing
Psychological Stress. In B. Blobel, P. Pharow, & L. Parv (Eds.), Proceedings
of the 10th International Conference on Wearable Micro and Nano
Technologies for Personalized Health (pHealth) (pp. 83–88). IOS Press.
Begum, S., Barua, S., & Ahmed, M. U. (2017). In-Vehicle Stress Monitoring
Based on EEG Signal. International Journal of Engineering Research and
Applications, 07 (07), 55–71.
Borghini, G., Aricò, P., Di Flumeri, G., Cartocci, G., Colosimo, A., Bonelli, S.,
Golfetti, A., Imbert, J. P., Granger, G., Benhacene, R., Pozzi, S., &
Babiloni, F. (2017). EEG-Based Cognitive Control Behaviour Assessment:
An Ecological study with Professional Air Traffic Controllers. Scientific
Reports, 7 (1), 547.
Borghini, G., Aricò, P., Di Flumeri, G., Sciaraffa, N., Colosimo, A., Herrero, M.-T.,
Bezerianos, A., Thakor, N. V., & Babiloni, F. (2017). A New Perspective
for the Training Assessment: Machine Learning-Based Neurometric for
Augmented User’s Evaluation. Frontiers in Neuroscience, 11.
Borghini, G., Astolfi, L., Vecchiato, G., Mattia, D., & Babiloni, F. (2014). Measuring
Neurophysiological Signals in Aircraft Pilots and Car Drivers for the
Assessment of Mental Workload, Fatigue and Drowsiness. Neuroscience &
Biobehavioral Reviews, 44, 58–75.
Breiman, L. (2001). Random Forests. Machine Learning, 45 (1), 5–32.
Brookhuis, K. A., & de Waard, D. (2010). Monitoring Drivers’ Mental Workload
in Driving Simulators using Physiological Measures. Accident Analysis &
Prevention, 42 (3), 898–903.
Charles, R. L., & Nixon, J. (2019). Measuring Mental Workload using Physiological
Measures: A Systematic Review. Applied Ergonomics, 74, 221–232.
Corcoran, A. W., Alday, P. M., Schlesewsky, M., & Bornkessel-Schlesewsky, I. (2018).
Toward a Reliable, Automated Method of Individual Alpha Frequency (IAF)
Quantification. Psychophysiology, 55 (7), e13064.


Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.).
John Wiley & Sons, Inc.
da Silva, F. P. (2014). Mental Workload, Task Demand and Driving Performance:
What Relation? Procedia - Social and Behavioral Sciences, 162, 310–319.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the Areas
under Two or More Correlated Receiver Operating Characteristic Curves: A
Nonparametric Approach. Biometrics, 44 (3), 837–845.
Delorme, A., & Makeig, S. (2004). EEGLAB: An Open Source Toolbox for Analysis
of Single-trial EEG Dynamics including Independent Component Analysis.
Journal of Neuroscience Methods, 134 (1), 9–21.
Di Flumeri, G., Arico, P., Borghini, G., Colosimo, A., & Babiloni, F. (2016). A New
Regression-based Method for the Eye Blinks Artifacts Correction in the EEG
Signal, without using any EOG Channel. Annual International Conference of
the IEEE Engineering in Medicine and Biology Society. IEEE Engineering
in Medicine and Biology Society. Annual International Conference, 2016,
3187–3190.
Di Flumeri, G., Borghini, G., Aricò, P., Colosimo, A., Pozzi, S., Bonelli, S., Golfetti,
A., Kong, W., & Babiloni, F. (2015). On the Use of Cognitive Neurometric
Indexes in Aeronautic and Air Traffic Management Environments. In B.
Blankertz, G. Jacucci, L. Gamberini, A. Spagnolli, & J. Freeman (Eds.),
Symbiotic Interaction. Symbiotic 2015. Lecture Notes in Computer Science
(pp. 45–56). Springer International Publishing.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2018). EEG-Based
Mental Workload Neurometric to Evaluate the Impact of Different Traffic
and Road Conditions in Real Driving Settings. Frontiers in Human
Neuroscience, 12, 509.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2019). EEG-Based
Mental Workload Assessment During Real Driving: A Taxonomic Tool for
Neuroergonomics in Highly Automated Environments. Neuroergonomics,
121–126.
Di Flumeri, G., De Crescenzio, F., Berberian, B., Ohneiser, O., Kramer,
J., Aricò, P., Borghini, G., Babiloni, F., Bagassi, S., & Piastra, S.
(2019). Brain–Computer Interface-Based Adaptive Automation to Prevent
Out-Of-The-Loop Phenomenon in Air Traffic Controllers Dealing With
Highly Automated Systems. Frontiers in Human Neuroscience, 13.
Elul, R. (1969). Gaussian Behavior of the Electroencephalogram: Changes during
Performance of Mental Task. Science, 164 (3877), 328–331.
Fastenmeier, W., & Gstalter, H. (2007). Driving Task Analysis as a Tool in Traffic
Safety Research and Practice. Safety Science, 45 (9), 952–979.
Fisher, D. L., Rizzo, M., Caird, J., & Lee, J. D. (2011). Handbook of Driving
Simulation for Engineering, Medicine, and Psychology: An Overview. In
D. L. Fisher, M. Rizzo, J. Caird, & J. D. Lee (Eds.), Handbook of Driving
Simulation for Engineering, Medicine, and Psychology (1st ed., pp. 1–16).
CRC Press.
Freedman, D. (2009). Statistical Models: Theory and Practice. Cambridge University
Press.


Galante, F., Bracco, F., Chiorri, C., Pariota, L., Biggero, L., & Bifulco, G. N. (2018).
Validity of Mental Workload Measures in a Driving Simulation Environment.
Journal of Advanced Transportation, 2018, e5679151.
Geethanjali, P., Mohan, Y. K., & Sen, J. (2012). Time Domain Feature Extraction
and Classification of EEG Data for Brain Computer Interface. 2012
9th International Conference on Fuzzy Systems and Knowledge Discovery
(FSKD), 1136–1139.
Gevins, A., & Smith, M. E. (2003). Neurophysiological Measures of Cognitive
Workload during Human-Computer Interaction. Theoretical Issues in
Ergonomics Science, 4 (1-2), 113–131.
Gevins, A., Smith, M. E., Leong, H., McEvoy, L., Whitfield, S., Du, R., & Rush, G.
(1998). Monitoring Working Memory Load during Computer-Based Tasks
with EEG Pattern Recognition Methods. Human Factors, 40 (1), 79–91.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene Selection for Cancer
Classification using Support Vector Machines. Machine Learning, 46 (1),
389–422.
Guzik, P., & Malik, M. (2016). ECG by Mobile Technologies. Journal of
Electrocardiology, 49 (6), 894–901.
Harms, L. (1991). Variation in Drivers’ Cognitive Load. Effects of Driving through
Village Areas and Rural Junctions. Ergonomics, 34 (2), 151–160.
Hart, S. G., & Staveland, L. E. (1988). Development of NASA-TLX (Task Load
Index): Results of Empirical and Theoretical Research. In P. A. Hancock &
N. Meshkati (Eds.), Advances in Psychology (pp. 139–183). North-Holland.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical
Learning. Springer New York.
Hunter, J. D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science
& Engineering, 9 (3), 90–95.
Kar, S., Bhagat, M., & Routray, A. (2010). EEG Signal Analysis for the Assessment
and Quantification of Driver’s Fatigue. Transportation Research Part F:
Traffic Psychology and Behaviour, 13 (5), 297–306.
Kim, H., Yoon, D., Lee, S.-J., Kim, W., & Park, C. H. (2018). A Study on the
Cognitive Workload Characteristics according to the Driving Behavior in
the Urban Road. 2018 International Conference on Electronics, Information,
and Communication (ICEIC), 1–4.
Kirk, R. E. (2012). Experimental Design. In I. B. Weiner, J. Schinka, & W. F. Velicer
(Eds.), Handbook of Psychology (2nd ed.). Wiley.
Lei, S., & Roetting, M. (2011). Influence of Task Combination on EEG Spectrum
Modulation for Driver Workload Estimation. Human Factors, 53 (2),
168–179.
Li, X., Zhang, P., Song, D., Yu, G., Hou, Y., & Hu, B. (2015). EEG Based Emotion
Identification Using Unsupervised Deep Feature Learning. Proceedings of the
SIGIR2015 Workshop on Neuro-Physiological Methods in IR Research.
Manawadu, U. E., Kawano, T., Murata, S., Kamezaki, M., Muramatsu, J., & Sugano,
S. (2018). Multiclass Classification of Driver Perceived Workload Using Long
Short-Term Memory based Recurrent Neural Network. 2018 IEEE Intelligent
Vehicles Symposium (IV), 1–6.
McKinney, W. (2010). Data Structures for Statistical Computing in Python.
Proceedings of the 9th Python in Science Conference, 56–61.


Moustafa, K., Luz, S., & Longo, L. (2017). Assessment of Mental Workload:
A Comparison of Machine Learning Methods and Subjective Assessment
Techniques. In L. Longo & M. C. Leva (Eds.), Human Mental Workload:
Models and Applications. H-WORKLOAD 2017. Communications in
Computer and Information Science (pp. 30–50). Springer International
Publishing.
Paxion, J., Galy, E., & Berthelon, C. (2014). Mental Workload and Driving. Frontiers
in Psychology, 5.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011).
Scikit-learn: Machine Learning in Python. Journal of Machine Learning
Research, 12 (85), 2825–2830.
R Core Team. (2013). R: The R Project for Statistical Computing (tech. rep.). R
Foundation for Statistical Computing. Vienna, Austria.
Rahman, H., Ahmed, M. U., Barua, S., & Begum, S. (2020). Non-contact-based
Driver’s Cognitive Load Classification using Physiological and Vehicular
Parameters. Biomedical Signal Processing and Control, 55, 101634.
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller,
M. (2011). pROC: An Open-source Package for R and S+ to Analyze and
Compare ROC Curves. BMC Bioinformatics, 12 (1), 77.
Saccá, V., Campolo, M., Mirarchi, D., Gambardella, A., Veltri, P., & Morabito,
F. C. (2018). On the Classification of EEG Signal by Using an SVM
Based Algorithm. In A. Esposito, M. Faudez-Zanuy, F. C. Morabito, & E.
Pasero (Eds.), Multidisciplinary Approaches to Neural Computing. Smart
Innovation, Systems and Technologies (pp. 271–278). Springer International
Publishing.
Saha, A., Minz, V., Bonela, S., Sreeja, S. R., Chowdhury, R., & Samanta, D. (2018).
Classification of EEG Signals for Cognitive Load Estimation Using Deep
Learning Architectures. In U. S. Tiwary (Ed.), Intelligent Human Computer
Interaction. IHCI 2018. Lecture Notes in Computer Science (pp. 59–68).
Springer International Publishing.
Sakai, M. (2013). Kernel Nonnegative Matrix Factorization with Constraint
Increasing the Discriminability of Two Classes for the EEG Feature
Extraction. 2013 International Conference on Signal-Image Technology &
Internet-Based Systems (SITIS), 966–970.
Sam, D., Velanganni, C., & Evangelin, T. E. (2016). A Vehicle Control System using
a Time Synchronized Hybrid VANET to Reduce Road Accidents caused by
Human Error. Vehicular Communications, 6, 17–28.
Sherwani, F., Shanta, S., Ibrahim, B. S. K. K., & Huq, M. S. (2016). Wavelet based
Feature Extraction for Classification of Motor Imagery Signals. 2016 IEEE
EMBS Conference on Biomedical Engineering and Sciences (IECBES),
360–364.
Smith, M. E., McEvoy, L. K., & Gevins, A. (1999). Neurophysiological Indices of
Strategy Development and Skill Acquisition. Cognitive Brain Research, 7 (3),
389–404.


Solovey, E. T., Zec, M., Garcia Perez, E. A., Reimer, B., & Mehler, B. (2014).
Classifying Driver Workload using Physiological and Driving Performance
Data: Two Field Studies. Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems (CHI), 4057–4066.
Thomas, P., Morris, A., Talbot, R., & Fagerlind, H. (2013). Identifying the Causes of
Road Crashes in Europe. Annals of Advances in Automotive Medicine, 57,
13–22.
Travis, E. O. (2015). Guide to NumPy (2nd ed.). CreateSpace Independent
Publishing Platform.
van Rossum, G. (1995). Python Tutorial (tech. rep.). Centrum voor Wiskunde en
Informatica. Amsterdam, The Netherlands.
Verwey, W. B. (2000). On-line Driver Workload Estimation. Effects of Road Situation
and Age on Secondary Task Measures. Ergonomics, 43 (2), 187–209.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau,
D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt,
S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J.,
Jones, E., Kern, R., Larson, E., . . . van Mulbregt, P. (2020). SciPy 1.0:
Fundamental algorithms for scientific computing in Python. Nature Methods,
17 (3), 261–272.
Wei, Z., Wu, C., Wang, X., Supratak, A., Wang, P., & Guo, Y. (2018). Using Support
Vector Machine on EEG for Advertisement Impact Assessment. Frontiers in
Neuroscience, 12.
Wen, T., & Zhang, Z. (2018). Deep Convolution Neural Network and
Autoencoders-Based Unsupervised Feature Learning of EEG Signals. IEEE
Access, 6, 25399–25410.
Wickens, C. D., McCarley, J. S., Alexander, A. L., Thomas, L. C., Ambinder, M.,
& Zheng, S. (2008). Attention-Situation Awareness (A-SA) Model of Pilot
Error. In D. C. Foyle & B. L. Hooey (Eds.), Human Performance Modeling
in Aviation (pp. 213–239). CRC Press.
Wilcoxon, F. (1992). Individual Comparisons by Ranking Methods. In S. Kotz &
N. L. Johnson (Eds.), Breakthroughs in Statistics (pp. 196–202). Springer
New York.
World Medical Association. (2001). World Medical Association Declaration of
Helsinki: Ethical Principles for Medical Research Involving Human Subjects.
Bulletin of the World Health Organization, 79 (4), 373.
Yin, Z., & Zhang, J. (2016). Recognition of Cognitive Task Load levels using single
channel EEG and Stacked Denoising Autoencoder. Proceedings of the 35th
Chinese Control Conference (CCC), 3907–3912.

Paper B

A Systematic Review of Explainable Artificial Intelligence in terms of Different
Application Domains and Tasks

Islam, M. R., Ahmed, M. U., Barua, S. & Begum, S.


Paper B

A Systematic Review of Explainable Artificial Intelligence in Terms of Different
Application Domains and Tasks†

Abstract
Artificial intelligence (AI) and machine learning (ML) have recently
been radically improved and are now being employed in almost every
application domain to develop automated or semi-automated systems.
To facilitate greater human acceptability of these systems, explainable
artificial intelligence (XAI) has experienced significant growth over the
last couple of years with the development of highly accurate models but
with a paucity of explainability and interpretability. The literature shows
evidence from numerous studies on the philosophy and methodologies of XAI.
Nonetheless, there is an evident scarcity of secondary studies in connection
with the application domains and tasks, let alone review studies following
prescribed guidelines, that can enable researchers’ understanding of the
current trends in XAI, which could lead to future research for domain- and
application-specific method development. Therefore, this paper presents
a systematic literature review (SLR) on the recent developments of XAI
methods and evaluation metrics concerning different application domains
and tasks. This study considers 137 articles published in recent years and
identified through the prominent bibliographic databases. This systematic
synthesis of research articles resulted in several analytical findings: XAI
methods are mostly developed for safety-critical domains worldwide, deep
learning and ensemble models are being exploited more than other types
of AI/ML models, visual explanations are more acceptable to end-users
and robust evaluation metrics are being developed to assess the quality
of explanations. Research studies have been performed on the addition of
explanations to widely used AI/ML models for expert users. However, more
attention is required to generate explanations for general users from sensitive
domains such as finance and the judicial system.
Keywords: Explainable Artificial Intelligence · Explainability ·
Evaluation Metrics · Systematic Literature Review.

† © 2022 by the Authors (CC BY 4.0). Reprinted from Islam, M. R., Ahmed, M. U., Barua,
S., & Begum, S. (2022). A Systematic Review of Explainable Artificial Intelligence in Terms of
Different Application Domains and Tasks. Applied Sciences, 12 (3), 1353.

B.1 Introduction

With the recent developments of artificial intelligence (AI) and machine learning
(ML) algorithms, people from various application domains have shown increasing
interest in taking advantage of these algorithms. As a result, AI and ML are
being used today in many application domains. Different AI/ML algorithms are
being employed to complement humans’ decisions in various tasks from diverse
domains, such as education, construction, health care, news and entertainment,
travel and hospitality, logistics, manufacturing, law enforcement, and finance (Rai,
2020). While these algorithms are meant to help users in their daily tasks, they still
face acceptability issues. Users often remain doubtful about the proposed decisions.
In worse cases, users oppose the AI/ML model’s decision since their inference
mechanisms are mostly opaque, unintuitive, and incomprehensible to humans. For
example, today, deep learning (DL) models demonstrate convincing results with
improved accuracy compared to established algorithms. DL models’ outstanding
performances hide one major drawback, i.e., the underlying inference mechanism
remains unknown to a user. In other words, the DL models function as a black-box
(Guidotti, Monreale, Ruggieri, et al., 2019). In general, almost all the prevailing
expert systems built with AI/ML models do not provide additional information to
support the inference mechanism, which makes systems nontransparent. Thus, it has
become a sine qua non to investigate how the inference mechanism or the decisions
of AI/ML models can be made transparent to humans so that these intelligent
systems can become more acceptable to users from different application domains
(Loyola-Gonzalez, 2019).
Upon realising the need to explain AI/ML model-based intelligent systems, a few
researchers started exploring and proposing methods long ago. The bibliographic
databases contain the earliest published evidence on the association between expert
systems and the term explanation from the mid-eighties (Neches et al., 1985). Over
time, the concept evolved to be an immense growing research domain of explainable
artificial intelligence (XAI). However, researchers did not pay much attention to XAI
until 2017/2018, as can be seen from the trend of publications per year with
the keyword explainable artificial intelligence in titles or abstracts from different
bibliographic databases illustrated in Figure B.1a. The increased attention paid
by researchers towards XAI from all the domains utilising systems developed with
AI/ML models was driven by three major events. First of all, the “Explainable
AI (XAI) Program” was funded in early 2017 by the Defense Advanced Research
Projects Agency (DARPA) (Gunning & Aha, 2019). A couple of months later, in
mid-2017, the Chinese government released “The Development Plan for New
Generation of Artificial Intelligence” to encourage the high and strong
extensibility of AI (Xu et al., 2019). Last but not least, in mid-2018, the European



Figure B.1: Number of published articles (y-axis) on XAI made available through
four bibliographic databases in recent decades (x-axis). (a) Trend of the number of
publications from 1984 to 2020. (b) Specific number of publications from 2018 to
June 2021. The illustrated data were extracted on 01 July 2021 from four renowned
bibliographic databases. The asterisk (*) with 2021 refers to the partial data on the
number of publications on XAI until June. © 2022 by Islam et al. (CC BY 4.0).

Union granted its citizens a “Right to Explanation” when they are affected by
algorithmic decision making by publishing the “General Data Protection Regulation”
(GDPR) (Wachter et al., 2018). The impact of these events is prominent among
the researchers since the search results from the significant bibliographic databases
depict a rapidly increasing number of publications related to XAI during recent years
(Figure B.1b). The bibliographic databases that were considered to assess the number
of publications per year on XAI were found to be the main sources of the research
articles from the AI domain.


Figure B.2: Percentage of the selected articles on different XAI methods for different
application (a) domains and (b) tasks. © 2022 by Islam et al. (CC BY 4.0).

The continuously increasing momentum of publications in the domain of XAI is


producing an abundance of knowledge from various perspectives, e.g., philosophy,
taxonomy, and development. Unfortunately, this scattered plentiful knowledge
and the use of different closely related taxonomies interchangeably demand the


organisation and definition of boundaries through a systematic literature review


(SLR), as it contains a structured procedure for conducting the review with provisions
for assessing the outcome in terms of a predefined goal. Figure B.2 presents the
distribution of articles on XAI methods for various application domains and tasks.
From Figure B.2a, it can be seen that most XAI methods today are developed
to be domain agnostic. However, the most influential use of XAI is in the healthcare
domain, possibly because of the demand for explanations from the end-user
perspective. In many application domains, AI and ML methods are used in decision
support systems, and the need for XAI is correspondingly high for decision support tasks,
as can be seen in Figure B.2b. Although the number of publications is increasing,
some challenges have received little attention, for example, user-centric explanations
and the incorporation of domain knowledge into explanations. This article aimed to present the outcome of
an SLR on the current developments and trends in XAI for different application
domains by summarising the methods and evaluation metrics for explainable AI/ML
models. Moreover, the aim of this SLR includes identifying the specific domains
and applications in which XAI methods are exploited and that are to be further
investigated. To achieve the aim of this study, three major objectives are highlighted:

• To investigate and present the application domains and tasks for which various
XAI methods have been explored and exploited;
• To investigate and present the XAI methods, validation metrics and the type of
explanations that can be generated to increase the acceptability of the expert
systems to general users;
• To sort out the open issues and future research directions in terms of various
domains and application tasks from the methodological perspective of XAI.

The remainder of this article is arranged as follows: relevant concepts of XAI from
a technical point of view are presented in Section B.2, followed by a discussion on
prominent review studies previously conducted on XAI in Section B.3. Section B.4
contains the detailed workflow of this SLR, followed by the outcome of the performed
analyses in Section B.5. Finally, a discussion on the findings of this study and its
limitations and conclusions are presented in Sections B.6 and B.7, respectively.

B.2 Theoretical Background

This section concisely presents the theoretical aspects of XAI from a technical point
of view for a better understanding of the contents of this study. Notably,
the philosophy and taxonomy of XAI have been excluded from this manuscript
because they are out of the scope of this study. In general, the term explainability
refers to the interface between a decision maker and humans; this interface
is simultaneously comprehensible to humans and an accurate representation of the
decision maker (Guidotti, Monreale, Ruggieri, et al., 2019). Specifically, in XAI, the interface
between the models and the end-users is called explainability, through which an
end-user obtains clarification on the decisions that the AI/ML model provides them
with. Based on the literature, the concepts of XAI within different application
domains are categorised as stage, scope, input and output formats. This section


includes a discussion on the most relevant aspects that seem necessary to make XAI
efficiently and credibly work on different applications. Figure B.3 summarises the
prime concepts behind developing XAI applications which were adopted from the
recent review studies by Vilone and Longo (2020, 2021a).

Figure B.3: Overview of the different concepts on developing methodologies for XAI,
adapted from the review studies by Vilone and Longo (2020, 2021a). © 2022 by Islam
et al. (CC BY 4.0).

B.2.1 Stage of Explainability


The AI/ML models learn the fundamental characteristics of the supplied data
and subsequently try to cluster, predict or classify unseen data. The stage of
explainability refers to the period in the process mentioned above when a model
generates the explanation for the decision it provides. According to Vilone and
Longo (2020, 2021a), the stages are ante hoc and post hoc. Brief descriptions of the
stages are as follows:

• Ante hoc methods generally consider generating the explanation for the decision
from the very beginning of the training on the data while aiming to achieve
optimal performance. Mostly, explanations are generated using these methods
for transparent models, such as fuzzy models and tree-based models;
• Post hoc methods comprise an external or surrogate model and the base model.
The base model remains unchanged, and the external model mimics the base
model’s behaviour to generate an explanation for the users. Generally, these
methods are associated with the models in which the inference mechanism
remains unknown to users, e.g., support vector machines and neural networks.
Moreover, the post hoc methods are again divided into two categories:
model-agnostic and model-specific. The model-agnostic methods apply to any
AI/ML model, whereas the model-specific methods are confined to particular
models.
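
As an illustrative sketch (not taken from any of the reviewed studies), the following
Python snippet contrasts the two stages using scikit-learn: an ante hoc transparent
model (a shallow decision tree whose printed structure is itself the explanation) and
a post hoc, model-agnostic technique (permutation feature importance computed on an
opaque SVM after training). The dataset and all parameters are arbitrary choices made
only for demonstration.

from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y, names = data.data, data.target, list(data.feature_names)

# Ante hoc: the fitted tree itself is transparent and serves as the explanation.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=names))

# Post hoc, model-agnostic: probe a black-box SVM by permuting each feature
# and measuring the resulting drop in accuracy.
svm = SVC(kernel="rbf").fit(X, y)
result = permutation_importance(svm, X, y, n_repeats=20, random_state=0)
for name, importance in zip(names, result.importances_mean):
    print(f"{name}: {importance:.3f}")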


B.2.2 Scope of Explainability


The scope of explainability defines the extent of an explanation produced by some
explainable methods. Two recent literature studies on more than 200 scientific articles
published on XAI deduced that the scope of explainability can be either global or local
(Vilone & Longo, 2020, 2021a). With a global scope, the whole inferential technique
of a model is made transparent or comprehensible to the user, for example, a decision
tree. On the other hand, explanation with a local scope refers to explicitly explaining
a single instance of inference to the user, e.g., for decision trees, a single branch can
be termed as a local explanation.
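
A minimal sketch of this distinction, under the same illustrative assumptions as the
previous snippet, is given below: printing the whole fitted tree yields a global
explanation, while tracing the decision path of one instance yields a local
explanation of that single prediction.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Global scope: the whole inferential structure of the model.
print(export_text(clf, feature_names=list(data.feature_names)))

# Local scope: the path followed by one specific instance through the tree.
sample = data.data[[0]]
node_indicator = clf.decision_path(sample)
leaf_id = clf.apply(sample)[0]
for node_id in node_indicator.indices:
    if node_id == leaf_id:
        print(f"leaf {node_id}: predicted class {clf.predict(sample)[0]}")
        continue
    feature_index = clf.tree_.feature[node_id]
    threshold = clf.tree_.threshold[node_id]
    go_left = sample[0, feature_index] <= threshold
    print(f"node {node_id}: {data.feature_names[feature_index]} "
          f"{'<=' if go_left else '>'} {threshold:.2f}")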

B.2.3 Input and Output


Along with the core concepts, stages and scopes of explainability, input and output
formats were also found to be significant in developing XAI methods (Guidotti,
Monreale, Ruggieri, et al., 2019; Vilone & Longo, 2020, 2021a). The explainable
models’ mechanisms unquestionably differ when learning different input data types,
such as images, numbers, texts, etc. Including these basic forms of input, several
others are found to be utilised in different studies, which are elaborately discussed in
Section B.5.3.1. Finally, the prime concern of XAI, the output format or the form of
explanation varies following the solution to the prior problems. The different forms
of explanation simultaneously vary concerning the circumstances and expertise of
the end-users. The most common forms of explanations are numeric, rules, textual,
visual and mixed. These forms of explanation are illustrated and briefly discussed in
Section B.5.3.4.

B.3 Related Studies

During the past couple of years, research on the developing theories, methodologies
and tools of XAI has been very active, and over time, the popularity of XAI
as a research domain has continued to increase. Before the massive attention of
researchers towards XAI, the earliest review that could be found in the literature
was that by Lacave and Díez (2002). They reviewed the then prevailing explanation
methods precisely for Bayesian networks. In the article, the authors referred to the
level and methods of explanations followed by several techniques that were mostly
probabilistic. Later, Ribeiro et al. (2016b) reviewed the suggested interpretable
models as a solution to the problem of adding explainability to AI/ML models,
such as additive models, decision trees, attention-based networks, and sparse linear
models. Subsequently, they proposed a model-agnostic technique that builds an
interpretable model from the predictions of a black-box model while perturbing the
inputs to observe the black-box model's reactions (Ribeiro et al., 2016a).
With the remarkable implications of GDPR, an enormous number of works have
been published in recent years. The initial works included the notion of explainability
and its use from different points of view. Alonso et al. (2018) accumulated the
bibliometric information on the XAI domain to understand the research trends,
identify the potential research groups and locations, and discover possible research
directions. Goebel et al. (2018) discussed older concepts and linked them to newer


concepts such as deep learning. Black-box models were compared with the white-box
models based on their advantages and disadvantages from a practical point of view
(Loyola-Gonzalez, 2019). Additionally, survey articles were published that advocated
that explainable models replace black-box models for high-stakes decision-making
tasks (Rudin, 2019; Rai, 2020). Surveys were also conducted on the methods of
explainability and addressed the philosophy behind the usage from the perspective
of different domains (Došilović et al., 2018; Mittelstadt et al., 2019; Samek & Müller,
2019) and stakeholders (Preece et al., 2018). Some works included the specific
definitions of technical terms, possible applications, and challenges towards attaining
responsible AI (Xu et al., 2019; Barredo Arrieta et al., 2020; Longo et al., 2020).
Adadi and Berrada (2018) and Guidotti, Monreale, Ruggieri, et al. (2019) separately
studied the available methods of explainability and clustered them in the form of
explanations, e.g., textual, visual, and numeric. However, the literature contains a
good number of review studies on specific forms or methods of explaining AI/ML
models. For example, Robnik-Šikonja and Bohanec (2018) conducted a literature
review on the perturbation-based explanations for prediction models, Q. Zhang et al.
(2018) surveyed the techniques of providing visual explanations for deep learning
models, and Dağlarli (2020) reviewed the XAI approaches for deep meta-learning
models.
Above all, several review studies were conducted by Vilone and Longo (2020,
2021a, 2021b) to gather and present the recent developments in XAI. These studies
presented extensive clustering of the XAI methods and evaluation metrics, which
makes the studies more robust than the other review studies from the literature.
However, none of these studies presented insights on the application domains and
tasks that are facilitated with the developments of XAI. However, researchers from
specific domains also surveyed the possibilities and challenges from their perspectives.
The literature contains most of the works from the medical and health care domains
(Fellous et al., 2019; Holzinger et al., 2019; Mathews, 2019; Jiménez-Luna et al.,
2020; Payrovnaziri et al., 2020; Ahmed et al., 2021; Gulum et al., 2021). However,
there are review articles available in the literature from the domains of industry
(Gade et al., 2019), software engineering (Dam et al., 2018), automotive (Chaczko
et al., 2020), etc.
In the studies mentioned above, the authors reviewed and analysed the concepts
and methodologies of XAI, along with the challenges and possible solutions, either
from the perspective of individual domains or without considering the application
domains and tasks. However, to our knowledge, none of the studies examined XAI methods
considering different application domains and tasks as a whole. Moreover, a survey
following an SLR guideline to review the methods and evaluation metrics for XAI
to maintain a rigid objective throughout the study is still not present. Hence, in
this article, an established guideline for SLR (Kitchenham & Charters, 2007) was
followed to gather and analyse the available methods of adding explainability to
AI/ML models and the metrics of assessing the performance of the methods as well
as the quality of the generated explanations. In addition, this survey study produced
a general notion on the utilisation of XAI in different application domains based on
the selected articles.


B.4 SLR Methodology

The methodology was designed according to the guidelines provided by Kitchenham


and Charters for conducting an SLR (Kitchenham & Charters, 2007). The guidelines
contain clear and robust steps for identifying and analysing potential research works
intending to consider future research possibilities followed by the proper reporting of
the SLR. The SLR methodology includes three stages: (i) planning the review ; (ii)
conducting the review ; and (iii) reporting the review. The SLR methodology stages
are briefly illustrated in Figure B.4. The first two stages are broken down into major
aspects and described in the following subsections, while the third stage, reporting
the SLR, is self-explanatory.

Stage I (Planning): identifying the need for the SLR; specifying the research
questions; developing the review protocol.
Stage II (Conducting): identifying potential research articles; data extraction;
questionnaire survey; data analysis.
Stage III (Reporting): preparing the SLR report; evaluating the report.

Figure B.4: SLR methodology stages following the guidelines from Kitchenham and
Charters (2007). © 2022 by Islam et al. (CC BY 4.0).

B.4.1 Planning the SLR


The first stage involves creating a comprehensive research plan for the SLR. This
stage includes identifying the need for conducting the SLR, outlining the research
questions (RQs) and determining a detailed protocol for the research works to be
accomplished.

B.4.1.1 Identifying the Need for Conducting the SLR


In a continuation of the discussion in Sections B.1 and B.3, with the increasing
number of research works on XAI methodologies, the underlying knowledge becomes
increasingly disorganised. However, very few secondary studies have been conducted
solely to organise the profuse knowledge on the methodologies of XAI. In addition, no
evidence of an SLR was found in the investigated bibliographic databases. Therefore,
the need to conduct an SLR is stipulated to compile and analyse the primary
publications on the methods and metrics of XAI and purposefully present an extensive
and unbiased review.

B.4.1.2 Research Questions


Considering the urge to conduct an SLR of the exploited methods of providing
explainability for AI/ML systems and their evaluations in different application
domains and tasks, several RQs were formulated. Primarily, the questions were
defined to investigate the prevailing approaches towards making AI/ML models


explainable. This included the probe of explainable models by design, different


structures of the generated explanation, and the significant application domains
and tasks utilising the XAI methods. Furthermore, the means of validating the
explainable models were also considered, followed by the open issues and future
research directions. For convenience, the RQs for conducting this SLR are outlined
as follows:

• RQ1: What are the application domains and tasks in which XAI is being
explored and exploited?
– RQ1.1: What are the XAI methods that have been used in the identified
application domains and tasks?
– RQ1.2: What are the different forms of providing explanations?
– RQ1.3: What are the evaluation metrics for XAI methods used in different
application domains and tasks?

B.4.1.3 SLR Protocol


The SLR protocol was designed to achieve the objective of this review by addressing
the RQs outlined in Section B.4.1.2. The protocol mainly contained the specification
of each aspect of conducting the SLR. First, the identification of the potential
bibliographic databases, the definition of the inclusion/exclusion criteria and quality
assessment questions, and the selection of research articles are discussed elaborately
in Section B.4.2.1. In the second step, thorough scanning of each of the articles was
performed, and relevant data were extracted and tabulated in a feature matrix. The
feature set was defined from the knowledge of previous review studies mentioned
in Section B.3, motivated by the RQs outlined in Section B.4.1.2. To support the
feature extraction process, a survey was conducted in parallel which involved the
corresponding/first authors of the selected articles. The survey responses were further
used to obtain missing data, clarify any unclear data, and assess the extracted data
quality. Upon completing feature extraction and the survey, an extensive analysis was
performed to complement the defined RQs. Finally, to portray this SLR outcome, all
the authors were involved in analysing the extracted features, and a detailed report
was generated.

B.4.2 Conducting the SLR


This is the prime stage of an SLR. In this stage, most of the significant activities
defined in the protocol were performed (Section B.4.1.3), i.e., identifying potential
research articles, conducting the author survey, extracting data and performing an
extensive analysis.

B.4.2.1 Identifying Potential Research Articles


Inclusion and exclusion criteria were determined to identify potential research articles
and are presented in Table B.1. The criteria for inclusion in the SLR were
peer-reviewed articles on XAI written in the English language and published in
peer-reviewed international conference proceedings and journals. The criteria for


exclusion from the SLR were articles that were related to the philosophy of XAI
and articles that were not published in any peer-reviewed conference proceedings
or journals. Throughout the article selection process, these inclusion and exclusion
criteria were considered.
Table B.1: Inclusion and exclusion criteria for the selection of research articles. ©
2022 by Islam et al. (CC BY 4.0).

Inclusion criteria: describing the methods of XAI; peer reviewed; published in
conferences/journals; published from 2018 to June 2021; written in English.
Exclusion criteria: describing the methods in different contexts than AI; describing
the concept/philosophy of XAI; preprints and duplicates; published in workshops;
technical reports.

To ensure the credibility of the selected articles, a checklist was designed. The list
contained 10 questions that were adapted from the guidelines for conducting an SLR
by Kitchenham and Charters (2007) and García-Holgado et al. (2020). Moreover,
to facilitate the validation, the questions were categorised on the basis of design,
conduct, analysis, and conclusion. The questions are outlined in Table B.2.

Table B.2: Questions for checking the validity of the selected articles. © 2022 by
Islam et al. (CC BY 4.0).

Design: Are the aims clearly stated? If the study involves assessing a methodology,
is the methodology clearly defined? Are the measures used in the study fully defined?
Conduct: Was outcome assessment blind to treatment group? If two methodologies are
being compared, were they treated similarly within the study?
Analysis: Do the researchers explain the form of data (numbers, images, etc.)? Do the
numbers add up across different tables and methodologies?
Conclusion: Are all study questions answered? How do results compare with previous
reports? Do the researchers explain the consequences of any problems with the
validity of their measures?

The process for identifying potential research articles included the identification,
screening, eligibility, and sorting of the selected articles. A step-by-step flow diagram
of this identification process is illustrated using the “Preferred Reporting Items
for Systematic Reviews and Meta-Analyses” (PRISMA) diagram by Moher et al.
(2009) in Figure B.5. The process started in June 2021. An initial search was
conducted using Google Scholar (https://scholar.google.com/ accessed on 30 June
2021) with the keyword explainable artificial intelligence to assess the available
sources of the research articles. The search results showed that most of the articles
were extracted from SpringerLink (https://link.springer.com/ accessed on 30 June
2021), Scopus (https://www.scopus.com/ accessed on 30 June 2021), IEEE Xplore


Identification: potentially relevant articles from the search results of the
bibliographic databases (1709); excluded: articles published before 2018 (113);
remaining: articles published during January 2018 - June 2021 (1596).
Screening: excluded: articles filtered out through inspecting the title/abstract
(949); remaining: articles published in the domain of AI (647); excluded:
duplicate/preprint articles (376).
Eligibility: peer-reviewed articles published in conferences/journals (277);
excluded: notion/review articles (159); remaining: thoroughly scanned articles (118).
Included: articles added from the recursive reference search (19); selected articles
for analysis (137).

Figure B.5: Flow diagram of the research article selection process adapted
from the PRISMA flow chart by Moher et al. (2009). The number of articles
obtained/included/excluded at different stages is presented in parentheses. © 2022
by Islam et al. (CC BY 4.0).

(https://ieeexplore.ieee.org/ accessed on 30 June 2021) and the ACM Digital Library


(https://dl.acm.org/ accessed on 30 June 2021). Other similar sources were also
present, but those were not considered since they primarily indexed data from the
mentioned sources. Moreover, Google Scholar was not used for further article searches
since it was observed that the results contained articles from diverse domains. In
short, to narrow the search specifically to the AI domain, the mentioned databases
were set to be the main sources of research articles for this review. Initially, 1709
articles were extracted from the bibliographic databases after searching with the
keyword explainable artificial intelligence, as before. To focus this review on the
recent research works, 113 articles were excluded because they were published before
2018. A total of 1596 articles were selected for screening, and after reviewing the titles
or abstracts, more than half of the articles were excluded as they were not related to
AI and XAI. From the 647 articles screened from the AI domain, 376 articles were
excluded as they were duplicates or preprint versions of the articles. After evaluating
the eligibility of the published articles, 277 articles were further considered, and 159
articles were excluded because they were notions or review articles. Specifically, an
article was retained only if it received a “yes” for at least 7 out of the 10 quality
questions in Table B.2, following Genc-Nayebi and Abran (2017) and Da’u and Salim
(2020). Therefore, 118 articles were selected for a thorough review. During the


process, 19 additional related articles were found from a complementary snowballing


search (Wohlin, 2014), in simpler terms, a recursive reference search. Among the
newly included articles, some were published prior to 2018 but were included in this
study due to substantial contribution to the XAI domain. Finally, 137 articles were
selected for the authors’ survey, data/metrics extraction and analysis, among which
128 articles described different methodologies of XAI and 9 articles were solely related
to the evaluation of the explanations or the methods to provide explanations.
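
The quality gate described above (retaining an article only if it receives a “yes”
for at least 7 of the 10 checklist questions in Table B.2) can be sketched as follows;
the article identifiers and the answer patterns are hypothetical placeholders, not the
study's real assessments.

# Hypothetical quality-gate sketch: keep an article only if it collects
# at least 7 "yes" answers out of the 10 checklist questions.
quality_answers = {
    "ArticleA": [True] * 8 + [False] * 2,   # 8 "yes" -> retained
    "ArticleB": [True] * 5 + [False] * 5,   # 5 "yes" -> excluded
}

retained = [ref for ref, answers in quality_answers.items() if sum(answers) >= 7]
print(retained)  # ['ArticleA']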

B.4.2.2 Data Collection


In this review study, the data collection was conducted in two parallel phases.
Several features were extracted by reading the published articles. Simultaneously, a
questionnaire survey was distributed among the corresponding or first authors of the
selected articles to gather their subjective remarks on the article and some features
that were not clear from reading the articles. Each of the phases is elaborately
described in the following paragraphs.

Feature Extraction. All the selected articles on the methodologies and evaluation
of explainability were divided among the authors for thorough scanning to extract
several features. The features were extracted from several viewpoints, namely
metadata, primary task, explainability, explanation, and evaluation. The features
extracted as metadata contained information regarding the dissemination of the
selected study. Features from the viewpoint of the primary task were extracted to
assess a general idea of the variety of AI/ML models that were deliberately used to
perform classification or regression tasks prior to adding explanations to the models.
The last three sets of features were extracted related to the concept of explainability,
the explored or proposed method of making AI/ML models explainable and the
evaluation of the methods and generated explanations, respectively. After extracting
the features, a feature matrix was built to concentrate all the information for further
analysis. The principal features from the feature matrix are concisely presented in
Table B.3.
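
Purely as an illustration of how such a feature matrix could be organised (the study's
actual extraction sheet is not reproduced here), the following sketch tabulates one
hypothetical article using pandas, with columns named after the viewpoints and
features of Table B.3; the example row and the column names are assumptions.

import pandas as pd

columns = [
    "source", "keywords", "domain", "application",   # metadata
    "data", "model", "performance",                  # primary task
    "stage", "scope", "level",                       # explainability
    "method", "type",                                # explanation
    "eval_approach", "eval_metrics",                 # evaluation
]

feature_matrix = pd.DataFrame(columns=columns)
feature_matrix.loc[0] = [                            # one hypothetical article
    "Journal X", "XAI; saliency", "healthcare", "decision support",
    "images", "CNN", "accuracy 0.91",
    "post hoc", "local", "feature",
    "saliency maps", "visual",
    "user study", "satisfaction",
]

# Simple clustering-style summary, e.g., article counts per domain and task.
print(feature_matrix.groupby(["domain", "application"]).size())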

Questionnaire Survey. In parallel to the process of feature extraction through


reading the articles, a questionnaire survey was conducted among the corresponding
or first authors of the selected articles. The questionnaire was developed using Google
Forms and distributed through separate emails to authors. The prime motivation
behind the survey was to complement the feature extraction process by collecting
authors’ subjective remarks on their studies, curating the extracted features, and
gathering specific information that was not present or unclear in the articles. The
survey questionnaire contained queries on some of the features described in the
previous section. In addition to that, queries on the experts’ involvement, the use of
third-party tools, potential stakeholders of this study etc., were also present in the
questionnaire. In response to the invitation to the survey, approximately half of the
invited authors submitted their remarks voluntarily, and these responses add value
to the findings of this review.


Table B.3: List of prominent features extracted from the selected articles. © 2022
by Islam et al. (CC BY 4.0).

Metadata – Source: name of the conference/journal where the article was published.
Metadata – Keywords: prominent words from the abstract and keywords sections that
represent the concept of the article.
Metadata – Domain: the targeted domain for which the study was performed.
Metadata – Application: the specific application that was developed or enhanced.
Primary task – Data: the form of data that was used to develop a model, e.g., images,
texts.
Primary task – Model: the AI/ML model that was used for performing the primary task
of classification/regression.
Primary task – Performance: the performance of the models for the defined tasks.
Explainability – Stage: the stage of generating the explanation, i.e., during the
training of a model (ante hoc) or after the training ends (post hoc).
Explainability – Scope: whether the explanation is on the whole model or on a
specific inference instance, i.e., global, local or both.
Explainability – Level: the level for which the explanation is generated, i.e.,
feature, decision or both.
Explanation – Method: the procedure of generating explanations.
Explanation – Type: the form of explanations generated for the models or the
outcomes.
Evaluation – Approach: the technique of evaluating the explanation and the method of
generating the explanation.
Evaluation – Metrics: the criteria for measuring the quality of the explanations.

B.4.2.3 Data Analysis


Following the completion of feature extraction from the selected articles and the
questionnaire survey by the authors of the articles, the available data were analysed
from multiple viewpoints, as presented in Table B.3. From the metadata, sources were
assessed to obtain an idea of the venues in which the works on XAI are published.
Furthermore, the author-defined keywords and the abstracts were analysed by
utilising natural language processing (NLP) techniques to assess the relevance of the
articles to the XAI domain. Afterwards, the selected articles were clustered based
on application domains and tasks to determine future research possibilities.
Before analysing the selected articles, clustering was performed in accordance with
the primary tasks and input data mentioned in Section B.2 and the method deployed
to perform the primary task. Additionally, the proposed methods of explainability
were clustered based on scopes and stages. Finally, the evaluation methods were
investigated. All the clustering and investigations performed in this review work
were intended to summarise the methods of generating explanations along with the
evaluation metrics and to present guidelines for the researchers devoted to exploiting
the domain of XAI.


B.5 Results

The findings from the performed analysis of the selected articles and the questionnaire
survey are presented concerning the viewpoints defined in Table B.3. To facilitate
a clear understanding, the subsections are titled with specific features, e.g., the
results from the analysis on primary tasks are presented in separate sections. Again,
the concepts of explainability are illustrated along with the methods to provide
explanations in the corresponding sections.

B.5.1 Metadata
This section presents the results obtained from analysing the metadata extracted from
the selected articles – primarily bibliometric data. Among the 137 selected articles,
83 were published in journals, and the rest were presented in conference proceedings.
As per the inclusion criteria of this SLR, all the articles were peer reviewed prior
to publication. Most of the articles provided author-defined keywords, which
facilitate the indexing of the articles in bibliographic databases.
The author-defined keywords were compared with the keywords extracted from the
abstracts of the articles through a word cloud approach. Figure B.6 illustrates the
word cloud of the author-defined keywords and the prominent words extracted from
the abstracts. The illustrated word clouds are expressed with varying font sizes.
More often occurring words are presented in larger fonts (Helbich et al., 2013) and
different colours are used to differentiate words with the same frequencies.


Figure B.6: Word cloud of the (a) author-defined keywords and (b) keywords
extracted from the abstracts through natural language processing. The font size is
proportional to the number of occurrences of the terms and different colours are used
to discriminate terms with equal font size. Both figures illustrate remarkable terms of
XAI. However, the terms from keywords are more conceptual, whereas the abstracts
contain specific terms on the methods and tasks. © 2022 by Islam et al. (CC BY 4.0).
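
A minimal sketch of the kind of term-frequency computation that underlies such word
clouds is shown below; the tokenisation, the stop-word list and the example abstracts
are assumptions made for illustration, not the authors' exact NLP pipeline.

import re
from collections import Counter

abstracts = [
    "Explainable artificial intelligence methods for decision support ...",
    "A post hoc explanation method for deep learning models ...",
]  # placeholder abstracts

stop_words = {"a", "an", "and", "the", "for", "of", "in", "to", "on", "with"}

tokens = []
for text in abstracts:
    words = re.findall(r"[a-z]+", text.lower())
    tokens.extend(w for w in words if w not in stop_words and len(w) > 2)

# The most frequent terms would be rendered with the largest fonts in Figure B.6.
frequencies = Counter(tokens)
print(frequencies.most_common(10))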

Figure B.7 presents the number of publications related to XAI from different
countries of the world. Here, the countries were determined based on the affiliations
of the first authors of the articles. The USA is the pioneer in the development of
XAI topics and is still in the leading position. Similarly, several countries in Europe
are following and have developed an increasing number of systems considering XAI.
Based on the number of publications, Asian countries are apparently still quiescent
in research and development on XAI.


Figure content: world map of the number of publications from different countries
(publication count scale from 1 to 30); top 10 countries and the number of
publications: United States of America (30), United Kingdom (11), Spain (11),
Germany (10), Italy (10), France (9), South Korea (6), China (5), Netherlands (4),
Singapore (3).

Figure B.7: Number of publications proposing new methods of XAI from different
countries of the world and the top 10 countries based on the publication count shown
in parentheses, which is approximately 72% of the 137 articles selected for this SLR.
The countries were determined from the affiliations of the first authors of the articles.
© 2022 by Islam et al. (CC BY 4.0).

B.5.2 Application Domains and Tasks


To gain an idea of the research areas that have been enhanced with XAI, the
application domains and tasks were scrutinised. The numbers of articles on different
domains and tasks are illustrated in Figure B.2. Among the selected articles,
approximately 50% of the publications were domain-agnostic. Half of the remaining
articles were published in the domain of healthcare. Other domains of interest among
XAI researchers were found to be industry, transportation, the judicial system,
entertainment, academia, etc. Table B.4 presents the application domains and
corresponding tasks on which the selected articles substantially contributed. It is
evident from the content of the table that most of the published articles were not
specific to one domain, and safety-critical domains, such as healthcare, industry, and
transportation, received more attention from XAI researchers than domains such
as telecommunication and security. Some domains can be clustered together in a
miscellaneous domain because of the small number of articles (as can be seen in Figure
B.2a). In the case of application tasks, most of the selected articles were published on
supervised and decision-support tasks. A good number of works have been published
on recommendation systems and systems developed on image processing tasks, e.g.,
object detection and facial recognition. Other noteworthy applications in the selected
articles were predictive maintenance and anomaly detection. It was also observed that
several articles presented works on supervised tasks, i.e., classification or prediction
without specifying the application. Moreover, very few articles have been published
on modelling gene relationships, business prediction, natural language processing,
etc. Figure B.8 presents a chord diagram (Tintarev et al., 2018) illustrating the
distribution of the articles published from different application domains for various
tasks. Most of the studies not specific to one domain were for decision support and
image processing tasks.


Table B.4: List of references to selected articles published on the methods of XAI
from different application domains for the corresponding tasks. © 2022 by Islam et al.
(CC BY 4.0).

Domain | Application/Task | Study Count | References
Domain Supervised 23 Féraud and Clérot (2002), Štrumbelj and Kononenko (2014),
agnostic tasks Bonanno et al. (2017), Chander and Srinivasan (2018),
Laugel et al. (2018), Pierrard et al. (2018), Plumb et al.
(2018), Sabol et al. (2019), Alonso, Toja-Alamancos, et al.
(2020), Biswas et al. (2020), Cao et al. (2020), Fernández
et al. (2020), Holzinger et al. (2020), Kovalev and Utkin
(2020), Le et al. (2020), Lundberg et al. (2020), Galhotra
et al. (2021), Hatwell et al. (2021), La Gatta et al. (2021a,
2021b), Moradi and Samwald (2021), Rubio-Manzano et al.
(2021), and Z. Yang et al. (2021)
Image 20 Bach et al. (2015), Hendricks et al. (2016), Lundberg and
processing Lee (2017), Montavon et al. (2017), Q. Zhang et al. (2018),
Oramas et al. (2019), Angelov and Soares (2020), Apicella et
al. (2020), Dutta and Zielińska (2020), Murray et al. (2020),
Oh et al. (2020), Poyiadzi et al. (2020), Riquelme et al.
(2020), Selvaraju et al. (2020), Tan et al. (2020), Yeganejou
et al. (2020), Chandrasekaran et al. (2021), Y.-J. Jung et al.
(2021), Schorr et al. (2021), and S. C.-H. Yang et al. (2021)
Decision 13 Massie et al. (2005), Ribeiro et al. (2016a), Magdalena
support (2018), Ribeiro et al. (2018), Wachter et al. (2018),
García-Magariño et al. (2019), Guidotti, Monreale,
Giannotti, et al. (2019), Ming et al. (2019), Alonso, Ducange,
et al. (2020), De et al. (2020), M. A. Islam et al. (2020),
Meskauskas et al. (2020), and van der Waa et al. (2020)
Recommender 4 Bharadhwaj and Joshi (2018), Csiszár et al. (2020), A. Jung
system and Nardelli (2020), and Kouki et al. (2020)
Anomaly 1 Loyola-González et al. (2020)
detection
Evaluation 1 Dujmović (2020)
process
Natural 1 Ramos-Soto and Pereira-Fariña (2018)
language
processing
Predictive 1 Shalaeva et al. (2018)
maintenance
Time series 1 Karlsson et al. (2020)
tweaking
Healthcare Decision 20 Letham et al. (2015), Lage et al. (2018), Aghamohammadi
support et al. (2019), de Sousa et al. (2019), Kwon et al. (2019),
Lamy et al. (2019), Senatore et al. (2019), D. Wang et al.
(2019), Zheng et al. (2019), Brunese et al. (2020), Chou et
al. (2020), Dindorf et al. (2020), Hatwell et al. (2020), Lamy
et al. (2020), Lin et al. (2020), Panigutti et al. (2020), Soares
et al. (2020), Tabik et al. (2020), Hu and Beyeler (2021), and
Porto et al. (2021)
Risk prediction 4 Lundberg et al. (2018), Lindsay et al. (2020), Pintelas et al.
(2020), and Prifti et al. (2020)
Image 3 Graziani et al. (2020), Rio-Torto et al. (2020), and
processing Muddamsetty et al. (2021)
Recommender 2 D’Alterio et al. (2020) and Lauritsen et al. (2020)
system
Anomaly 1 Itani et al. (2020)
detection
Industry Predictive 5 Assaf and Schumann (2019), H.-Y. Chen and Lee (2020),
maintenance Hong et al. (2020), Serradilla et al. (2020), and Sun et al.
(2020)
Business 3 Rehse et al. (2019), Sarp et al. (2021), and K. Zhang et al.
management (2022)
Anomaly 1 Carletti et al. (2019)
detection
Modelling 1 Schönhof et al. (2021)
Transpor- Image 4 Li et al. (2020), Martínez-Cebrián et al. (2020), Ponn et al.
tation processing (2020), and Lorente et al. (2021)
Assistance 2 Kim and Canny (2017) and Nowak et al. (2019)
system
Academia Evaluation 3 Sokol and Flach (2020), Amparore et al. (2021), and van der
Waa et al. (2021)
Recommender 1 Weber et al. (2018)
system
Entertain- Recommender 3 Rutkowski et al. (2019), X. Wang et al. (2019), and Zhao
ment system et al. (2019)
Finance Anomaly 1 Han and Kim (2019)
detection
Business 1 J.-H. Chen et al. (2020)
management
Recommender 1 He et al. (2015)
system
Genetics Prediction 2 Bonidia et al. (2020) and Huang et al. (2020)
Modelling gene 1 Anguita-Ruiz et al. (2020)
relationship
Judicial Decision 3 Vlek et al. (2016), Loyola-González (2019), and Zhong et al.
system support (2019)
Aviation Automated 1 Keneni et al. (2019)
manoeuvring
Predictive 1 ten Zeldam et al. (2018)
maintenance
Architec- Recommender 1 Eisenstadt et al. (2018)
ture system
Construct- Recommender 1 Anysz et al. (2020)
ion system
Culture Recommender 1 Díaz-Rodríguez and Pisoni (2020)
system
Defence Simulation 1 van Lent et al. (2004)
Geology Recommender 1 Segura et al. (2019)
system
Network Supervised 1 Callegari et al. (2021)
tasks
Security Facial 1 Sarathy et al. (2020)
recognition
Telecomm- Goal-driven 1 Ferreyra et al. (2019)
unication simulation


Figure B.8: Chord diagram (Tintarev et al., 2018) presenting the number of selected
articles published on the XAI methods and evaluation metrics from different application
domains for the corresponding tasks. © 2022 by Islam et al. (CC BY 4.0).

B.5.3 Development of XAI in Different Application Domains


This section briefly describes the concepts of XAI stated in Section B.2 from the
perspective of different application domains. Figure B.9 illustrates the number of
articles selected from different application domains, further clustered in terms of
AI/ML model type, stage, scope, and form of explanation. The following subsections
present evidence of the links between the application domains and the concepts of XAI.

Figure B.9: Number of the selected articles published from different application
domains and clustered on the basis of AI/ML model type, stage, scope, and form of
explanations. The number of articles with each of the properties is given in parentheses.
© 2022 by Islam et al. (CC BY 4.0).

B.5.3.1 Input Data


The selected articles presented diverse XAI models that can be trained on different
forms of input data corresponding to the primary tasks and application domains. Figure
B.10 illustrates the use of different input data types with a Venn diagram depicting
the number of articles for each type. The basic types of input data used in the
proposed methods were vectors of numbers, images, and texts. The use of sensor
signals and graphs was also observed, but in low numbers. Some of the works considered
several forms of data together, such as those of Ribeiro et al. (2018), Alonso,
Toja-Alamancos, et al. (2020), and Lundberg et al. (2020), who proposed methods that
can handle images, texts, and vectors as input. Another proposed method was developed
to learn from graphs and numeric vectors (Segura et al., 2019). In addition to the forms
of input data mentioned above, one specialised form was observed, namely the logic
scoring of preference (LSP) criteria (Dujmović, 2020), which was counted as numeric
data in this analysis due to its apparent similarity.


[Figure B.10 data: Vectors (61); Images (42); Sensor Signals (10); Images and Texts (6); Texts (5); Images and Vectors (3); Images, Texts and Vectors (3); Graphs (2); Graphs and Vectors (1); LSP Criteria (1)]

Figure B.10: Venn diagram with the number of articles using different forms of
data to assess the functional validity of the proposed XAI methodologies. The sizes
of the circles are approximately proportional to the number of articles (shown within
parentheses) that were observed in this review study. © 2022 by Islam et al. (CC BY
4.0).

Table B.5: Different models used to solve the primary task of classification or
regression and their study count. © 2022 by Islam et al. (CC BY 4.0).

Model Types    Model    Count    References
Neural ApparentFlow-net; Convolutional 63 Féraud and Clérot (2002), Bach et al.
Networks Neural Network (CNN); Deep (2015), Hendricks et al. (2016), Bonanno
(NNs) Neural Network (DNN); Deep et al. (2017), Kim and Canny (2017),
Reinforcement Learning (DRL); Lundberg and Lee (2017), Montavon et
Explainable Deep Neural Network al. (2017), Bharadhwaj and Joshi (2018),
(xDNN); Explainable Neural ten Zeldam et al. (2018), Wachter et al.
Network (ExNN); Global–Local (2018), Weber et al. (2018), Q. Zhang et
Capsule Networks (GLCapsNet); al. (2018), Assaf and Schumann (2019), de
GoogleLeNet; Gramian Angular Sousa et al. (2019), García-Magariño et al.
Summation Field CNN (2019), Guidotti, Monreale, Ruggieri, et al.
(GASF-CNN); Hopfield Neural (2019), Han and Kim (2019), Kwon et al.
Networks (HNN); (2019), Nowak et al. (2019), Oramas et
Knowledge-Aware Path Recurrent al. (2019), Rehse et al. (2019), X. Wang
Network; Knowledge-Shot et al. (2019), Zhao et al. (2019), Zheng
Learning (KSL); LeNet-5; Locally et al. (2019), Angelov and Soares (2020),
Guided Neural Networks (LGNN); Anysz et al. (2020), Apicella et al. (2020),
Long/Short-Term Memory Brunese et al. (2020), J.-H. Chen et al.
(LSTM); LVRV-net; MatConvNet; (2020), Chou et al. (2020), H.-Y. Chen and
Multilayer Perceptrons (MLP); Lee (2020), Csiszár et al. (2020), De et al.
Nilpotent Neural Network (NNN); (2020), Dindorf et al. (2020), Graziani et
Recurrent Neural Network al. (2020), Lauritsen et al. (2020), Le et
(RNN); Region-Based CNN al. (2020), Martínez-Cebrián et al. (2020),
(RCNN); RestNet; ROI-Net; Murray et al. (2020), Panigutti et al.
Temporal Convolutional Netwrok (2020), Pintelas et al. (2020), Ponn et al.
(TCN); VGG-19; YOLO. (2020), Poyiadzi et al. (2020), Rio-Torto et
al. (2020), Riquelme et al. (2020), Sarathy
et al. (2020), Selvaraju et al. (2020), Sun
et al. (2020), Tabik et al. (2020), Tan et
al. (2020), Chandrasekaran et al. (2021),
Y.-J. Jung et al. (2021), Lorente et al.
(2021), Moradi and Samwald (2021), Porto
et al. (2021), Rubio-Manzano et al. (2021),
S. C.-H. Yang et al. (2021), W. Yang et al.
(2022), and K. Zhang et al. (2022)
Ensemble Adaptive Boosting (AdaBoost); 21 Laugel et al. (2018), Lundberg et al.
Models Explainable Unsupervised (2018), Plumb et al. (2018), ten Zeldam
(EMs) Decision Trees (eUD3.5); eXtreme et al. (2018), Carletti et al. (2019),
Gradient Boosting (XGBoost); Guidotti, Monreale, Giannotti, et al.
Gradient Boosting Machines (2019), Loyola-González et al. (2020),
(GBM); Isolation Forest (IF); D. Wang et al. (2019), Anysz et al.
Random Forest (RF); Random (2020), Dindorf et al. (2020), Fernández
Shapelet Forest (RSF). et al. (2020), Hatwell et al. (2020, 2021),
Karlsson et al. (2020), Lin et al. (2020),
Loyola-González (2019), Serradilla et al.
(2020), La Gatta et al. (2021a, 2021b),
Moradi and Samwald (2021), and Sarp et
al. (2021)
Tree- Classification and Regression Tree 10 Ribeiro et al. (2016a), Shalaeva et al.
Based (CART); Conditional Inference (2018), Alonso, Ducange, et al. (2020),
Models Tree (CTree); Decision Tree (DT); Alonso, Toja-Alamancos, et al. (2020),
(TB) Fast and Frugal Trees (FFTs); Anysz et al. (2020), Cao et al. (2020), Itani
Fuzzy Hoeffding Decision Tree et al. (2020), Lindsay et al. (2020), Pintelas
(FHDT); J48; One-Class Tree et al. (2020), and Porto et al. (2021)
(OCTree); Multi-Operator
Temporal Decision Tree (MTDT);
Recursive Partitioning and
Regression Trees (RPART).
Fuzzy Big Bang–Big Crunch Interval 09 Magdalena (2018), Pierrard et al. (2018),
Models Type-2 Fuzzy Logic System Ramos-Soto and Pereira-Fariña (2018),
(FMs) (BB-BC IT2FLS); Constrained Ferreyra et al. (2019), Rutkowski et al.
Interval Type-2 Fuzzy System (2019), Sabol et al. (2019), Alonso,
(CIT2FS); Cumulative Fuzzy Toja-Alamancos, et al. (2020), and
Class Membership Criterion D’Alterio et al. (2020)
(CFCMC); Fuzzy Unordered Rule
Induction Algorithm (FURIA);
Hierarchical Fuzzy Systems
(HFS); Multi-Objective
Evolutionary Fuzzy Classifiers
(MOEFC); Wang–Mendal
Algorithm of Fuzzy Rule
Generation (WM Algorithm).
Support SVM with Linear and Radial 08 Bach et al. (2015), Ribeiro et al. (2016a),
Vector Basis Function (RBF) Kernels. Laugel et al. (2018), ten Zeldam et al.
Machines (2018), Guidotti, Monreale, Giannotti, et
(SVMs) al. (2019), Pintelas et al. (2020), Serradilla
et al. (2020), Callegari et al. (2021), La
Gatta et al. (2021a, 2021b), and Moradi
and Samwald (2021)
Unsorted Cartesian Genetic Programming 07 He et al. (2015), Senatore et al. (2019),
Models (CGP); Computational Zhong et al. (2019), Anguita-Ruiz et al.
(UMs) Argumentation (CA); Logic (2020), Dujmović (2020), Kouki et al.
Scoring of Preferences (LSP); (2020), and Lamy et al. (2020)
Preference Learning (PL);
Probabilistic Soft Logic (PSL);
Sequential Rule Mining (SRM);
TriRank.
Linear Linear Discriminant Analysis 06 Ribeiro et al. (2016a), Zheng et al. (2019),
Models (LDA); Logistic Regression (LgR); Anysz et al. (2020), A. Jung and Nardelli
(LMs) Linear Regression (LnR). (2020), and Pintelas et al. (2020)
Nearest k-Nearest Neighbours (kNN); 06 Ribeiro et al. (2016a), ten Zeldam et al.
Neighbou- Distance-Weighted kNN (WkNN). (2018), Weber et al. (2018), Karlsson et al.
rs Models (2020), Pintelas et al. (2020), and Serradilla
(NNMs) et al. (2020)
Neuro- Adaptive Network-Based Fuzzy 05 Aghamohammadi et al. (2019), Keneni et
Fuzzy Inference System (ANFIS); al. (2019), M. A. Islam et al. (2020), Soares
Models Improved Choquet Integral et al. (2020), and Yeganejou et al. (2020)
(NFMs) Multilayer Perceptron (iChIMP);
LeNet with Fuzzy Classifier;
Mamdani Fuzzy Model;
Sugeno-Type Fuzzy Inference
System; Zero-Order Autonomous
Learning Multiple-Model
(ALMMo-0*).
Case- CBR-kNN; CBR-WkNN; 04 Massie et al. (2005), Eisenstadt et al.
based CBR-PRVC (Pattern Recognition, (2018), Lamy et al. (2019), and van der Waa
Reasoning Validation and Contextualisation) et al. (2020)
(CBR) Methodology.
Bayesian Bayesian Network (BN); Bayesian 03 Letham et al. (2015), Vlek et al. (2016), and
Models Rule List (BRL); Gaussian Naive Serradilla et al. (2020)
(BM) Bayes Classifier/Regressor
(GNBC/GNBR).

B.5.3.2 Models for Primary Tasks


The majority of AI applications perform two basic types of tasks, i.e., supervised
(classification and regression) and unsupervised (clustering) tasks, and this has
remained unchanged in the XAI domain. The authors of the selected articles used
different established AI/ML models depending on the task. The methods were clustered
based on the basic type of the models, specifically, neural network (NN), ensemble model
(EM), Bayesian model (BM), fuzzy model (FM), tree-based model (TM), linear model
(LM), nearest neighbour model (NNM), support vector machine (SVM), neuro-fuzzy
model (NFM), and case-based reasoning (CBR). Works related to these models were
clustered on the basis of their types and are presented in Table B.5. The table also
contains the names of the different variants of the AI/ML models, references to the
articles featuring the models, and the number of studies performed. It was observed
that neural network-based
models were exploited in most of the studies (63) from the selected articles. The
second-highest number of studies (21) utilised ensemble techniques for performing the
primary supervised or unsupervised tasks. Given this strong interest in neural networks
and ensemble techniques, it can reasonably be assumed that these models were chosen
for incorporating explainability because of their wide acceptance across various domains
owing to their performance. In addition to these well-known algorithms, several other
algorithms were observed, such as probabilistic soft logic (PSL) (Kouki et al., 2020),
LSP (Dujmović, 2020), sequential rule mining (SRM) (Anguita-Ruiz et al., 2020),
preference learning (Lamy et al., 2020), Cartesian genetic programming (CGP)
(Senatore et al., 2019), Predomics (Prifti et al., 2020), and TriRank (He et al., 2015).
The acronyms of the model types are further referenced in Table B.6 to indicate their
relation to the core AI/ML models.
Throughout this study, it was evident that most of the research works were
domain-agnostic. Among specific domains, healthcare, industry, and transportation
were explored more than others. In these domains, as stated above, diverse forms of
neural networks had been invoked to perform different tasks (see Figure B.9), followed
by other types of models, as listed in Table B.5. The numbers associated with different
model types in Figure B.9 and Table B.5 differ because the figure presents the number
of articles whereas the table lists the number of model variants. It was also observed
that in some articles the authors presented their work using different models of similar
types.

[Figure B.11 data: Properties of Explainable Models. Stage: Ante-hoc (40), Post-hoc (88). Form: Numeric (10), Rule-based (17), Textual (14), Visualisation (52), Mixed (35)]

Figure B.11: Distribution of the selected articles based on the stage, scope, and
form of explanations. The number of articles with each of the properties is given in
parentheses. © 2022 by Islam et al. (CC BY 4.0).

B.5.3.3 Methods for Explainability


The available methods for adding explainability to the existing and proposed AI/ML
models were initially clustered on the basis of three properties: (i) the stage of
generating an explanation; (ii) the scope of the explanation; and (iii) the form of the
explanation. Figure B.11 illustrates the number of articles presenting research works
concerning each of the properties. The summary of the clustering is represented in
Table B.6 where model-specific methods are cross-referenced to the model types
described in Section B.5.3.2. A good number of model-agnostic (MA) methods
were also deployed to provide explainability in the selected articles of this review,
such as Anchors (Ribeiro et al., 2018), Explain Like I’m Five (ELI5) (Serradilla et
al., 2020), Local Interpretable Model-Agnostic Explanations (LIME) (Ribeiro et al.,
2016a), and Model Agnostic Supervised Local Explanations (MAPLE) (Plumb et al.,
2018). LIME was modified and proposed as SurvLIME by Kovalev and Utkin (2020).
Afterwards, the authors incorporated well-known Kolmogorov–Smirnov bounds to
SurvLIME and proposed SurvLIME-KS (Kovalev et al., 2020). Feature importance
was also used to generate numeric explanations in several research works (Štrumbelj
& Kononenko, 2014; Rehse et al., 2019; Anysz et al., 2020; Pintelas et al., 2020). The
Shapley Additive Explanations (SHAP) method was proposed by Lundberg and Lee
(2017), and it was later used by several authors to generate mixed explanations
containing numbers, texts, and visualisations (D. Wang et al., 2019; Ponn et al., 2020).
In addition, another variant of SHAP, Deep-SHAP, was proposed to explicitly
explain deep learning models. Two very recent studies proposed Cluster-Aided
Space Transformation for Local Explanation (CASTLE) (La Gatta et al., 2021a)
and Pivot-Aided Space Transformation for Local Explanation (PASTLE) (La Gatta
et al., 2021b). The authors claimed that a higher quality of local explanations can
be generated with these methods than with the prevailing methods for unsupervised
and supervised tasks, respectively.
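As an illustration of how such post hoc, model-agnostic explainers are typically applied in practice, the following minimal sketch uses the open-source lime package together with a scikit-learn classifier. The dataset, model, and parameter choices are illustrative assumptions only and are not taken from any of the surveyed articles.

    # Minimal sketch of a post hoc, model-agnostic local explanation with LIME.
    # Assumes the third-party packages scikit-learn and lime are installed.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from lime.lime_tabular import LimeTabularExplainer

    data = load_breast_cancer()
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(data.data, data.target)          # opaque model for the primary task

    explainer = LimeTabularExplainer(
        data.data,
        feature_names=list(data.feature_names),
        class_names=list(data.target_names),
        mode="classification",
    )

    # Local scope: LIME perturbs one instance, queries the black box, and fits
    # a sparse linear surrogate whose weights serve as the explanation.
    explanation = explainer.explain_instance(
        data.data[0], model.predict_proba, num_features=5
    )
    for feature, weight in explanation.as_list():
        print(f"{feature}: {weight:+.3f}")

The same pattern applies to the other model-agnostic tools mentioned above: the trained model is treated purely as a prediction function, so no change to its internal mechanism is required.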
In terms of application domains, post hoc techniques are more developed for
producing explanations at the local scope. One can see in the illustration of Figure
B.9 that the majority of the post hoc techniques were developed for complex models
such as neural networks and ensemble models. On the other hand, most of the
ante hoc techniques are associated with fuzzy and tree-based models across all the
application domains.

B.5.3.4 Forms of Explanation


This section presents the different forms of explanations that have been added to
different AI/ML models. From the selected articles, it was observed that mainly
four different forms of explanations were generated to explain both the decisions of the
models and the process of deducing a decision. The forms of explanations are
numeric, rules, textual, and visualisation. Figure B.12 illustrates the basic forms of
explanations. In some of the works, the authors used these forms in a combined
fashion to make the explanation more understandable and user friendly. All the
forms of explanation are discussed along with the references to key works with the
corresponding forms in the subsequent paragraphs.

Numeric Explanations. Numeric explanations are mostly generated by measuring
the contribution of the input variables to the model’s outcome.
The contribution is represented by various measures, such as the confidence measures
of features (Moradi & Samwald, 2021) illustrated in Figure B.12a, saliency, causal
importance (Féraud & Clérot, 2002), feature importance (Rehse et al., 2019; Anysz
et al., 2020), and mutual importance (A. Jung & Nardelli, 2020). M. A. Islam et al.
(2020) improvised the MLP with the Choquet integral to add numeric explanations
within both the local and global scope. Sarathy et al. (2020) computed and compared
the quadratic mean among the instances to generate the decision with explanations.



Figure B.12: Different forms of explanations: (a) numeric explanation of remaining
life estimation in industry appliances (Moradi & Samwald, 2021); (b) visual explanation
for fault diagnosis of industrial equipment by Sun et al. (2020); (c) example of rule-based
explanation in the form of a tree (Lindsay et al., 2020); and (d) explanation text
generated with GRACE, proposed by Le et al. (2020). © 2022 by Islam et al. (CC
BY 4.0).

Carletti et al. (2019) used depth-based isolation forest feature importance (DIFFI) to
support the decisions from depth-based isolation forests (IFs) in anomaly detection
for industrial applications, and the FDE measure was developed to add precise
explainability for failure diagnosis in automated industries (ten Zeldam et al., 2018).
Moreover, several model-agnostic tools generate numeric explanations, e.g., Anchors
(Ribeiro et al., 2018), ELI5, LIME (Serradilla et al., 2020), SHAP (Ponn et al., 2020),
and LORE (D. Wang et al., 2019). In addition, Table B.6 contains further examples
of numeric explanations, and the methods are clustered on the basis of stage and the
scope of explanations. However, the numeric explanations demand high expertise in
the corresponding domains as they are associated with the features. This assumption
supports the low number of studies on numeric explanations, as shown in Figure B.9.
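Numeric explanations of this kind can often be approximated with generic feature-relevance procedures. The following minimal sketch computes permutation feature importance with scikit-learn; the dataset, model, and settings are illustrative assumptions, and the sketch is not a reimplementation of any specific method cited above.

    # Minimal sketch of a numeric explanation via permutation feature importance.
    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

    # Shuffle one feature at a time and measure how much the held-out score
    # degrades; larger drops indicate more influential features.
    result = permutation_importance(model, X_test, y_test,
                                    n_repeats=20, random_state=0)
    for idx in np.argsort(result.importances_mean)[::-1]:
        print(f"{X.columns[idx]:>6s}: {result.importances_mean[idx]:.4f} "
              f"+/- {result.importances_std[idx]:.4f}")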

Rule-Based Explanations. Rule-based explanations illustrate a model’s
decision-making process in the form of a tree or list. Figure B.12c demonstrates
an example of a rule-based explanation. Largely, the models producing rule-based
explanations generate explanations with a global scope, i.e., of the whole model.
De et al. (2020) used the existing TREPAN decision tree as a surrogate model
for an FFNN to generate rules depicting the flow of information within the
neural network. Rutkowski et al. (2019) used the Wang–Mendal (WM) algorithm
to generate fuzzy rules to support recommendations with explanations. A novel
neuro-fuzzy system, ALMMo-0*, was proposed by Soares et al. (2020). In addition,
model-specific methods have been proposed to generate rule-based explanations
such as eUD3.5, an explainable version of UD3.5 (Loyola-González et al., 2020)
and Ada-WHIPS to support the AdaBoost ensemble method (Hatwell et al., 2020).
More methods generating rule-based explanations are listed in Table B.6. The
rule-based explanations are much simpler in nature than the numeric explanations
that facilitate this type of explanation in supporting recommendation systems
developed for general users from domains such as entertainment and finance.
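As a concrete illustration of the surrogate idea behind many rule-based explanations, the sketch below fits a shallow decision tree to the predictions of a black-box classifier and prints the resulting rules. It follows the general surrogate principle rather than the specific TREPAN algorithm, and the dataset, depth limit, and model are illustrative assumptions.

    # Minimal sketch of a global, rule-based surrogate explanation.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    black_box = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

    # The surrogate tree is trained on the black box's *predictions*, so its
    # rules approximate the learned decision process of the opaque model.
    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(X, black_box.predict(X))

    print(export_text(surrogate, feature_names=load_iris().feature_names))
    print("Fidelity to the black box:", surrogate.score(X, black_box.predict(X)))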

Textual Explanations. Textual explanations were found to be the least common of
all forms of explanations, owing to the higher computational cost of the natural
language processing they require. Textual explanations
are mostly generated at the local scope, i.e., for an individual decision. In notable
works, textual explanations were generated using counterfactual sets (Wachter et al.,
2018; Fernández et al., 2020), template-based natural language generation (Zhong et
al., 2019), etc. Weber et al. (2018) proposed textual CBR (TCBR) utilising patterns
of input-to-output relations in order to recommend citations for academic researchers
through textual explanations. Unlike TCBR, interpretable confidence measures were
used by van der Waa et al. (2020) with CBR to generate textual explanations. Le et
al. (2020) proposed GRACE which can generate intuitive textual explanations along
with the decision. The textual explanations generated with GRACE were revealed
to be more understandable by humans in synthetic and real experiments. Moreover,
textual explanations are found to be generated at the local scope (see Figure B.9)
and these explanations are associated with academic research, judicial systems, etc.
Table B.6 lists several other proposed methods to generate textual explanations.
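A simple way to produce such textual explanations is to fill a natural-language template with the most influential feature contributions. The sketch below is an illustrative toy example of this template-based idea and does not reproduce the specific generation methods cited above; the feature names and contribution values are hypothetical.

    # Minimal sketch of a template-based textual explanation.
    def textual_explanation(label, contributions, top_k=3):
        """Turn a predicted label and per-feature contributions into a sentence."""
        ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
        parts = [
            f"{name} {'increased' if value > 0 else 'decreased'} the score by {abs(value):.2f}"
            for name, value in ranked[:top_k]
        ]
        return f"The model predicted '{label}' mainly because " + "; ".join(parts) + "."

    # Hypothetical contributions, e.g., taken from a feature-attribution method.
    print(textual_explanation(
        "high risk",
        {"age": 0.42, "blood pressure": 0.31, "weekly exercise": -0.18},
    ))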

Visual Explanations. The most common form of explanation was found to
be visualisations, as shown in Table B.6. With respect to the stage of adding
explanations, in the majority of the cases, visual explanations in both the local
and global scopes were generated using post hoc techniques and the research studies
were carried out as domain-agnostic and from the healthcare domain (see Figure
B.9). Common visualisation techniques are class activation maps (CAM) (Assaf
& Schumann, 2019; Sun et al., 2020) and attention maps (Kim & Canny, 2017;
Riquelme et al., 2020). CAM was further extended with gradient weights, and
Grad-CAM was proposed by Selvaraju et al. (2020). Brunese et al. (2020) used
Grad-CAM to detect COVID-19 infection based on X-rays. Han and Kim (2019)
adopted another variant, pGrad-CAM, to provide an explanation for banknote fraud
detection. Heatmaps of salient pixels were used by Graziani et al. (2020) as a
complement to the concept-based explanation. They proposed a framework of
concept attribution for deep learning to quantify the contribution of features of
interest to the deep network’s decision making (Graziani et al., 2020). In addition,
several explanation techniques were proposed with attribution-based visualisations,
such as Multi-Operator Temporal Decision Trees (MTDTs) (Shalaeva et al., 2018),
Layerwise Relevance Propagation (LRP) (Bach et al., 2015), Selective LRP (SLRP)
(Y.-J. Jung et al., 2021), etc. The Rainbow Boxes-Inspired Algorithm (RBIA) was
extensively used by Lamy et al. (2019) and Lamy et al. (2020) in different decision
support tasks within the healthcare domain. Specialised methodologies have also
been developed by researchers from diverse domains to add visual explanations to
the outcomes of different AI/ML models such as iNNvestigate (Lauritsen et al.,
2020), non-negative matrix factorisation (NMF) (Oramas et al., 2019), candlestick
plots (J.-H. Chen et al., 2020), and sequential rule mining (SRM) (Anguita-Ruiz
et al., 2020). In addition to the methodologies mentioned above, Table B.6 contains
additional methods to add visual explanations to different types of AI/ML models.
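To make the widely used CAM/Grad-CAM family more concrete, the following is a minimal PyTorch sketch of the Grad-CAM computation for a convolutional classifier: the activations of a chosen convolutional layer are weighted by the spatially averaged gradients of the class score and summed over channels. It is a simplified sketch under these assumptions, not the reference implementation of Selvaraju et al. (2020); the choice of target layer and the usage line are hypothetical.

    # Minimal Grad-CAM sketch (assumes PyTorch is installed).
    import torch
    import torch.nn.functional as F

    def grad_cam(model, image, target_layer, class_idx=None):
        """Return a class activation heatmap for one image of shape (1, C, H, W)."""
        activations, gradients = [], []

        def fwd_hook(module, inputs, output):
            activations.append(output.detach())

        def bwd_hook(module, grad_input, grad_output):
            gradients.append(grad_output[0].detach())

        h1 = target_layer.register_forward_hook(fwd_hook)
        h2 = target_layer.register_full_backward_hook(bwd_hook)
        try:
            model.eval()
            scores = model(image)
            if class_idx is None:
                class_idx = int(scores.argmax(dim=1))
            model.zero_grad()
            scores[0, class_idx].backward()
        finally:
            h1.remove()
            h2.remove()

        acts, grads = activations[0], gradients[0]        # (1, K, h, w)
        weights = grads.mean(dim=(2, 3), keepdim=True)    # channel-wise importance
        cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam[0, 0], class_idx

    # Hypothetical usage with a torchvision ResNet:
    #   heatmap, cls = grad_cam(resnet, img_tensor, resnet.layer4[-1])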

Table B.6: Methods for explainability, stage (Ah: ante-hoc, Ph: post-hoc) and scope
(L: local, G: global) of explainability, forms of explanations (N : numeric, R: rules, T :
textual, V : Visual) and the type of models used for performing the primary tasks (refer
to Table B.5 for the elaborations of the model types). © 2022 by Islam et al. (CC BY
4.0).

Methods for Explainability    References    Stage (Ah, Ph)    Scope (L, G)    Form (N, R, T, V)    Models for Primary Task
Ada-WHIPS Hatwell et al. (2020) ✓ ✓ ✓ EM
ALMMo-0* Soares et al. (2020) ✓ ✓ ✓ ✓ NFM
Anchors Ribeiro et al. (2018) ✓ ✓ ✓ ✓ ✓ ✓ ✓ MA
ANFIS Bonanno et al. (2017), ✓ ✓ ✓ ✓ ✓ ✓ ✓ FM; GA; NN;
Aghamohammadi et al. (2019),
Keneni et al. (2019), and H.-Y.
Chen and Lee (2020)
ApparentFlow- Zheng et al. (2019) ✓ ✓ ✓ NN
net
Attention Maps Kim and Canny (2017), Nowak ✓ ✓ ✓ NN
et al. (2019), Martínez-Cebrián
et al. (2020), and Riquelme et
al. (2020)
BB-BC IT2FLS Ferreyra et al. (2019) ✓ ✓ ✓ ✓ ✓ FM
BEN Chandrasekaran et al. (2021) ✓ ✓ ✓ NN
BN Vlek et al. (2016) and S. C.-H. ✓ ✓ ✓ ✓ BM
Yang et al. (2021)
BRL Letham et al. (2015) ✓ ✓ ✓ BM
CAM Assaf and Schumann (2019) and ✓ ✓ ✓ NN
Sun et al. (2020)
Candlestick J.-H. Chen et al. (2020) ✓ ✓ ✓ NN
Plots
CART Lindsay et al. (2020) ✓ ✓ ✓ TM
CASTLE La Gatta et al. (2021a) ✓ ✓ ✓ ✓ ✓ MA
Causal Féraud and Clérot (2002) ✓ ✓ ✓ NN
Importance
CFCMC Sabol et al. (2019) ✓ ✓ ✓ FM
CGP Senatore et al. (2019) ✓ ✓ ✓ UM
CIE Moradi and Samwald (2021) ✓ ✓ ✓ ✓ ✓ EM; NN; SVM
CIT2FS D’Alterio et al. (2020) ✓ ✓ ✓ ✓ FM
Concept Graziani et al. (2020) ✓ ✓ ✓ ✓ NN
Attribution
Counterfactual Wachter et al. (2018) and ✓ ✓ ✓ EM; NN
Sets Fernández et al. (2020)
CTree Lindsay et al. (2020) and Porto ✓ ✓ ✓ TM
et al. (2021)
DeconvNet Oramas et al. (2019) ✓ ✓ ✓ ✓ NN
Decision Tree Cao et al. (2020) and Dutta and ✓ ✓ ✓ ✓ NN; TM
Zielińska (2020)
Deep-SHAP K. Zhang et al. (2022) ✓ ✓ ✓ ✓ ✓ MA
DTD Montavon et al. (2017) ✓ ✓ ✓ NN
DIFFI Carletti et al. (2019) ✓ ✓ ✓ EM
ELI5 Serradilla et al. (2020) and Sarp ✓ ✓ ✓ ✓ ✓ MA
et al. (2021)
Encoder- Rio-Torto et al. (2020) ✓ ✓ ✓ NN
Decoder
eUD3.5 Loyola-González et al. (2020) ✓ ✓ ✓ ✓ EM
ExNN Z. Yang et al. (2021) ✓ ✓ ✓ ✓ NN
FACE Poyiadzi et al. (2020) ✓ ✓ ✓ NN
FDE ten Zeldam et al. (2018) ✓ ✓ ✓ ✓ EM; NN; NNM;
SVM
Feature Štrumbelj and Kononenko ✓ ✓ ✓ ✓ ✓ ✓ MA
Importance (2014), Rehse et al. (2019),
Anysz et al. (2020), and
Pintelas et al. (2020)
Feature Pattern Loyola-González (2019) ✓ ✓ ✓ EM
FFT Lindsay et al. (2020) ✓ ✓ ✓ TM
FINGRAM Alonso, Ducange, et al. (2020) ✓ ✓ ✓ TM
FormuCaseViz Massie et al. (2005) ✓ ✓ ✓ CBR
FURIA Alonso, Toja-Alamancos, et al. ✓ ✓ ✓ FM
(2020)
Fuzzy LeNet Yeganejou et al. (2020) ✓ ✓ ✓ FM
Fuzzy Relations Pierrard et al. (2018) and ✓ ✓ ✓ FM
Ramos-Soto and Pereira-Fariña
(2018)
gbt-HIPS Hatwell et al. (2021) ✓ ✓ ✓ ✓ EM
Generation Zhao et al. (2019) ✓ ✓ ✓ NN
GLAS Oh et al. (2020) ✓ ✓ ✓ MA
GRACE Le et al. (2020) ✓ ✓ ✓ NN
Grad-CAM Biswas et al. (2020), Brunese et ✓ ✓ ✓ NN
al. (2020), H.-Y. Chen and Lee
(2020), Selvaraju et al. (2020),
Tabik et al. (2020), Schönhof
et al. (2021), and Schorr et al.
(2021)
Growing Laugel et al. (2018) ✓ ✓ ✓ ✓ EM; SVM
Spheres
HFS Magdalena (2018) ✓ ✓ ✓ ✓ FM
iChIMP M. A. Islam et al. (2020) ✓ ✓ ✓ ✓ NFM
ICM van der Waa et al. (2020) ✓ ✓ ✓ CBR
iNNvestigate Lauritsen et al. (2020) ✓ ✓ ✓ ✓ NN
Interpretable Q. Zhang et al. (2018) ✓ ✓ ✓ NN
Filters
J48 Alonso, Toja-Alamancos, et al. ✓ ✓ ✓ ✓ TM
(2020), Bonidia et al. (2020),
and Lindsay et al. (2020)
Knowledge X. Wang et al. (2019) ✓ ✓ ✓ ✓ NN
Graph
KSL Chou et al. (2020) ✓ ✓ ✓ ✓ NN
LEWIS Galhotra et al. (2021) ✓ ✓ ✓ ✓ ✓ MA
LGNN Tan et al. (2020) ✓ ✓ ✓ NN
LIME Ribeiro et al. (2016a), de Sousa ✓ ✓ ✓ ✓ ✓ MA
et al. (2019), Dindorf et al.
(2020), Serradilla et al. (2020),
and Schönhof et al. (2021)
LORE Guidotti, Monreale, Giannotti, ✓ ✓ ✓ EM; NN; SVM
et al. (2019) and D. Wang et al.
(2019)
LPS Murray et al. (2020) ✓ ✓ ✓ NN
LRP Bach et al. (2015) ✓ ✓ ✓ NN; SVM
LRCN Hendricks et al. (2016) ✓ ✓ ✓ NN;
LSP Dujmović (2020) ✓ ✓ ✓ ✓ UM
MAPLE Plumb et al. (2018) ✓ ✓ ✓ ✓ ✓ MA
MTDT Shalaeva et al. (2018) ✓ ✓ ✓ TM
Mutual A. Jung and Nardelli (2020) ✓ ✓ ✓ LM
Importance
MWC, MWP García-Magariño et al. (2019) ✓ ✓ ✓ ✓ NN
Nilpotent Logic Csiszár et al. (2020) ✓ ✓ ✓ ✓ ✓ NN
Operators
NLG Rubio-Manzano et al. (2021) ✓ ✓ ✓ ✓ ✓ NN
NMF Q. Zhang et al. (2018) ✓ ✓ ✓ NN
OC-Tree Itani et al. (2020) ✓ ✓ ✓ TM
Ontological Panigutti et al. (2020) ✓ ✓ ✓ NN
Perturbation
PAES-RCS Callegari et al. (2021) ✓ ✓ ✓ FM
PASTLE La Gatta et al. (2021b) ✓ ✓ ✓ ✓ ✓ MA
pGrad-CAM Han and Kim (2019) ✓ ✓ ✓ NN
Prescience Lundberg et al. (2018) ✓ ✓ ✓ EM
PRVC Eisenstadt et al. (2018) ✓ ✓ ✓ ✓ ✓ ✓ CBR
PSL Kouki et al. (2020) ✓ ✓ ✓ ✓ UM
QMC Sarathy et al. (2020) ✓ ✓ ✓ NN
QSAR Huang et al. (2020) ✓ ✓ ✓ NN
RAVA Segura et al. (2019) ✓ ✓ ✓ MA
RBIA Lamy et al. (2019) and Lamy et ✓ ✓ ✓ ✓ CBR
al. (2020)
RetainVis Kwon et al. (2019) ✓ ✓ ✓ ✓ ✓ ✓ NN
RISE Li et al. (2020) ✓ ✓ ✓ NN
RPART Porto et al. (2021) ✓ ✓ ✓ TM
RuleMatrix Ming et al. (2019) ✓ ✓ ✓ MA
Saliency Féraud and Clérot (2002) ✓ ✓ ✓ NN
SHAP Lundberg and Lee (2017), D. ✓ ✓ ✓ ✓ ✓ MA
Wang et al. (2019), Hong et
al. (2020), Lin et al. (2020),
Ponn et al. (2020), Serradilla et
al. (2020), and Hu and Beyeler
(2021)
Shapelet Karlsson et al. (2020) ✓ ✓ ✓ EM
Tweaking
SLRP Y.-J. Jung et al. (2021) ✓ ✓ ✓ NN
SRM Anguita-Ruiz et al. (2020) ✓ ✓ ✓ ✓ UM
SurvLIME-KS Kovalev and Utkin (2020) ✓ ✓ ✓ ✓ ✓ MA
TCBR Weber et al. (2018) ✓ ✓ ✓ CBR
Template- Zhong et al. (2019) ✓ ✓ ✓ UM
Based Natural
Language
Generation
Time-Varying Bharadhwaj and Joshi (2018) ✓ ✓ ✓ ✓ NN
Neighbourhood
TreeExplainer Lundberg and Lee (2017) ✓ ✓ ✓ ✓ ✓ MA
TREPAN De et al. (2020) ✓ ✓ ✓ NN
Tripartite He et al. (2015) ✓ ✓ ✓ ✓ UM
Graph
WM Algorithm Rutkowski et al. (2019) ✓ ✓ ✓ FM
xDNN Soares et al. (2020) ✓ ✓ ✓ NN
XRAI Lorente et al. (2021) ✓ ✓ ✓ NN

B.5.3.5 Evaluation of Explainability


The development of methodologies and metrics to evaluate explanation generation
techniques, and to assess the quality of the generated explanations, has not kept pace
with the rapid increase in research works


Figure B.13: UpSet plot presenting the distribution of different methods of evaluating
the explainable systems. The vertical bars in the bottom-left represent the number
of studies conducting each of the methods. The single and connected black circles
represent the combination of the evaluation methods and the horizontal bars illustrate
their number of studies. © 2022 by Islam et al. (CC BY 4.0).

devoted to exploring new methodologies of XAI. In this study, only nine of the selected
articles were found to focus entirely on evaluation methods and metrics for XAI.
However, all the articles proposing new methods to add explainability used at least
one of three techniques to assess their explainable model or the explanations generated
by the model. These techniques were (i) user studies; (ii) synthetic experiments; and
(iii) real experiments. The number of studies adopting each of the techniques is
illustrated in Figure B.13. It was observed that
most of the studies invoked user studies and synthetic experiments as standalone
methods for evaluating the proposed explainable systems. Only a few studies used
real experiments alone to evaluate their proposed systems. However, several studies
combined user studies with real and synthetic experiments in the evaluation process,
as illustrated in the UpSet plot in Figure B.13. User studies were
mostly performed to evaluate the quality of the generated explanation in the form
of case studies and questionnaire surveys. Generally, these cases are formulated by
the researchers combining a real or synthetic scenario that is associated with some
prediction/classification output and its explanation in any of the forms presented in
Section B.5.3.4. The surveys were observed to be conducted among the respective
domain experts. They had to answer questions on the understandability and quality
of the explanations from the presented case studies. To facilitate the user studies,
Holzinger et al. (2020) proposed the System Causability Scale (SCS) to measure the
quality of explanations. In simpler terms, the SCS resembles the widely known Likert
scale (Albaum, 1997). In earlier work, Chander and Srinivasan (2018) introduced the
notion of the cognitive value of an explanation and described its role in generating
significant explanations within a given setting. Lage et al. (2018) proposed the
methodology of a user study to measure the human-interpretability of logic-based
explanations. The prime metrics were the response time for understanding, the
accuracy of understanding, and the subjective satisfaction of the users. Ribeiro et al.
(2016a) explicitly conducted a simulated user experiment to address the following
questions: (1) Are the explanations faithful to the model? (2) Can the explanations
aid users to ascertain trust in predictions? and (3) Are the explanations useful for
evaluating the model as a whole? They also involved human subjects in evaluating
the explanations generated by LIME and SP-LIME within the following situations
(Ribeiro et al., 2016b): (1) whether users can choose a better classifier in terms of
generalisation; (2) whether the users can perform feature engineering to improve the
model; and (3) whether the users are capable of pointing out the irregularities of a
classifier by observing the explanations.

Figure B.14: Different methods of evaluating explanations, which were presented in
the selected articles with the number of studies given in parentheses. Corresponding
application domains and tasks of the performed evaluation methods are illustrated with
links. The widths of the links are proportional to the number of studies. Some of the
studies invoked a combination of different evaluation methods. © 2022 by Islam et al.
(CC BY 4.0).

Different types of experiments with real and synthetic data were performed to
quantify various metrics for the generated explanations to evaluate the quality of
the explanations. Vilone and Longo (2021b) proposed two types of evaluation
methods for assessing the quality of the explanations: objective and human-centred.
Human-centred methods are mostly performed through user studies as discussed
earlier. The prominent objective measures are briefly stated here. Guidotti,
Monreale, Ruggieri, et al. (2019) used fidelity, l-fidelity, and hit scores and proposed
the use of the Jaccard measure of stability, the number of falsified conditions in
counterfactual rules, the rate of the agreement of black-box and counterfactual
decisions for counterfactual instances, F1-score of agreement of black-box and
counterfactual decisions, etc. In another work, stability was proposed as an
objective function that penalises the inclusion of too many terms in the textual
explanations (Hatwell et al., 2020). To evaluate the visual explanations, Bach et al.
(2015) proposed a pixel-flipping method that enables users to discriminate between
two heatmaps. Moreover, sentence evaluation metrics, such as METEOR and CIDEr
were used to evaluate textual explanations associated with visualisations (Hendricks
et al., 2016). Samek et al. (2017) proposed the Area over the MoRF (Most Relevant
First) Curve (AOPC) to measure the impact on classification performance when
generating a visual explanation. In the proposition, the authors illustrated that a
large AOPC value provides a good measure for a very informative heatmap. AOPC
can assess the amount of information present in a visual explanation but it lacks
in terms of being able to assess the quality of the understandability of the users.
In another study, Rio-Torto et al. (2020) proposed the Percentage of Meaningful
Pixels Outside the Mask (POMPOM) as another measurable criterion of explanation
quality. POMPOM is defined as the ratio between the number of meaningful pixels
outside the region of interest and the total number of pixels in the image. The
authors have also conducted a comparative study with AOPC and POMPOM. They
concluded that POMPOM generates superior results for the supervised approach
whereas AOPC has the upper hand for the unsupervised approach. Significantly,
Sokol and Flach (2020) provided a comprehensive and representative taxonomy and
associated descriptors in the form of a fact sheet with five dimensions that can help
researchers develop and evaluate new explainability approaches.
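Following the textual definition of POMPOM given above, the metric can be computed directly from a binarised explanation map and a region-of-interest mask, as in the short sketch below. The thresholding step used to decide which pixels count as meaningful is an assumption of this sketch, and the original implementation by Rio-Torto et al. (2020) may differ in detail.

    # Minimal sketch of the POMPOM criterion as defined in the text above.
    import numpy as np

    def pompom(saliency_map, roi_mask, threshold=0.5):
        """Meaningful pixels outside the region of interest, divided by the
        total number of pixels (lower values indicate explanations that stay
        inside the relevant region)."""
        meaningful = saliency_map >= threshold            # binarise the explanation
        outside = meaningful & ~roi_mask.astype(bool)     # meaningful but outside ROI
        return outside.sum() / saliency_map.size

    # Toy example: a 4x4 saliency map and an ROI covering the top-left quadrant.
    saliency = np.array([[0.9, 0.8, 0.1, 0.0],
                         [0.7, 0.6, 0.0, 0.0],
                         [0.0, 0.0, 0.9, 0.0],
                         [0.0, 0.0, 0.0, 0.0]])
    roi = np.zeros((4, 4), dtype=bool)
    roi[:2, :2] = True
    print(f"POMPOM = {pompom(saliency, roi):.3f}")        # one stray pixel out of 16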
The associations among the evaluation methods and different application domains
and applications are illustrated in Figure B.14. It can be easily observed that
synthetic experiments and user studies were mostly used to evaluate proposed
explainable systems from the domains of healthcare and industry. Moreover, a
good number of domain-specific studies also utilised the aforementioned evaluation
methods. In terms of specific tasks, user studies were mostly conducted for evaluating
recommender systems. Very few studies have conducted real experiments, which
were found to be from healthcare and industry domains for decision support, image
processing, and predictive maintenance.

B.6 Discussion

The continuously growing interest in the research domain of XAI worldwide resulted
in the publication of a large number of research articles containing diverse knowledge
of explainability from different perspectives. In the published articles, it is often
noticed that similar terms are used interchangeably (Barredo Arrieta et al., 2020),
which is one of the major hurdles for a new researcher to initiate work on developing
a new methodology of XAI. In addition, an “Explainable AI (XAI) Program” by
DARPA (Gunning & Aha, 2019), the Chinese Government’s “The Development Plan
for New Generation of Artificial Intelligence” (Xu et al., 2019) and the GDPR by the
EU (Wachter et al., 2018) escalated the number of research studies during the past
couple of years, as demonstrated in Figure B.1. The literature shows several review
and survey studies on XAI philosophy, taxonomy, methodology, evaluation, etc.
Nevertheless, to our knowledge, no study has been performed that has wholly focused
on the XAI methodologies from the perspective of different application domains and
tasks, let alone following some prescribed technique of conducting literature reviews.
In contrast, this SLR followed a proper guideline (Kitchenham & Charters, 2007)
that precisely defines the methodology of surveying the recent developments in XAI
techniques and evaluation criteria. One of the major advantages of an SLR is that the
methodology contains a workflow for reviewing literature by defining and addressing
specific RQs to restrict the subject matter of a study to the scope of the designated
topic. Here, the RQs presented in Section B.4.1.2 were purposefully designed to
review the development and evaluation of XAI methodologies and were addressed
with the presented outcomes of the study listed in Section B.5.
This study started with the task of scanning more than a thousand peer-reviewed
articles from different bibliographic databases. Following the process described in
Section B.4.2.1, 137 articles were thoroughly analysed to summarise the recent
developments. Among the selected articles, 19 were added through the snowballing
search, prescribed by Wohlin (2014). Here, the cited articles in the pre-selected
articles were checked to identify more articles that met this study’s inclusion criteria.
While conducting the snowballing search, some of the articles meeting the inclusion
criteria were found to be published prior to the defined period of 2018 – 2020 in
the inclusion criteria (Table B.1) but were apparently very significant in terms of
content as they were cited in many of the pre-selected articles. Considering the
impact of those articles in developing XAI methodologies, they were included in
the study despite not completely meeting the inclusion criteria. Moreover, during
the screening of articles, some of the articles were unintentionally overlooked due
to the use of the specific keyword searched (explainable artificial intelligence) in the
bibliographic databases. One example is the article in which Spinner et al. (2020)
presented a visual analytics framework for interactive and explainable machine
learning; its index terms did not contain the search keyword, although its abstract
and author keywords contained the term “Explainable AI”. The interchangeable use of several closely
related terms (e.g., interpretability, transparency, and explainability) in metadata
impedes the proper acquisition of knowledge on XAI. As a result, a few potentially
significant articles were overlooked during this review study. The absence of acquired
knowledge from the neglected articles can be considered a limitation of this SLR.
The selected articles were analysed from five different viewpoints, i.e., metadata,
primary task, explainability, the form of explanation, and the evaluation of methods
and explanations. The prominent features from the respective viewpoints are
summarised in Table B.3. The features and possible alternatives were set in such
a way that the result of the analysis can substantially address the RQs. Section
B.5 presents the outcomes of the analysis by identifying insights into the domains
and applications in which XAI is developing, the prevailing methods of generating
and evaluating explanations, etc. This information is thus readily available for
prospective researchers from miscellaneous domains to instigate research projects
on the methodological development of XAI. In addition, a questionnaire survey was
designed and administered to the authors of the selected articles with several aims:
to curate the extracted feature values from the articles, to assess the credibility of
the definitions of the features, etc. The questionnaire was distributed to the authors
through email, and the response rate was approximately 50%. The responses were
apparently similar to the information extracted from the articles, except in a few
cases. For example, from the article, it was found that the input data for the method
developed by Dujmović (2020) were numeric. In contrast, from the author’s response,
the input data were mentioned as LSP, and this information was incorporated in the
analysis. This instance of curating, clarifying, and cross-checking the information
extracted from the articles advocates the need for a questionnaire survey. This
review study took advantage of the questionnaire survey to assess the credibility
of the literature reviewer as well as clarify the information.
During the exploration of the contents of the sorted-out articles, the first step was
to analyse the metadata. To determine the relevancy of the articles, keywords that
were explicitly defined by the authors and keywords extracted from the abstracts
were investigated in the form of word clouds following the methodology developed
by Helbich et al. (2013). It was observed that the significant terms were explainable
artificial intelligence, deep learning, machine learning, explainability, visualisation
etc. These terms were considered significant due to their larger appearance in
the word cloud, which resulted from repeated occurrences of the terms in the
supplied texts. In addition, a higher number of occurrences of terms, such as deep
learning or visualisation, aligns with the higher number of studies with concepts
presented in Tables B.5 and B.6, indicating a degree of tunnel vision in XAI
development. More attention to less investigated models, such as SVMs and neuro-fuzzy
models, and to alternative visualisation techniques would add more value and novelty
to XAI. Moreover, the prominent terms are strongly related to the primary concept of
this study, which increases confidence that the selected articles are relevant. In
addition, the terms from the author-defined keywords were more conceptual than
the terms from the abstracts of the articles. On the other hand, the abstracts
contained more specific terms based on the application tasks and AI/ML models.
From the metadata, the countries of the authors’ affiliations were evaluated, and
it was found that the USA leads by a significant margin in terms of the number
of publications. However, the collective publications from the countries belonging
to the EU exceed the number of publications from the USA. This high number
of publications indicated the immense impact of imposing various regulations and
expressing interest through different programs from different governments. Although
the Government of China issued a development plan on XAI, comparatively few of the
screened articles were authored by researchers affiliated with Chinese institutions.
Overall, it can be stated that the number of research studies on XAI increased in the
regions where government authorities put forward programs or regulations. Given these
recent regulatory developments, it is reasonable to assume that government funding
agencies have increased their support for this specific field, which has resulted in a
higher number of research publications, as shown in Figure B.7.
In the subsequent sections, significant aspects of developing XAI methods are
discussed, including addressing the RQs (defined in Section B.4.1.2) with respect to
the defined features and outcomes of the performed analyses.

B.6.1 Input Data and Models for Primary Task


Input data were stated to be an essential aspect to be considered for developing
explainable systems by Vilone and Longo (2020). Therefore, the different forms of
input data which were deliberately used in the studies of the selected articles were
investigated in this review. It was observed that the vectors containing numeric
values were used in most of the articles, followed by the use of images as input. With
the growing variety of data forms, more attention needs to be devoted to explaining
models and decisions derived from other forms of data, such as graphs and texts.
However, the findings of this study show that some specific forms of data are already
being exploited by researchers in the respective fields, albeit to a limited extent;
for example, graph structures are considered as input to XAI methodologies
developed with fuzzy and neuro-fuzzy models. The uses of different input data types
are illustrated in Figure B.10 within the structure of a Venn diagram as many of the
articles used multiple types of input data for their proposed models, and the Venn
diagram has the capability of presenting combined relations in terms of frequencies.
While investigating the models that were designed or applied to solve primary
tasks, it was observed that most of the studies were performed concerning neural
networks. Specifically, out of 122 articles on XAI methods, 60 articles presented
work with various neural networks. The reason behind this overwhelming interest
of researchers towards making neural networks explainable is undoubtedly the
performance of these types of models in various tasks from diverse domains. A good
number of studies utilised ensemble methods, fuzzy models and tree-based models.
Other significant types of models were found to be SVM, CBR and Bayesian models
(Table B.5).

B.6.2 Development of Explainable Models in Different Application


Domains
This section addresses this review study’s outcome within the scope of RQ1: What
are the application domains and tasks in which XAI is being explored and exploited?
The question was further split into three research sub-questions to more precisely
analyse the subject.

B.6.2.1 Application Domains and Tasks


To generate insight into the possible fields of application of XAI methods, RQ1 was
raised. A broader idea of the concerned application domains and tasks was developed
from the metadata analysis. As illustrated in Figure B.2a, most of the articles
were published without targeting any specific domain, which extends the horizon
for XAI researchers to utilise the concepts from these studies and enhance them further
in a domain-specific or domain-agnostic way. Among the domain-specific publications
on XAI, the healthcare domain has been explored much more than the other domains.
The reason behind this massive interest in XAI from the
healthcare domain is unquestionably the involvement of machines in matters that
deal with human lives. Simultaneously, it was observed from Figure B.2b that most
of the research studies were carried out to make decision support systems more
explainable to users. Additionally, a good number of studies have been performed on
image processing and recommender systems. All these application tasks can also be
employed in the healthcare domain. From the distribution of the articles based on the
application domain and tasks, it could be concluded that XAI has been profoundly
exploited where humans are directly involved.


B.6.2.2 Explainable Models


RQ1.1 was proposed to investigate the models that are explainable by design. From
the theoretical point of view, as discussed in Section B.2, the inference mechanism
of some models can be understood by humans provided they have a certain level of
expertise. In reality, these models are often termed transparent models. Barredo
Arrieta et al. (2020) categorised linear/logistic regression, decision trees, k-nearest
neighbours, rule-based learners, general additive models and Bayesian models as
transparent AI/ML models. Concerning the stages of generating explanations, ante
hoc methods are invoked for the transparent models where the explanations are
generated based on their transparency by design. Table B.6 presents the methods
available for generating explanations. Consistent with this, ante hoc methods were
found to be used for generating explanations from most of the transparent models used
for solving the primary task of classification/regression or clustering. On the other
hand, post hoc methods were observed in action for the simplification
of ensemble models, neural networks, SVMs, etc. (Table B.6). Generally, in the post
hoc method, a surrogate model is developed to mimic the inference mechanism of the
black-box models, which is comparatively simpler and less complex than ante hoc
methods, where the explanation is generated during the inference process. It can be
deduced from the thematic synthesis of the selected articles that post hoc methods are
suitable for the established and running systems without manipulating the prevailing
mechanism and performance of the systems. However, for new systems with the
requirement of explaining model decisions, ante hoc methods are more appropriate.
In addition, visualisation and feature relevance techniques were employed to generate
explanations for users with different levels of expertise. As a result, several tools for
post hoc methods, such as LIME, SHAP, Anchors, and ELI5, along with their variants,
have evolved for advanced users. Researchers from different domains have utilised
these tools and added explainability to the black-box AI/ML models.
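To contrast the two stages discussed here, the following minimal sketch shows the ante hoc case: a transparent model (a standardised logistic regression) whose coefficients can be read directly as the explanation, without any separate post hoc tool. The dataset and model settings are illustrative assumptions.

    # Minimal sketch of an ante hoc (transparent-by-design) explanation.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    data = load_breast_cancer()
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
    model.fit(data.data, data.target)

    # After standardisation, the sign and magnitude of each coefficient directly
    # describe how the corresponding feature pushes the decision.
    coefs = model.named_steps["logisticregression"].coef_[0]
    for idx in np.argsort(np.abs(coefs))[::-1][:5]:
        print(f"{data.feature_names[idx]:>25s}: {coefs[idx]:+.2f}")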

B.6.2.3 Forms of Explanation


The outcome of an explainable model, i.e., the form of an explanation, was the prime
concern of RQ1.2. Four basic types of explanations were observed, i.e., numeric,
rule-based, visual and textual (Figure B.12). In addition to that, some of the
articles presented mixed explanations, which combined the four types. Generally,
visualisations are the most widely used, as humans can interpret them more easily than
other types of explanations. This type of explanation contains charts, trend lines, etc.,
and visual explanations are conventionally preferred for image processing tasks. Numeric
explanations were mainly adopted in systems targeted at experts, clarifying a model’s
decision with respect to different attributes in terms of feature importance.
Understanding the numbers associated with different attributes seems slightly more
difficult for a general end-user than a visual or textual representation. For providing
numeric explanations, ante hoc methods are very few compared to post hoc methods.
Rule-based explanations are generally produced from tree-based or ensemble methods,
and most of them are ante hoc methods. In this type of explanation, the inference
mechanisms of the
models were presented in the form of a table containing all the rules and tree-like
graphs depicting the decision process in short. Finally, the textual explanations

159
XAI for Enhancing Transparency in DSS

are some statements presented in a human-understandable format, which are less


common than the other forms of explanations. This type of explanation can
be adopted for the interactive systems where general users are involved but it
demands higher computational complexity due to NLP tasks. In summary, textual
explanations in the form of natural language should be presented for the general users,
rule-based explanations and visualisations are found to be appropriate for advanced
users, and numeric explanations are mostly appropriate for experts.
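To illustrate the rule-based form of explanation discussed above, the short sketch
below (illustrative only; the dataset and model are assumptions, not taken from the
reviewed articles) prints the rules of a transparent decision tree in a human-readable
form:

```python
# Sketch of a rule-based explanation extracted from a transparent (ante hoc) model.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# export_text renders the learned decision process as nested if/else rules.
print(export_text(tree, feature_names=list(data.feature_names)))
```

Such a textual listing of rules corresponds to the tabular and tree-like presentations
observed in the reviewed articles.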

B.6.3 Evaluation Metrics for Explainable Models


This section addresses RQ1.3, which was proposed to investigate the development of
evaluation methods for the explainability of a model and the metrics for validating
the generated explanations. The currently available methods for evaluating explainable
AI/ML models are not as mature as those for state-of-the-art black-box models, let
alone the metrics for evaluating the explanations themselves. From the studied
articles, it was observed that most works adopted standard performance metrics,
such as accuracy, precision and recall, to validate the developed explainable models.
In addition to these established metrics, several works proposed and utilised novel
metrics, which are discussed in Section B.5.3.5. On the other hand, it was found
that researchers conducted user studies to validate the quality of the explanations,
although in most cases these studies included only a small number of participants.
Nevertheless, several researchers proposed effective means of measuring the quality
of an explanation and of developing proper explainable models. For example,
Holzinger et al. (2020) proposed the System Causability Scale (SCS) to measure the
causability of the explanations generated by a model. In another article, Sokol and
Flach (2020) developed an explainability fact sheet to be followed while developing
XAI methodologies, which is a major takeaway of this review study. However, further
investigation is required to establish domain-, application- and method-specific
methodologies that keep humans in the loop, as users’ level of expertise largely
determines their understanding of the explanations.
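For completeness, the standard metrics mentioned above can be computed as in the
following short sketch (the labels and predictions are purely illustrative, not data
from any reviewed study):

```python
# Illustrative computation of standard validation metrics for an explainable classifier.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]   # assumed ground-truth labels
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]   # assumed model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```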

B.6.4 Open Issues and Future Research Direction


One of the objectives of this study was to identify the open issues in developing
explainable models and to propose future research directions for different application
domains and tasks. Based on the studies presented in the articles selected for
this SLR, it was observed that the major limitation of the proposed methodologies
lies in the evaluation of the explanations. The studies addressed this issue with
different kinds of user studies and experiments; however, there is still an urgent need
for a generic method for evaluating explanations. Another observed issue is the
algorithm-specific nature of many approaches to adding explainability, which is an
obstacle to making established, operational systems explainable. Additionally, other
open issues remain to be addressed. Based on the observed shortcomings of prevailing
explainable models, several possible research directions are outlined below:

• It is evident from the findings of the study that safety-critical domains and their
associated tasks have benefited most from the development of XAI. However,
less investigation has been performed for other sensitive domains, such as the
judicial system, finance and academia, in contrast with the domains of healthcare
and industry. The methods can be further explored for these domains that are
less developed in terms of XAI;
• One of the promising research areas in the networking domain is the
Internet of Things (IoT). The literature indicates that several IoT applications,
such as anomaly detection (Forestiero, 2021) and building information systems
(Forestiero et al., 2008; Forestiero & Papuzzo, 2021), have been facilitated
by agent-based algorithms. These applications can be combined with XAI
methods to make them more acceptable to end-users;
• The impact of the dataset (particularly the effect of dataset imbalance, feature
dimensionality, different types of bias in data acquisition and in the dataset itself,
etc.) on developing an explainable model can be assessed through dedicated studies;
• It was observed that most of the works concerned neural networks, for which
explanations were generated at the local scope through post hoc methods. Similar
cases were also observed for other models, such as SVMs and ensemble models,
since their inference mechanisms remain unclear to users. Although several studies
have proposed approaches that produce explanations at the global scope by
mimicking the models’ behaviour, these approaches lack accuracy. Further
investigation can be carried out to produce explanations at the global scope
without compromising the models’ performance on the base task;
• The major challenge of evaluating an explanation is to develop a method
that can deal with the different levels of expertise and understanding of
users. Generally, these two characteristics of users vary from person to person.
Substantial research is needed to establish a proper methodology for evaluating
the explanations based on the intended users’ expertise and capacity;
• User studies were employed to validate explanations based on natural language,
i.e., textual explanations. Automated evaluation metrics for textual explanations
are not yet prominent in the literature;
• The quality of heatmaps as a form of visualisation has not yet been evaluated
beyond visual assessment. In addition to heatmaps, evaluation metrics for other
visualisation techniques, e.g., saliency maps, are yet to be defined.

B.7 Conclusion

This paper presented a thematic synthesis of articles on the application domains of
XAI methodologies and their evaluation metrics through an SLR. The significant
contributions of this study are (1) lists of application domains and tasks that
have been facilitated by XAI methods; (2) the currently available approaches for
adding explanations to AI/ML models and their evaluation metrics; and (3) the
forms of explanation employed, such as numeric and rule-based explanations.
References to the primary studies could serve as a cookbook to assist prospective
researchers from diverse domains in initiating research on developing new XAI
methodologies. However, articles published after the mentioned period were not
analysed in this study due to time constraints. Several articles were also excluded
because of the specific search keywords used in the bibliographic databases. More
comprehensive primary and secondary analyses of the methodological development of
XAI are required across different application domains. We believe such studies could
expedite the human acceptability of intelligent systems. Accommodating users’ varying
levels of expertise will also help in understanding the needs of different user groups.
Such studies should explicitly explore the underlying characteristics of the transparent
models (fuzzy, CBR, etc.) deployed for the respective tasks, carefully analyse the
impact of the dataset, and consider well-established metrics for evaluating all forms
of explanation.

Author Contributions. Conceptualisation, M.R.I.; methodology, M.R.I.;
software, M.R.I.; validation, M.R.I., M.U.A. and S.B. (Shaibal Barua); formal
analysis, M.R.I.; investigation, M.R.I.; resources, M.R.I.; data curation, M.R.I.,
M.U.A. and S.B. (Shaibal Barua); writing–original draft preparation, M.R.I.;
writing–review and editing, M.R.I., M.U.A., S.B. (Shaibal Barua) and S.BM.
(Shahina Begum); visualisation, M.R.I. and M.U.A.; supervision, M.U.A. and S.BM.
(Shahina Begum); project administration, M.U.A. and S.BM. (Shahina Begum);
funding acquisition, M.U.A. All authors have read and agreed to the published version
of the manuscript.

Funding. This study was performed as a part of the following projects: i)
SIMUSAFE, funded by the European Union’s Horizon 2020 research and innovation
programme under grant agreement N. 723386; ii) BrainSafeDrive, co-funded by the
Vetenskapsrådet - The Swedish Research Council and the Ministero dell’Istruzione
dell’Università e della Ricerca della Repubblica Italiana under the Italy-Sweden
Cooperation Program; and iii) ARTIMATION, funded by the SESAR Joint
Undertaking under the European Union’s Horizon 2020 research and innovation
programme under grant agreement N. 894238.

Institutional Review Board Statement. Not applicable.

Informed Consent Statement. Not applicable.

Data Availability Statement. Not applicable.

Conflicts of Interest. The authors declare no conflict of interest.

Bibliography

Adadi, A., & Berrada, M. (2018). Peeking Inside the Black-Box: A Survey on
Explainable Artificial Intelligence (XAI). IEEE Access, 6, 52138–52160.
Aghamohammadi, M., Madan, M., Hong, J. K., & Watson, I. (2019). Predicting Heart
Attack Through Explainable Artificial Intelligence. In J. M. F. Rodrigues,
P. J. S. Cardoso, J. Monteiro, R. Lam, V. V. Krzhizhanovskaya, M. H. Lees,
J. J. Dongarra, & P. M. Sloot (Eds.), Computational Science – ICCS 2019
(pp. 633–645). Springer International Publishing.

Ahmed, M. U., Barua, S., & Begum, S. (2021). Artificial Intelligence, Machine
Learning and Reasoning in Health Informatics—Case Studies. In M. A. R.
Ahad & M. U. Ahmed (Eds.), Signal Processing Techniques for
Computational Health Informatics (pp. 261–291). Springer International
Publishing.
Albaum, G. (1997). The Likert Scale Revisited. International Journal of Market
Research, 39 (2), 1–21.
Alonso, J. M., Castiello, C., & Mencar, C. (2018). A Bibliometric Analysis
of the Explainable Artificial Intelligence Research Field. In J. Medina,
M. Ojeda-Aciego, J. L. Verdegay, D. A. Pelta, I. P. Cabrera, B.
Bouchon-Meunier, & R. R. Yager (Eds.), Information Processing and
Management of Uncertainty in Knowledge-Based Systems. Theory and
Foundations (pp. 3–15). Springer International Publishing.
Alonso, J. M., Ducange, P., Pecori, R., & Vilas, R. (2020). Building Explanations for
Fuzzy Decision Trees with the ExpliClas Software. 2020 IEEE International
Conference on Fuzzy Systems (FUZZ-IEEE), 1–8.
Alonso, J. M., Toja-Alamancos, J., & Bugarín, A. (2020). Experimental Study on
Generating Multi-modal Explanations of Black-box Classifiers in terms of
Gray-box Classifiers. 2020 IEEE International Conference on Fuzzy Systems
(FUZZ-IEEE), 1–8.
Amparore, E., Perotti, A., & Bajardi, P. (2021). To Trust or Not to Trust an
Explanation: Using LEAF to Evaluate Local Linear XAI Methods. PeerJ
Computer Science, 7, e479.
Angelov, P., & Soares, E. (2020). Towards Explainable Deep Neural Networks
(xDNN). Neural Networks, 130, 185–194.
Anguita-Ruiz, A., Segura-Delgado, A., Alcalá, R., Aguilera, C. M., & Alcalá-Fdez,
J. (2020). eXplainable Artificial Intelligence (XAI) for the Identification
of Biologically Relevant Gene Expression Patterns in Longitudinal Human
Studies, Insights from Obesity Research. PLOS Computational Biology,
16 (4), e1007792.
Anysz, H., Brzozowski, Ł., Kretowicz, W., & Narloch, P. (2020). Feature Importance
of Stabilised Rammed Earth Components Affecting the Compressive
Strength Calculated with Explainable Artificial Intelligence Tools. Materials,
13 (10), 2317.
Apicella, A., Isgrò, F., Prevete, R., & Tamburrini, G. (2020). Middle-Level Features
for the Explanation of Classification Systems by Sparse Dictionary Methods.
International Journal of Neural Systems, 30 (08), 2050040.
Assaf, R., & Schumann, A. (2019). Explainable Deep Neural Networks for
Multivariate Time Series Predictions. Proceedings of the Twenty-Eighth
International Joint Conference on Artificial Intelligence (IJCAI), 6488–6490.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., & Samek, W.
(2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by
Layer-Wise Relevance Propagation. PLOS ONE, 10 (7), e0130140.
Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik,
S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R.,
Chatila, R., & Herrera, F. (2020). Explainable Artificial Intelligence (XAI):
Concepts, Taxonomies, Opportunities and Challenges toward Responsible
AI. Information Fusion, 58, 82–115.

Bharadhwaj, H., & Joshi, S. (2018). Explanations for Temporal Recommendations.
KI - Künstliche Intelligenz, 32 (4), 267–272.
Biswas, R., Barz, M., & Sonntag, D. (2020). Towards Explanatory Interactive Image
Captioning Using Top-Down and Bottom-Up Features, Beam Search and
Re-ranking. KI - Künstliche Intelligenz, 34 (4), 571–584.
Bonanno, D., Nock, K., Smith, L., Elmore, P., & Petry, F. (2017). An Approach to
Explainable Deep Learning using Fuzzy Inference. Next-Generation Analyst
V, 10207, 132–136.
Bonidia, R. P., Machida, J. S., Negri, T. C., Alves, W. A. L., Kashiwabara, A. Y.,
Domingues, D. S., De Carvalho, A., Paschoal, A. R., & Sanches, D. S. (2020).
A Novel Decomposing Model With Evolutionary Algorithms for Feature
Selection in Long Non-Coding RNAs. IEEE Access, 8, 181683–181697.
Brunese, L., Mercaldo, F., Reginelli, A., & Santone, A. (2020). Explainable Deep
Learning for Pulmonary Disease and Coronavirus COVID-19 Detection from
X-rays. Computer Methods and Programs in Biomedicine, 196, 105608.
Callegari, C., Ducange, P., Fazzolari, M., & Vecchio, M. (2021). Explainable Internet
Traffic Classification. Applied Sciences, 11 (10), 4697.
Cao, H. E. C., Sarlin, R., & Jung, A. (2020). Learning Explainable Decision Rules
via Maximum Satisfiability. IEEE Access, 8, 218180–218185.
Carletti, M., Masiero, C., Beghi, A., & Susto, G. A. (2019). Explainable Machine
Learning in Industry 4.0: Evaluating Feature Importance in Anomaly
Detection to Enable Root Cause Analysis. 2019 IEEE International
Conference on Systems, Man and Cybernetics (SMC), 21–26.
Chaczko, Z., Kulbacki, M., Gudzbeler, G., Alsawwaf, M., Thai-Chyzhykau, I., &
Wajs-Chaczko, P. (2020). Exploration of Explainable AI in Context of
Human-Machine Interface for the Assistive Driving System. In N. T. Nguyen,
K. Jearanaitanakij, A. Selamat, B. Trawiński, & S. Chittayasothorn (Eds.),
Intelligent Information and Database Systems (pp. 507–516). Springer
International Publishing.
Chander, A., & Srinivasan, R. (2018). Evaluating Explanations by Cognitive Value.
In A. Holzinger, P. Kieseberg, A. M. Tjoa, & E. Weippl (Eds.), Machine
Learning and Knowledge Extraction (pp. 314–328). Springer International
Publishing.
Chandrasekaran, J., Lei, Y., Kacker, R., & Richard Kuhn, D. (2021). A Combinatorial
Approach to Explaining Image Classifiers. 2021 IEEE International
Conference on Software Testing, Verification and Validation Workshops
(ICSTW), 35–43.
Chen, H.-Y., & Lee, C.-H. (2020). Vibration Signals Analysis by Explainable Artificial
Intelligence (XAI) Approach: Application on Bearing Faults Diagnosis. IEEE
Access, 8, 134246–134256.
Chen, J.-H., Chen, S. Y.-C., Tsai, Y.-C., & Shur, C.-S. (2020). Explainable Deep
Convolutional Candlestick Learner. Proceedings of the 32nd International
Conference on Software Engineering & Knowledge Engineering (SEKE
2020).
Chou, Y.-h., Hong, S., Zhou, Y., Shang, J., Song, M., & Li, H. (2020).
Knowledge-shot Learning: An Interpretable Deep Model For Classifying
Imbalanced Electrocardiography Data. Neurocomputing, 417, 64–73.

Csiszár, O., Csiszár, G., & Dombi, J. (2020). Interpretable Neural Networks
based on Continuous-valued Logic and Multicriteria Decision Operators.
Knowledge-Based Systems, 199, 105972.
Dağlarli, E. (2020). Explainable Artificial Intelligence (xAI) Approaches and Deep
Meta-Learning Models. In Advances and Applications in Deep Learning.
IntechOpen.
D’Alterio, P., Garibaldi, J. M., & John, R. I. (2020). Constrained Interval
Type-2 Fuzzy Classification Systems for Explainable AI (XAI). 2020 IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE), 1–8.
Dam, H. K., Tran, T., & Ghose, A. (2018). Explainable software analytics.
Proceedings of the 40th International Conference on Software Engineering:
New Ideas and Emerging Results (ICSE-NIER), 53–56.
Da’u, A., & Salim, N. (2020). Recommendation System based on Deep Learning
Methods: A Systematic Review and New Directions. Artificial Intelligence
Review, 53 (4), 2709–2748.
De, T., Giri, P., Mevawala, A., Nemani, R., & Deo, A. (2020). Explainable AI: A
Hybrid Approach to Generate Human-Interpretable Explanation for Deep
Learning Prediction. Procedia Computer Science, 168, 40–48.
de Sousa, I. P., Maria Bernardes Rebuzzi Vellasco, M., & Costa da Silva, E. (2019).
Local Interpretable Model-Agnostic Explanations for Classification of Lymph
Node Metastases. Sensors, 19 (13), 2969.
Díaz-Rodríguez, N., & Pisoni, G. (2020). Accessible Cultural Heritage through
Explainable Artificial Intelligence. Adjunct Publication of the 28th ACM
Conference on User Modeling, Adaptation and Personalization (UMAP),
317–324.
Dindorf, C., Teufl, W., Taetz, B., Bleser, G., & Fröhlich, M. (2020). Interpretability
of Input Representations for Gait Classification in Patients after Total Hip
Arthroplasty. Sensors, 20 (16), 4385.
Došilović, F. K., Brčić, M., & Hlupić, N. (2018). Explainable Artificial Intelligence:
A Survey. 2018 41st International Convention on Information and
Communication Technology, Electronics and Microelectronics (MIPRO),
0210–0215.
Dujmović, J. (2020). Interpretability and Explainability of LSP Evaluation Criteria.
2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 1–8.
Dutta, V., & Zielińska, T. (2020). An Adversarial Explainable Artificial Intelligence
(XAI) Based Approach for Action Forecasting. Journal of Automation,
Mobile Robotics and Intelligent Systems, 3–10.
Eisenstadt, V., Espinoza-Stapelfeld, C., Mikyas, A., & Althoff, K.-D. (2018).
Explainable Distributed Case-Based Support Systems: Patterns for
Enhancement and Validation of Design Recommendations. In M. T. Cox, P.
Funk, & S. Begum (Eds.), Case-Based Reasoning Research and Development
(pp. 78–94). Springer International Publishing.
Fellous, J.-M., Sapiro, G., Rossi, A., Mayberg, H., & Ferrante, M. (2019). Explainable
Artificial Intelligence for Neuroscience: Behavioral Neurostimulation.
Frontiers in Neuroscience, 13.
Féraud, R., & Clérot, F. (2002). A Methodology to Explain Neural Network
Classification. Neural Networks, 15 (2), 237–246.

Fernández, R. R., Martín de Diego, I., Aceña, V., Fernández-Isabel, A., & Moguerza,
J. M. (2020). Random Forest Explainability using Counterfactual Sets.
Information Fusion, 63, 196–207.
Ferreyra, E., Hagras, H., Kern, M., & Owusu, G. (2019). Depicting Decision-Making:
A Type-2 Fuzzy Logic Based Explainable Artificial Intelligence System for
Goal-Driven Simulation in the Workforce Allocation Domain. 2019 IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE), 1–6.
Forestiero, A. (2021). Metaheuristic Algorithm for Anomaly Detection in Internet of
Things leveraging on a Neural-driven Multiagent System. Knowledge-Based
Systems, 228, 107241.
Forestiero, A., Mastroianni, C., & Spezzano, G. (2008). Reorganization and Discovery
of Grid Information with Epidemic Tuning. Future Generation Computer
Systems, 24 (8), 788–797.
Forestiero, A., & Papuzzo, G. (2021). Agents-Based Algorithm for a Distributed
Information System in Internet of Things. IEEE Internet of Things Journal,
8 (22), 16548–16558.
Gade, K., Geyik, S. C., Kenthapadi, K., Mithal, V., & Taly, A. (2019). Explainable AI
in Industry. Proceedings of the 25th ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining (KDD), 3203–3204.
Galhotra, S., Pradhan, R., & Salimi, B. (2021). Explaining Black-Box Algorithms
Using Probabilistic Contrastive Counterfactuals. Proceedings of the 2021
International Conference on Management of Data (SIGMOD), 577–590.
García-Holgado, A., Marcos-Pablos, S., & García-Peñalvo, F. (2020). Guidelines for
performing Systematic Research Projects Reviews. International Journal of
Interactive Multimedia and Artificial Intelligence, 6 (Regular Issue), 136–145.
García-Magariño, I., Muttukrishnan, R., & Lloret, J. (2019). Human-Centric AI for
Trustworthy IoT Systems With Explainable Multilayer Perceptrons. IEEE
Access, 7, 125562–125574.
Genc-Nayebi, N., & Abran, A. (2017). A Systematic Literature Review: Opinion
Mining Studies from Mobile App Store User Reviews. Journal of Systems
and Software, 125, 207–219.
Goebel, R., Chander, A., Holzinger, K., Lecue, F., Akata, Z., Stumpf, S., Kieseberg,
P., & Holzinger, A. (2018). Explainable AI: The New 42? In A. Holzinger,
P. Kieseberg, A. M. Tjoa, & E. Weippl (Eds.), Machine Learning and
Knowledge Extraction (pp. 295–303). Springer International Publishing.
Graziani, M., Andrearczyk, V., Marchand-Maillet, S., & Müller, H. (2020). Concept
Attribution: Explaining CNN Decisions to Physicians. Computers in Biology
and Medicine, 123, 103865.
Guidotti, R., Monreale, A., Giannotti, F., Pedreschi, D., Ruggieri, S., & Turini, F.
(2019). Factual and Counterfactual Explanations for Black Box Decision
Making. IEEE Intelligent Systems, 34 (6), 14–23.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi,
D. (2019). A Survey of Methods for Explaining Black Box Models. ACM
Computing Surveys, 51 (5), 1–42.
Gulum, M. A., Trombley, C. M., & Kantardzic, M. (2021). A Review of Explainable
Deep Learning Cancer Detection Models in Medical Imaging. Applied
Sciences, 11 (10), 4573.

Gunning, D., & Aha, D. W. (2019). DARPA’s Explainable Artificial Intelligence
Program. AI Magazine, 40 (2), 44–58.
Han, M., & Kim, J. (2019). Joint Banknote Recognition and Counterfeit Detection
Using Explainable Artificial Intelligence. Sensors, 19 (16), 3607.
Hatwell, J., Gaber, M. M., & Azad, R. M. A. (2020). Ada-WHIPS: Explaining
AdaBoost Classification with Applications in the Health Sciences. BMC
Medical Informatics and Decision Making, 20 (1), 250.
Hatwell, J., Gaber, M. M., & Azad, R. M. A. (2021). Gbt-HIPS: Explaining the
Classifications of Gradient Boosted Tree Ensembles. Applied Sciences, 11 (6),
2511.
He, X., Chen, T., Kan, M.-Y., & Chen, X. (2015). TriRank: Review-aware Explainable
Recommendation by Modeling Aspects. Proceedings of the 24th ACM
International on Conference on Information and Knowledge Management
(CIKM), 1661–1670.
Helbich, M., Hagenauer, J., Leitner, M., & Edwards, R. (2013). Exploration
of Unstructured Narrative Crime Reports: An Unsupervised Neural
Network and Point Pattern Analysis Approach. Cartography and Geographic
Information Science, 40 (4), 326–336.
Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., & Darrell, T.
(2016). Generating Visual Explanations. In B. Leibe, J. Matas, N. Sebe,
& M. Welling (Eds.), Computer Vision – ECCV 2016 (pp. 3–19). Springer
International Publishing.
Holzinger, A., Carrington, A., & Müller, H. (2020). Measuring the Quality
of Explanations: The System Causability Scale (SCS). KI - Künstliche
Intelligenz, 34 (2), 193–198.
Holzinger, A., Langs, G., Denk, H., Zatloukal, K., & Müller, H. (2019). Causability
and Explainability of Artificial Intelligence in Medicine. WIREs Data Mining
and Knowledge Discovery, 9 (4), e1312.
Hong, C. W., Lee, C., Lee, K., Ko, M.-S., Kim, D. E., & Hur, K. (2020). Remaining
Useful Life Prognosis for Turbofan Engine Using Explainable Deep Neural
Networks with Dimensionality Reduction. Sensors, 20 (22), 6626.
Hu, Z., & Beyeler, M. (2021). Explainable AI for Retinal Prostheses: Predicting
Electrode Deactivation from Routine Clinical Measures. 2021 10th
International IEEE/EMBS Conference on Neural Engineering (NER),
792–796.
Huang, L.-C., Yeung, W., Wang, Y., Cheng, H., Venkat, A., Li, S., Ma, P., Rasheed,
K., & Kannan, N. (2020). Quantitative Structure–Mutation–Activity
Relationship Tests (QSMART) Model for Protein Kinase Inhibitor Response
Prediction. BMC Bioinformatics, 21 (1), 520.
Islam, M. A., Anderson, D. T., Pinar, A. J., Havens, T. C., Scott, G., & Keller, J. M.
(2020). Enabling Explainable Fusion in Deep Learning With Fuzzy Integral
Neural Networks. IEEE Transactions on Fuzzy Systems, 28 (7), 1291–1300.
Itani, S., Lecron, F., & Fortemps, P. (2020). A One-class Classification Decision Tree
based on Kernel Density Estimation. Applied Soft Computing, 91, 106250.
Jiménez-Luna, J., Grisoni, F., & Schneider, G. (2020). Drug Discovery with
Explainable Artificial Intelligence. Nature Machine Intelligence, 2 (10),
573–584.

Jung, A., & Nardelli, P. H. J. (2020). An Information-Theoretic Approach to
Personalized Explainable Machine Learning. IEEE Signal Processing Letters,
27, 825–829.
Jung, Y.-J., Han, S.-H., & Choi, H.-J. (2021). Explaining CNN and RNN Using
Selective Layer-Wise Relevance Propagation. IEEE Access, 9, 18670–18681.
Karlsson, I., Rebane, J., Papapetrou, P., & Gionis, A. (2020). Locally and Globally
Explainable Time Series Tweaking. Knowledge and Information Systems,
62 (5), 1671–1700.
Keneni, B. M., Kaur, D., Al Bataineh, A., Devabhaktuni, V. K., Javaid, A. Y.,
Zaientz, J. D., & Marinier, R. P. (2019). Evolving Rule-Based Explainable
Artificial Intelligence for Unmanned Aerial Vehicles. IEEE Access, 7,
17001–17016.
Kim, J., & Canny, J. (2017). Interpretable Learning for Self-Driving Cars by
Visualizing Causal Attention. 2017 IEEE International Conference on
Computer Vision (ICCV), 2961–2969.
Kitchenham, B. A., & Charters, S. (2007). Guidelines for performing Systematic
Literature Reviews in Software Engineering (tech. rep.). Keele University and
Durham University, UK.
Kouki, P., Schaffer, J., Pujara, J., O’Donovan, J., & Getoor, L. (2020). Generating
and Understanding Personalized Explanations in Hybrid Recommender
Systems. ACM Transactions on Interactive Intelligent Systems, 10 (4),
31:1–31:40.
Kovalev, M. S., & Utkin, L. V. (2020). A Robust Algorithm for Explaining Unreliable
Machine Learning Survival Models using Kolmogorov–Smirnov Bounds.
Neural Networks, 132, 1–18.
Kovalev, M. S., Utkin, L. V., & Kasimov, E. M. (2020). SurvLIME: A Method for
Explaining Machine Learning Survival Models. Knowledge-Based Systems,
203, 106164.
Kwon, B. C., Choi, M.-J., Kim, J. T., Choi, E., Kim, Y. B., Kwon, S., Sun,
J., & Choo, J. (2019). RetainVis: Visual Analytics with Interpretable and
Interactive Recurrent Neural Networks on Electronic Medical Records. IEEE
Transactions on Visualization and Computer Graphics, 25 (1), 299–309.
La Gatta, V., Moscato, V., Postiglione, M., & Sperlì, G. (2021a). CASTLE:
Cluster-Aided Space Transformation for Local Explanations. Expert Systems
with Applications, 179, 115045.
La Gatta, V., Moscato, V., Postiglione, M., & Sperlì, G. (2021b). PASTLE:
Pivot-Aided Space Transformation for Local Explanations. Pattern
Recognition Letters, 149, 67–74.
Lacave, C., & Díez, F. J. (2002). A Review of Explanation Methods for Bayesian
Networks. The Knowledge Engineering Review, 17 (2), 107–127.
Lage, I., Chen, E., He, J., Narayanan, M., Kim, B., Gershman, S., & Doshi-Velez,
F. (2018). An Evaluation of the Human-Interpretability of Explanation.
Proceedings of the 32nd International Conference on Neural Information
Processing Systems (NeurIPS).
Lamy, J.-B., Sedki, K., & Tsopra, R. (2020). Explainable Decision Support Through
the Learning and Visualization of Preferences from a Formal Ontology of
Antibiotic Treatments. Journal of Biomedical Informatics, 104, 103407.

Lamy, J.-B., Sekar, B., Guezennec, G., Bouaud, J., & Séroussi, B. (2019). Explainable
Artificial Intelligence for Breast Cancer: A Visual Case-based Reasoning
Approach. Artificial Intelligence in Medicine, 94, 42–53.
Laugel, T., Lesot, M.-J., Marsala, C., Renard, X., & Detyniecki, M. (2018).
Comparison-Based Inverse Classification for Interpretability in Machine
Learning. In J. Medina, M. Ojeda-Aciego, J. L. Verdegay, D. A. Pelta, I. P.
Cabrera, B. Bouchon-Meunier, & R. R. Yager (Eds.), Information Processing
and Management of Uncertainty in Knowledge-Based Systems. Theory and
Foundations (pp. 100–111). Springer International Publishing.
Lauritsen, S. M., Kristensen, M., Olsen, M. V., Larsen, M. S., Lauritsen, K. M.,
Jørgensen, M. J., Lange, J., & Thiesson, B. (2020). Explainable Artificial
Intelligence Model to Predict Acute Critical Illness from Electronic Health
Records. Nature Communications, 11 (1), 3852.
Le, T., Wang, S., & Lee, D. (2020). GRACE: Generating Concise and Informative
Contrastive Sample to Explain Neural Network Model’s Prediction.
Proceedings of the 26th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (KDD), 238–248.
Letham, B., Rudin, C., McCormick, T. H., & Madigan, D. (2015). Interpretable
Classifiers using Rules and Bayesian Analysis: Building a Better Stroke
Prediction Model. The Annals of Applied Statistics, 9 (3), 1350–1371.
Li, Y., Wang, H., Dang, L. M., Nguyen, T. N., Han, D., Lee, A., Jang, I., & Moon,
H. (2020). A Deep Learning-Based Hybrid Framework for Object Detection
and Recognition in Autonomous Driving. IEEE Access, 8, 194228–194239.
Lin, Z., Lyu, S., Cao, H., Xu, F., Wei, Y., Samet, H., & Li, Y. (2020). HealthWalks:
Sensing Fine-grained Individual Health Condition via Mobility Data.
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous
Technologies, 4 (4), 138:1–138:26.
Lindsay, L., Coleman, S., Kerr, D., Taylor, B., & Moorhead, A. (2020). Explainable
Artificial Intelligence for Falls Prediction. In M. Singh, P. K. Gupta, V.
Tyagi, J. Flusser, T. Ören, & G. Valentino (Eds.), Advances in Computing
and Data Sciences (pp. 76–84). Springer.
Longo, L., Goebel, R., Lecue, F., Kieseberg, P., & Holzinger, A. (2020). Explainable
Artificial Intelligence: Concepts, Applications, Research Challenges and
Visions. In A. Holzinger, P. Kieseberg, A. M. Tjoa, & E. Weippl
(Eds.), Machine Learning and Knowledge Extraction (pp. 1–16). Springer
International Publishing.
Lorente, M. P. S., Lopez, E. M., Florez, L. A., Espino, A. L., Martínez, J. A. I.,
& de Miguel, A. S. (2021). Explaining Deep Learning-Based Driver Models.
Applied Sciences, 11 (8), 3321.
Loyola-Gonzalez, O. (2019). Black-Box vs. White-Box: Understanding Their
Advantages and Weaknesses From a Practical Point of View. IEEE Access,
7, 154096–154113.
Loyola-González, O. (2019). Understanding the Criminal Behavior in Mexico
City through an Explainable Artificial Intelligence Model. In L.
Martínez-Villaseñor, I. Batyrshin, & A. Marín-Hernández (Eds.), Advances
in Soft Computing (pp. 136–149). Springer International Publishing.

Loyola-González, O., Gutierrez-Rodríguez, A. E., Medina-Pérez, M. A., Monroy,
R., Martínez-Trinidad, J. F., Carrasco-Ochoa, J. A., & García-Borroto, M.
(2020). An Explainable Artificial Intelligence Model for Clustering Numerical
Databases. IEEE Access, 8, 52370–52384.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin, J. M., Nair, B., Katz,
R., Himmelfarb, J., Bansal, N., & Lee, S.-I. (2020). From Local Explanations
to Global Understanding with Explainable AI for Trees. Nature Machine
Intelligence, 2 (1), 56–67.
Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model
Predictions. Proceedings of the 31st International Conference on Neural
Information Processing Systems (NeurIPS), 4768–4777.
Lundberg, S. M., Nair, B., Vavilala, M. S., Horibe, M., Eisses, M. J., Adams, T.,
Liston, D. E., Low, D. K.-W., Newman, S.-F., Kim, J., & Lee, S.-I. (2018).
Explainable Machine Learning Predictions to Help Anesthesiologists Prevent
Hypoxemia During Surgery. Nature Biomedical Engineering, 2 (10), 749–760.
Magdalena, L. (2018). Designing Interpretable Hierarchical Fuzzy Systems. 2018
IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 1–8.
Martínez-Cebrián, J., Fernández-Torres, M.-Á., & Díaz-De-María, F. (2020).
Interpretable Global-Local Dynamics for the Prediction of Eye Fixations
in Autonomous Driving Scenarios. IEEE Access, 8, 217068–217085.
Massie, S., Craw, S., & Wiratunga, N. (2005). A Visualisation Tool to Explain
Case-Base Reasoning Solutions for Tablet Formulation. In A. Macintosh, R.
Ellis, & T. Allen (Eds.), Applications and Innovations in Intelligent Systems
XII (pp. 222–234). Springer.
Mathews, S. M. (2019). Explainable Artificial Intelligence Applications in NLP,
Biomedical, and Malware Classification: A Literature Review. In K. Arai, R.
Bhatia, & S. Kapoor (Eds.), Intelligent Computing (pp. 1269–1292). Springer
International Publishing.
Meskauskas, Z., Jasinevicius, R., Kazanavicius, E., & Petrauskas, V. (2020).
XAI-Based Fuzzy SWOT Maps for Analysis of Complex Systems. 2020 IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE), 1–8.
Ming, Y., Qu, H., & Bertini, E. (2019). RuleMatrix: Visualizing and Understanding
Classifiers with Rules. IEEE Transactions on Visualization and Computer
Graphics, 25 (1), 342–352.
Mittelstadt, B., Russell, C., & Wachter, S. (2019). Explaining Explanations in AI.
Proceedings of the Conference on Fairness, Accountability, and Transparency
(FAT*), 279–288.
Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., & Group, T. P. (2009). Preferred
Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA
Statement. PLOS Medicine, 6 (7), e1000097.
Montavon, G., Lapuschkin, S., Binder, A., Samek, W., & Müller, K.-R.
(2017). Explaining Nonlinear Classification Decisions with Deep Taylor
Decomposition. Pattern Recognition, 65, 211–222.
Moradi, M., & Samwald, M. (2021). Post-hoc Explanation of Black-box Classifiers
using Confident Itemsets. Expert Systems with Applications, 165, 113941.
Muddamsetty, S. M., Jahromi, M. N. S., & Moeslund, T. B. (2021). Expert Level
Evaluations for Explainable AI (XAI) Methods in the Medical Domain.

In A. Del Bimbo, R. Cucchiara, S. Sclaroff, G. M. Farinella, T. Mei, M.
Bertini, H. J. Escalante, & R. Vezzani (Eds.), Pattern Recognition. ICPR
International Workshops and Challenges (pp. 35–46). Springer International
Publishing.
Murray, B. J., Anderson, D. T., Havens, T. C., Wilkin, T., & Wilbik, A.
(2020). Information Fusion-2-Text: Explainable Aggregation via Linguistic
Protoforms. In M.-J. Lesot, S. Vieira, M. Z. Reformat, J. P. Carvalho, A.
Wilbik, B. Bouchon-Meunier, & R. R. Yager (Eds.), Information Processing
and Management of Uncertainty in Knowledge-Based Systems (pp. 114–127).
Springer International Publishing.
Neches, R., Swartout, W., & Moore, J. (1985). Enhanced Maintenance and
Explanation of Expert Systems Through Explicit Models of Their
Development. IEEE Transactions on Software Engineering, SE-11 (11),
1337–1351.
Nowak, T., Nowicki, M. R., Ćwian, K., & Skrzypczyński, P. (2019). How to Improve
Object Detection in a Driver Assistance System Applying Explainable Deep
Learning. 2019 IEEE Intelligent Vehicles Symposium (IV), 226–231.
Oh, K., Kim, S., & Oh, I.-S. (2020). Salient Explanation for Fine-Grained
Classification. IEEE Access, 8, 61433–61441.
Oramas, J., Wang, K., & Tuytelaars, T. (2019). Visual Explanation by
Interpretation: Improving Visual Feedback Capabilities of Deep Neural
Networks. Proceedings of the Seventh International Conference on Learning
Representations (ICLR).
Panigutti, C., Perotti, A., & Pedreschi, D. (2020). Doctor XAI: An Ontology-based
Approach to Black-box Sequential Data Classification Explanations.
Proceedings of the 2020 Conference on Fairness, Accountability, and
Transparency (FAT*), 629–639.
Payrovnaziri, S. N., Chen, Z., Rengifo-Moreno, P., Miller, T., Bian, J., Chen, J. H.,
Liu, X., & He, Z. (2020). Explainable Artificial Intelligence Models using
Real-world Electronic Health Record Data: A Systematic Scoping Review.
Journal of the American Medical Informatics Association, 27 (7), 1173–1185.
Pierrard, R., Poli, J.-P., & Hudelot, C. (2018). Learning Fuzzy Relations and
Properties for Explainable Artificial Intelligence. 2018 IEEE International
Conference on Fuzzy Systems (FUZZ-IEEE), 1–8.
Pintelas, E., Liaskos, M., Livieris, I. E., Kotsiantis, S., & Pintelas, P. (2020).
Explainable Machine Learning Framework for Image Classification Problems:
Case Study on Glioma Cancer Prediction. Journal of Imaging, 6 (6), 37.
Plumb, G., Molitor, D., & Talwalkar, A. (2018). Model Agnostic Supervised Local
Explanations. Proceedings of the 32nd International Conference on Neural
Information Processing Systems (NeurIPS), 2520–2529.
Ponn, T., Kröger, T., & Diermeyer, F. (2020). Identification and Explanation of
Challenging Conditions for Camera-Based Object Detection of Automated
Vehicles. Sensors, 20 (13), 3699.
Porto, R., Molina, J. M., Berlanga, A., & Patricio, M. A. (2021). Minimum Relevant
Features to Obtain Explainable Systems for Predicting Cardiovascular
Disease Using the Statlog Data Set. Applied Sciences, 11 (3), 1285.

Poyiadzi, R., Sokol, K., Santos-Rodriguez, R., De Bie, T., & Flach, P. (2020). FACE:
Feasible and Actionable Counterfactual Explanations. Proceedings of the
AAAI/ACM Conference on AI, Ethics, and Society, 344–350.
Preece, A., Harborne, D., Braines, D., Tomsett, R., & Chakraborty, S. (2018).
Stakeholders in Explainable AI. Proceedings of AAAI FSS-18: Artificial
Intelligence in Government and Public Sector.
Prifti, E., Chevaleyre, Y., Hanczar, B., Belda, E., Danchin, A., Clément, K., &
Zucker, J.-D. (2020). Interpretable and Accurate Prediction Models for
Metagenomics Data. GigaScience, 9 (3), giaa010.
Rai, A. (2020). Explainable AI: From Black Box to Glass Box. Journal of the Academy
of Marketing Science, 48 (1), 137–141.
Ramos-Soto, A., & Pereira-Fariña, M. (2018). Reinterpreting Interpretability for
Fuzzy Linguistic Descriptions of Data. In J. Medina, M. Ojeda-Aciego,
J. L. Verdegay, D. A. Pelta, I. P. Cabrera, B. Bouchon-Meunier, & R. R.
Yager (Eds.), Information Processing and Management of Uncertainty in
Knowledge-Based Systems. Theory and Foundations (pp. 40–51). Springer
International Publishing.
Rehse, J.-R., Mehdiyev, N., & Fettke, P. (2019). Towards Explainable Process
Predictions for Industry 4.0 in the DFKI-Smart-Lego-Factory. KI -
Künstliche Intelligenz, 33 (2), 181–187.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016a). “Why Should I Trust You?”:
Explaining the Predictions of Any Classifier. Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), 1135–1144.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016b). Model-Agnostic Interpretability
of Machine Learning. 2016 ICML Workshop on Human Interpretability in
Machine Learning (WHI).
Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). Anchors: High-Precision
Model-Agnostic Explanations. Proceedings of the AAAI Conference on
Artificial Intelligence, 32.
Rio-Torto, I., Fernandes, K., & Teixeira, L. F. (2020). Understanding the Decisions
of CNNs: An In-model Approach. Pattern Recognition Letters, 133, 373–380.
Riquelme, F., De Goyeneche, A., Zhang, Y., Niebles, J. C., & Soto, A. (2020).
Explaining VQA Predictions using Visual Grounding and a Knowledge Base.
Image and Vision Computing, 101, 103968.
Robnik-Šikonja, M., & Bohanec, M. (2018). Perturbation-Based Explanations of
Prediction Models. In J. Zhou & F. Chen (Eds.), Human and Machine
Learning: Visible, Explainable, Trustworthy and Transparent (pp. 159–175).
Springer International Publishing.
Rubio-Manzano, C., Segura-Navarrete, A., Martinez-Araneda, C., & Vidal-Castro,
C. (2021). Explainable Hopfield Neural Networks Using an Automatic
Video-Generation System. Applied Sciences, 11 (13), 5771.
Rudin, C. (2019). Stop Explaining Black Box Machine Learning Models for High
Stakes Decisions and Use Interpretable Models Instead. Nature Machine
Intelligence, 1 (5), 206–215.
Rutkowski, T., Łapa, K., & Nielek, R. (2019). On Explainable Fuzzy Recommenders
and their Performance Evaluation. International Journal of Applied
Mathematics and Computer Science, 29 (3), 595–610.

Sabol, P., Sinčák, P., Magyar, J., & Hartono, P. (2019). Semantically Explainable
Fuzzy Classifier. International Journal of Pattern Recognition and Artificial
Intelligence, 33 (12), 2051006.
Samek, W., Binder, A., Montavon, G., Lapuschkin, S., & Müller, K.-R. (2017).
Evaluating the Visualization of What a Deep Neural Network Has Learned.
IEEE Transactions on Neural Networks and Learning Systems, 28 (11),
2660–2673.
Samek, W., & Müller, K.-R. (2019). Towards Explainable Artificial Intelligence. In
W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, & K.-R. Müller (Eds.),
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning
(pp. 5–22). Springer International Publishing.
Sarathy, N., Alsawwaf, M., & Chaczko, Z. (2020). Investigation of an Innovative
Approach for Identifying Human Face-Profile Using Explainable Artificial
Intelligence. 2020 IEEE 18th International Symposium on Intelligent
Systems and Informatics (SISY), 155–160.
Sarp, S., Kuzlu, M., Cali, U., Elma, O., & Guler, O. (2021). An Interpretable Solar
Photovoltaic Power Generation Forecasting Approach Using An Explainable
Artificial Intelligence Tool. 2021 IEEE Power & Energy Society Innovative
Smart Grid Technologies Conference (ISGT), 1–5.
Schönhof, R., Werner, A., Elstner, J., Zopcsak, B., Awad, R., & Huber, M. (2021).
Feature Visualization within an Automated Design Assessment Leveraging
Explainable Artificial Intelligence Methods. Procedia CIRP, 100, 331–336.
Schorr, C., Goodarzi, P., Chen, F., & Dahmen, T. (2021). Neuroscope: An
Explainable AI Toolbox for Semantic Segmentation and Image Classification
of Convolutional Neural Nets. Applied Sciences, 11 (5), 2199.
Segura, V., Brandão, B., Fucs, A., & Vital Brazil, E. (2019). Towards Explainable AI
Using Similarity: An Analogues Visualization System. In A. Marcus & W.
Wang (Eds.), Design, User Experience, and Usability. User Experience in
Advanced Technological Environments (pp. 389–399). Springer International
Publishing.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2020).
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based
Localization. International Journal of Computer Vision, 128 (2), 336–359.
Senatore, R., Della Cioppa, A., & Marcelli, A. (2019). Automatic Diagnosis of
Neurodegenerative Diseases: An Evolutionary Approach for Facing the
Interpretability Problem. Information, 10 (1), 30.
Serradilla, O., Zugasti, E., Cernuda, C., Aranburu, A., de Okariz, J. R., &
Zurutuza, U. (2020). Interpreting Remaining Useful Life Estimations
Combining Explainable Artificial Intelligence and Domain Knowledge in
Industrial Machinery. 2020 IEEE International Conference on Fuzzy Systems
(FUZZ-IEEE), 1–8.
Shalaeva, V., Alkhoury, S., Marinescu, J., Amblard, C., & Bisson, G. (2018).
Multi-Operator Decision Trees for Explainable Time-Series Classification.
In J. Medina, M. Ojeda-Aciego, J. L. Verdegay, D. A. Pelta, I. P. Cabrera,
B. Bouchon-Meunier, & R. R. Yager (Eds.), Information Processing and
Management of Uncertainty in Knowledge-Based Systems. Theory and
Foundations (pp. 86–99). Springer International Publishing.

Soares, E., Angelov, P., & Gu, X. (2020). Autonomous Learning Multiple-Model
Zero-order Classifier for Heart Sound Classification. Applied Soft Computing,
94, 106449.
Sokol, K., & Flach, P. (2020). Explainability Fact Sheets: A Framework for Systematic
Assessment of Explainable Approaches. Proceedings of the 2020 Conference
on Fairness, Accountability, and Transparency (FAT*), 56–67.
Spinner, T., Schlegel, U., Schäfer, H., & El-Assady, M. (2020). explAIner: A
Visual Analytics Framework for Interactive and Explainable Machine
Learning. IEEE Transactions on Visualization and Computer Graphics,
26 (1), 1064–1074.
Štrumbelj, E., & Kononenko, I. (2014). Explaining Prediction Models and Individual
Predictions with Feature Contributions. Knowledge and Information
Systems, 41 (3), 647–665.
Sun, K. H., Huh, H., Tama, B. A., Lee, S. Y., Jung, J. H., & Lee, S. (2020).
Vision-Based Fault Diagnostics Using Explainable Deep Learning With Class
Activation Maps. IEEE Access, 8, 129169–129179.
Tabik, S., Gómez-Ríos, A., Martín-Rodríguez, J. L., Sevillano-García, I., Rey-Area,
M., Charte, D., Guirado, E., Suárez, J. L., Luengo, J., Valero-González,
M. A., García-Villanova, P., Olmedo-Sánchez, E., & Herrera, F. (2020).
COVIDGR Dataset and COVID-SDNet Methodology for Predicting
COVID-19 Based on Chest X-Ray Images. IEEE Journal of Biomedical and
Health Informatics, 24 (12), 3595–3605.
Tan, R., Khan, N., & Guan, L. (2020). Locality Guided Neural Networks for
Explainable Artificial Intelligence. Proceedings of the 2020 International
Joint Conference on Neural Networks (IJCNN).
ten Zeldam, S., Jong, A. D., Loendersloot, R., & Tinga, T. (2018). Automated Failure
Diagnosis in Aviation Maintenance Using eXplainable Artificial Intelligence
(XAI). Proceedings of the 4th European Conference of the PHM Society
(PHME).
Tintarev, N., Rostami, S., & Smyth, B. (2018). Knowing the Unknown: Visualising
Consumption Blind-spots in Recommender Systems. Proceedings of the 33rd
Annual ACM Symposium on Applied Computing (SAC), 1396–1399.
van der Waa, J., Nieuwburg, E., Cremers, A., & Neerincx, M. (2021). Evaluating XAI:
A Comparison of Rule-based and Example-based Explanations. Artificial
Intelligence, 291, 103404.
van der Waa, J., Schoonderwoerd, T., van Diggelen, J., & Neerincx, M.
(2020). Interpretable Confidence Measures for Decision Support Systems.
International Journal of Human-Computer Studies, 144, 102493.
van Lent, M., Fisher, W., & Mancuso, M. (2004). An Explainable Artificial
Intelligence System for Small-unit Tactical Behavior. Proceedings of the
16th Conference on Innovative Applications of Artifical Intelligence (IAAI),
900–907.
Vilone, G., & Longo, L. (2020). Explainable Artificial Intelligence: A Systematic
Review. ArXiv, (arXiv:2006.00093v4 [cs.AI]).
Vilone, G., & Longo, L. (2021a). Classification of Explainable Artificial Intelligence
Methods through Their Output Formats. Machine Learning and Knowledge
Extraction, 3 (3), 615–661.

Vilone, G., & Longo, L. (2021b). Notions of Explainability and Evaluation
Approaches for Explainable Artificial Intelligence. Information Fusion, 76,
89–106.
Vlek, C. S., Prakken, H., Renooij, S., & Verheij, B. (2016). A Method for Explaining
Bayesian Networks for Legal Evidence with Scenarios. Artificial Intelligence
and Law, 24 (3), 285–324.
Wachter, S., Mittelstadt, B., & Russell, C. (2018). Counterfactual Explanations
without Opening the Black Box: Automated Decisions and the GDPR.
Harvard Journal of Law & Technology, 31 (2), 841–887.
Wang, D., Yang, Q., Abdul, A., & Lim, B. Y. (2019). Designing Theory-Driven
User-Centric Explainable AI. Proceedings of the 2019 CHI Conference on
Human Factors in Computing Systems (CHI), 1–15.
Wang, X., Wang, D., Xu, C., He, X., Cao, Y., & Chua, T.-S. (2019). Explainable
Reasoning over Knowledge Graphs for Recommendation. Proceedings of the
Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First
Innovative Applications of Artificial Intelligence Conference and Ninth AAAI
Symposium on Educational Advances in Artificial Intelligence, 5329–5336.
Weber, R. O., Johs, A. J., Li, J., & Huang, K. (2018). Investigating Textual
Case-Based XAI. In M. T. Cox, P. Funk, & S. Begum (Eds.), Case-Based
Reasoning Research and Development (pp. 431–447). Springer International
Publishing.
Wohlin, C. (2014). Guidelines for Snowballing in Systematic Literature Studies and
a Replication in Software Engineering. Proceedings of the 18th International
Conference on Evaluation and Assessment in Software Engineering (EASE),
1–10.
Xu, F., Uszkoreit, H., Du, Y., Fan, W., Zhao, D., & Zhu, J. (2019). Explainable AI:
A Brief Survey on History, Research Areas, Approaches and Challenges. In
J. Tang, M.-Y. Kan, D. Zhao, S. Li, & H. Zan (Eds.), Natural Language
Processing and Chinese Computing (pp. 563–574). Springer International
Publishing.
Yang, S. C.-H., Vong, W. K., Sojitra, R. B., Folke, T., & Shafto, P. (2021). Mitigating
Belief Projection in Explainable Artificial Intelligence via Bayesian Teaching.
Scientific Reports, 11 (1), 9863.
Yang, W., Li, J., Xiong, C., & Hoi, S. C. H. (2022). MACE: An
Efficient Model-Agnostic Framework for Counterfactual Explanation. ArXiv,
(arXiv:2205.15540v1 [cs.AI]).
Yang, Z., Zhang, A., & Sudjianto, A. (2021). Enhancing Explainability of Neural
Networks Through Architecture Constraints. IEEE Transactions on Neural
Networks and Learning Systems, 32 (6), 2610–2621.
Yeganejou, M., Dick, S., & Miller, J. (2020). Interpretable Deep Convolutional Fuzzy
Classifier. IEEE Transactions on Fuzzy Systems, 28 (7), 1407–1419.
Zhang, K., Zhang, J., Xu, P.-D., Gao, T., & Gao, D. W. (2022). Explainable AI in
Deep Reinforcement Learning Models for Power System Emergency Control.
IEEE Transactions on Computational Social Systems, 9 (2), 419–427.
Zhang, Q., Wu, Y. N., & Zhu, S.-C. (2018). Interpretable Convolutional Neural
Networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 8827–8836.

Zhao, G., Fu, H., Song, R., Sakai, T., Chen, Z., Xie, X., & Qian, X. (2019).
Personalized Reason Generation for Explainable Song Recommendation.
ACM Transactions on Intelligent Systems and Technology, 10 (4), 41:1–41:21.
Zheng, Q., Delingette, H., & Ayache, N. (2019). Explainable Cardiac Pathology
Classification on Cine MRI with Motion Characterization by Semi-supervised
Learning of Apparent Flow. Medical Image Analysis, 56, 80–95.
Zhong, Q., Fan, X., Luo, X., & Toni, F. (2019). An Explainable Multi-attribute
Decision Model based on Argumentation. Expert Systems with Applications,
117, 42–61.

Paper C

Local and Global Interpretability using Mutual Information in Explainable Artificial Intelligence

Islam, M. R., Ahmed, M. U. & Begum, S.


Paper C

Local and Global Interpretability using Mutual Information in Explainable Artificial Intelligence†

Abstract
Numerous studies have exploited the potential of Artificial Intelligence (AI)
and Machine Learning (ML) models to develop intelligent systems in diverse
domains for complex tasks, such as analysing data, extracting features,
prediction, recommendation, etc. However, these systems presently face
acceptability issues from end-users. The models deployed behind such systems
mostly analyse the correlations or dependencies between the input and output
to uncover the important characteristics of the input features, but they lack
explainability and interpretability, which causes the acceptability issues of
intelligent systems and has given rise to the research domain of eXplainable
Artificial Intelligence (XAI). In this study, to overcome these shortcomings,
a hybrid XAI approach is developed to explain an AI/ML model’s inference
mechanism as well as its final outcome. The overall approach comprises
1) a convolutional encoder that extracts deep features from the data and
computes their relevancy with features extracted using domain knowledge,
2) a model for classifying data points using the features from the autoencoder,
and 3) a process of explaining the model’s working procedure and decisions
using mutual information to provide global and local interpretability. To
demonstrate and validate the proposed approach, experiments were performed
using an electroencephalography dataset from the road safety domain to
classify drivers’ in-vehicle mental workload. The outcome of the experiment
was promising: it produced a Support Vector Machine classifier for mental
workload with approximately 89% accuracy. Moreover, the proposed approach
can also provide an explanation of the classifier model’s behaviour and
decisions through a combined illustration of Shapley values and mutual
information.
Keywords: Autoencoder · Electroencephalography · Explainability ·
Feature Extraction · Mental Workload · Mutual Information.

† © 2021 IEEE. Reprinted, with permission, from Islam, M. R., Ahmed, M. U., & Begum,
S. (2021). Local and Global Interpretability using Mutual Information in Explainable Artificial
Intelligence. Proceedings of the 8th International Conference on Soft Computing & Machine
Intelligence (ISCMI), 191–195.

C.1 Introduction

Recent developments in Artificial Intelligence (AI) and Machine Learning (ML) have
been embraced in almost every domain in the form of automated and semi-automated
systems. However, despite the growing popularity of these systems, the AI/ML
algorithms acting behind them still suffer from acceptability issues due to the lack
of explanations of the algorithms’ inference mechanisms and decisions.
Realising the dire need to explain or interpret AI/ML model-based intelligent
systems, the research domain of eXplainable Artificial Intelligence (XAI) emerged.
Currently, XAI research is expanding rapidly, developing methods for generating
explanations that enhance the local and global interpretability of AI/ML models.
Global interpretability refers to interpreting a model’s overall inference mechanism,
whereas local interpretability indicates the understandability of a specific decision
made by an AI/ML model (Guidotti et al., 2019). Several tools have already been
proposed by researchers to generate explanations and interpretability for AI/ML
models, such as Local Interpretable Model Agnostic Explanations (LIME) (Ribeiro
et al., 2016) and SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017).
However, the understandability of the explanations from these tools is highly
dependent on domain expertise.
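As an illustration of how such a tool is typically invoked (a minimal sketch with an
assumed tree-based classifier and synthetic data, not the configuration used in this
study), SHAP can attribute a single prediction to the input features as follows:

```python
# Minimal SHAP usage sketch: local feature attributions for one prediction.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley value approximations for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])  # local explanation for the first sample

# Each value quantifies how much a feature pushed this prediction up or down.
print(shap_values)
```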
Many fields from diverse domains have already been facilitated by XAI research,
such as image processing (Wu et al., 2020), anomaly detection (Antwarg et al., 2021),
and predictive maintenance (Serradilla et al., 2021). In contrast, safety-critical
domains concerning human life, e.g., road safety, have received less attention from
XAI researchers. Only a few examples are found in the literature, such as explaining
motorbike riding patterns (Leyli abadi & Boubezoul, 2021), and the depth of XAI
research concerning car drivers is still shallow. However, AI/ML approaches have
been well investigated for in-vehicle road safety features such as drivers’ drowsiness
detection and intelligent speed assistance, utilising vehicular signals, neurophysiological
signals, etc. Specifically, neurophysiological signals, e.g., electroencephalography
(EEG) and electrocardiography (ECG), are among the major tools for assessing
a driver’s in-vehicle performance (Borghini et al., 2014). The major challenge of
utilising EEG signals in an AI/ML approach is the feature extraction procedure,
which demands extensive expert involvement and manual computation. Automatic
approaches have already proved efficient in extracting features from EEG by
leveraging the computational strength of convolutional neural network (CNN) based
autoencoders (Islam et al., 2019), but they lack explainability of the extracted features.
Autoencoders of different architectures have been exploited in several studies to
explain diverse tasks, such as forecasting energy demand (Kim & Cho, 2021), classifying
time series (Leonardi et al., 2020), and detecting anomalies (Antwarg et al., 2021) and
changes in temporal images (Bergamasco et al., 2020). Moreover, autoencoders
have been used to enhance the quality of explanations from different explainability
tools (Shankaranarayana & Runje, 2019). All of these works contribute to explaining
decisions or enhancing explanations, but no evidence was found of explaining the deep
features that can be extracted using an autoencoder.
One of the major challenges of explaining a model and/or its decisions is to extract
the underlying relation between the input and the output. Recently, the concept of
mutual information has drawn the attention of XAI researchers due to its natural way
of quantifying the relevancy between two random variables (Cover & Thomas, 2006).
Upon realising the potential of mutual information and the need to explain features
to induce global interpretability in AI/ML models, this study proposes a hybrid
approach to feature explanation using mutual information combined with the
explanations generated by the popular explainability tool SHAP. The idea rests
on the fact that mutual information is a proper means of incorporating domain
knowledge, as demonstrated in several recent studies on recommender systems
(Noshad et al., 2021), automated fault diagnosis (Luo et al., 2019), feature
extraction (Islam et al., 2020), etc.
In summary, to expand the research domain of XAI and contribute to road safety,
this study aims at utilising the EEG signals recorded from car drivers to demonstrate
the proposed concept of explaining autoencoder extracted features using mutual
information, followed by explaining mental workload classification to achieve local
and global interpretability. To achieve this aim, two major objectives are set and
stated below:

• To propose a novel approach of using mutual information to explain autoencoder extracted EEG features.
• To demonstrate a hybrid methodology of explaining mental workload classification from the autoencoder extracted features using SHAP and mutual information to induce local and global interpretability.

The remaining parts of this article are arranged as follows. Section C.2 contains
a description of the materials and methods. In Section C.3, the obtained results are
presented and discussed thoroughly. Finally, the conclusion of this study and possible
research directions are stated in Section C.4.

C.2 Materials and Methods

C.2.1 Data Acquisition and Preprocessing


The data, specifically the EEG signals, analysed in this study was collected
under the framework of the project BrainSafeDrive1, through a naturalistic driving
experiment on a route around the urban periphery of Bologna, Italy.
During the experiment, 20 male participants drove along a 2.5km long circuit route
twice, in normal and rush hour in random order, for three laps each time. Moreover, each circuit
consisted of easy and hard segments containing roads through a busy industrial area
and a comparatively quiet residential area, respectively. The experimental road, hour
and segments were selected to induce different levels of workload in the drivers.
Additional description of the experimental protocol is available in the articles
(Di Flumeri et al., 2018; Islam et al., 2020).
1 http://brainsafedrive.brainsigns.com


To record EEG signals, the digital monitoring BEmicro system (EBNeuro, Italy)
was used with 15 active EEG channels (FPz, AF3, AF4, F3, Fz, F4, P5, P3, Pz,
P4, P6, POz, O1, Oz and O2) placed according to the 10–20 International System.
The sampling frequency was 256Hz and the channel impedance was kept below
20kΩ. During the experiments, raw EEG signals were recorded and the processing
was applied offline. In particular, each EEG signal was first band-pass filtered
with a fourth-order Butterworth infinite impulse response (IIR) filter (high-pass filter
cut-off frequency: 1Hz, low-pass filter cut-off frequency: 30Hz). Afterwards, the ARTE
(Automated aRTifacts handling in EEG) algorithm (Barua et al., 2018) was deployed
to remove various artefacts, such as drivers' movements and environmental noise,
from the recorded EEG signals. Finally, the EEG signals were sliced into epochs
of 2s length (0.5Hz frequency resolution) using a sliding window technique
with a stride of 0.125s, keeping an overlap of 1.875s between two consecutive epochs.
The windowing technique was performed to obtain a higher number of observations in
comparison with the number of variables and to preserve the stationarity condition of
the EEG signals (Elul, 1969).
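
The band-pass filtering and artefact handling rely on the tools cited above, but the windowing step itself is straightforward. The following is a minimal NumPy sketch of the sliding-window epoching, assuming a (channels × samples) array; the function name and the synthetic input are illustrative only.

import numpy as np

def slice_epochs(eeg, fs=256, epoch_len_s=2.0, stride_s=0.125):
    # Slice a (channels, samples) EEG array into overlapping epochs.
    # A 2 s window with a 0.125 s stride keeps a 1.875 s overlap between
    # consecutive epochs, as described above.
    win = int(epoch_len_s * fs)    # 512 samples per epoch
    step = int(stride_s * fs)      # 32-sample stride
    n_channels, n_samples = eeg.shape
    starts = range(0, n_samples - win + 1, step)
    return np.stack([eeg[:, s:s + win] for s in starts])

# Example with synthetic data: 15 channels, 60 s of signal at 256 Hz
epochs = slice_epochs(np.random.randn(15, 60 * 256))
print(epochs.shape)  # (n_epochs, 15, 512)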

C.2.2 Feature Extraction


The feature extraction process was performed from two different perspectives. First,
features were extracted based on the Power Spectral Density (PSD) to incorporate
domain knowledge in the feature set. In the second approach, a convolutional
autoencoder was developed to extract features from the EEG signals, to capture
deeper insights from the data and reduce human involvement. Both approaches
are briefly described below.

C.2.2.1 Features from Power Spectral Density


From the clean and segmented EEG signals, the PSD was calculated for each
EEG channel and each epoch using the Fast Fourier Transformation (FFT) and a
Hanning window of equal epoch length, i.e., 2s. Then, the EEG frequency bands
of interest were defined for each subject by estimating the Individual Alpha
Frequency (IAF) value (Corcoran et al., 2018). The IAF value was determined as
the peak of the general alpha rhythm frequency (8 − 12Hz). Subsequently, the average
PSD within the theta band [IAF −6, IAF −2], the alpha band [IAF −2, IAF +2] and
the beta band [IAF +2, IAF +18] was calculated over all the EEG channels. Finally,
a spectral feature vector containing 45 features (15 EEG channels × 3 frequency
bins) was obtained from the frequency bands directly correlated to mental
workload, as manifested in the previous scientific literature (Borghini et al., 2014).
In fact, one of the prime biomarkers of human mental workload is the ratio between
the Frontal Theta and Parietal Alpha spectral content (Borghini et al., 2014).
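
As an illustration of the band-power computation, the sketch below estimates the PSD of a single 2s epoch with a Hanning window and averages it within the IAF-anchored bands. It is only an approximation of the procedure: the original work used the FFT directly, whereas this sketch uses scipy.signal.welch with a full-length segment (equivalent to a single windowed periodogram), and the IAF value and input data are placeholders.

import numpy as np
from scipy.signal import welch

def band_features(epoch, fs=256, iaf=10.0):
    # epoch: (channels, samples) array for one 2 s segment.
    # Returns the average band power per channel for theta, alpha and beta,
    # concatenated into a single vector (15 channels x 3 bands = 45 features).
    bands = {"theta": (iaf - 6, iaf - 2),
             "alpha": (iaf - 2, iaf + 2),
             "beta":  (iaf + 2, iaf + 18)}
    # A Hanning window over the full epoch gives a 0.5 Hz frequency resolution
    freqs, psd = welch(epoch, fs=fs, window="hann", nperseg=epoch.shape[-1])
    feats = []
    for lo, hi in bands.values():
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, mask].mean(axis=1))
    return np.concatenate(feats)

features = band_features(np.random.randn(15, 512))
print(features.shape)  # (45,)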

C.2.2.2 Features from Convolutional Autoencoder


Traditionally, the convolutional autoencoder architecture consists of two segments:
(i) the encoder and (ii) the decoder. A number of convolutional layers associated with
pooling layers form the encoder segment to find the deep hidden features in the
original signal. On the contrary, the decoder contains several deconvolutional layers to
reconstruct the input signal from the features by minimising the residuals. The
autoencoder is trained through the process of encoding and reconstruction with a predefined
number of epochs and batch size. Here, several adjustments of the number of convolutional layers
and associated parameters were performed, and the encoder was finalised with three
convolutional layers and three max-pooling layers followed by a flattening layer.
Table C.1 presents the summary of the layers of the encoder with a total of 732
trainable parameters. The output shape of the input layer is (512, 16, 1), which
contains one clean EEG signal epoch of length 2s (at 256Hz sampling frequency) from
15 channels, with one additional channel filled with zeros to facilitate the design of the
encoder. The decoder was designed in the inverse order of the structure of the encoder,
containing four convolutional layers and three upsampling layers facilitating the
depooling mechanism. In each of the convolutional layers, batch normalisation with
the ReLU activation function was invoked with zero padding. The developed autoencoder
utilised RMSprop optimisation with a learning rate of 0.002 and binary cross-entropy
as the loss function. Finally, 32 features were extracted from the cleaned EEG epochs
in accordance with the output shape of the flattening layer of the encoder.

Table C.1: Summary of the designed convolutional encoder. © 2021 IEEE.

Layer Type      Output Shape    No. of Parameters
Input           (512, 16, 1)    0
Convolutional   (256, 8, 16)    80
MaxPooling      (128, 4, 16)    0
Convolutional   (64, 2, 8)      520
MaxPooling      (32, 1, 8)      0
Convolutional   (16, 1, 4)      132
MaxPooling      (8, 1, 4)       0
Flattening      (32)            0
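
A minimal tf.keras sketch of an encoder consistent with the shapes in Table C.1 is given below. The kernel sizes, strides and pooling windows are inferred from the reported output shapes and parameter counts and may differ from the original implementation; batch normalisation and the mirrored decoder (trained with RMSprop, learning rate 0.002 and binary cross-entropy) are omitted for brevity.

from tensorflow.keras import layers, models

def build_encoder(input_shape=(512, 16, 1)):
    inputs = layers.Input(shape=input_shape)
    # Three convolution + max-pooling stages, matching Table C.1
    x = layers.Conv2D(16, (2, 2), strides=2, padding="same", activation="relu")(inputs)  # (256, 8, 16)
    x = layers.MaxPooling2D((2, 2))(x)                                                   # (128, 4, 16)
    x = layers.Conv2D(8, (2, 2), strides=2, padding="same", activation="relu")(x)        # (64, 2, 8)
    x = layers.MaxPooling2D((2, 2))(x)                                                   # (32, 1, 8)
    x = layers.Conv2D(4, (2, 2), strides=2, padding="same", activation="relu")(x)        # (16, 1, 4)
    x = layers.MaxPooling2D((2, 1))(x)                                                   # (8, 1, 4)
    features = layers.Flatten()(x)                                                       # 32 latent features
    return models.Model(inputs, features, name="eeg_encoder")

encoder = build_encoder()
encoder.summary()  # 732 trainable parameters, as in Table C.1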

After the preparation of the feature sets, labels were added to the feature vectors
according to the experimental road segment and time of driving, based on the
experimental design. Specifically, the feature vectors extracted from driving sessions
on the hard road segment during rush hour were labelled as high mental workload. On
the other hand, low mental workload labels were added to the features extracted
from the data recorded during normal hour while driving on the easier road segment, as
prescribed by the experts in the experimental protocol (Di Flumeri et al., 2018; Islam
et al., 2020).

C.2.3 Explanation of Extracted Features


The features extracted from the convolutional autoencoder are based on the
underlying characteristics of the input data, in this study, the EEG signals.
To understand and explain these features, mutual information was used to quantify
the relevance between the spectral features and the autoencoded features. The
mutual information between two random variables is a metric that quantifies their
mutual dependence, and it is an ideal criterion for measuring both linear and nonlinear
correlation. Mutual information has been considered as the base of many well-known
methods, such as hidden Markov models and decision trees (Luo et al., 2019). In fact,
a recent study showed the use of mutual information in developing a combined feature
set from correlated features of different measurements (Islam et al., 2020).
Theoretically, if X and Y are continuous random variables with X, Y ∈ R^d, the
mutual information between X and Y is termed I(X, Y) and formulated as shown
in Equation C.1 (Cover & Thomas, 2006).

    I(X, Y) = \int_{y} \int_{x} p(y, x) \log_2 \frac{p(y, x)}{p(y)\, p(x)} \, dx \, dy    (C.1)
In this study, Fs and Fa denote the spectral and the autoencoder extracted
features, respectively, corresponding to X and Y in Equation C.1. Thus, computing
the mutual information I(Fs, Fa) provides the means of explaining the autoencoder
extracted features by the spectral features as a substitute for domain knowledge.
Afterwards, for a better understanding of the explanation, the mutual information
values are illustrated using Chord diagrams (Tintarev et al., 2018) for the whole
model or a single decision.
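
In practice, the double integral in Equation C.1 is approximated from samples. A minimal sketch using the k-nearest-neighbour estimator in scikit-learn is shown below; the paper does not state which estimator was used, so this choice, along with the matrix sizes and random data, is an assumption for illustration.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_matrix(F_s, F_a):
    # F_s: (n_samples, 45) spectral features; F_a: (n_samples, 32) autoencoder features.
    # Returns a (45, 32) matrix of estimated mutual information values, which can
    # then be grouped by scalp location and frequency band and drawn as a Chord diagram.
    mi = np.zeros((F_s.shape[1], F_a.shape[1]))
    for j in range(F_a.shape[1]):
        mi[:, j] = mutual_info_regression(F_s, F_a[:, j])
    return mi

mi = mi_matrix(np.random.randn(200, 45), np.random.randn(200, 32))
print(mi.shape)  # (45, 32)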

C.2.4 Mental Workload Classification


In order to classify drivers' mental workload from EEG features, Random Forest
(RF) and Support Vector Machine (SVM) classifiers have been invoked, leveraging the
outcome of the previous studies Islam et al. (2019) and Islam et al. (2020). In the cited
studies, the authors compared the selected classifiers with several other AI/ML models, such
as k-Nearest Neighbours (kNN), Multi-Layer Perceptron (MLP) and Logistic Regression,
and reported the highest accuracy for the selected ones in the binary classification of
drivers' mental workload into high and low. In this study, different kernel
functions, e.g., Linear, Polynomial, Radial Basis Function (RBF) and Sigmoid kernels,
were deployed for the SVM to investigate and report the change in performance
metrics while classifying mental workload. Similarly, varying numbers of estimators
and depths were investigated while the RF model was trained. After training and
validating the aforementioned variants of the models with 5-fold cross-validation,
the classifiers yielding the highest accuracy were chosen to train the
final model and generate explanations at the global and local scope.
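
The sketch below illustrates this model selection loop with scikit-learn, comparing SVM kernels and a few RF configurations under 5-fold cross-validation. The candidate parameter values and the synthetic data are placeholders, not the values used in the study.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# X stands in for the 32 autoencoder features, y for the binary workload labels
X, y = np.random.randn(300, 32), np.random.randint(0, 2, 300)

# SVM with different kernel functions
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    print(kernel, cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())

# Random Forest with varying numbers of estimators and depths
for n_estimators in [50, 100, 200]:
    for max_depth in [None, 5, 10]:
        rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=0)
        print(n_estimators, max_depth, cross_val_score(rf, X, y, cv=5, scoring="accuracy").mean())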

C.2.5 Explanation of Mental Workload Classification


To explain the SVM classifier trained to classify mental workload from autoencoder
extracted features, the open-source explainability tool SHAP was used. At first,
explanations were generated at the global and local scope by invoking the built-in
functions. The main components of the explanations are the Shapley values
associated with the autoencoder extracted features, which govern the behaviour of the
model as a whole and for each individual classification. Furthermore, following
the method described in Section C.2.3, Chord diagrams were drawn with the pre-computed
mutual information values to illustrate the relevance between the autoencoder extracted
features and the grouped spectral features, i.e., Theta, Alpha and Beta grouped by
Frontal and Parietal scalp locations. In Section C.3.2, Figures C.1 and C.2 present
the global and local explanations, respectively.
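
A hedged sketch of this step is given below. Since the classifier is a kernel SVM, the model-agnostic KernelExplainer is used here with a small background sample; the paper does not specify which SHAP explainer or settings were applied, so these choices and the synthetic data are assumptions.

import numpy as np
import shap
from sklearn.svm import SVC

# Placeholder data standing in for the 32 autoencoder extracted features
X_train, y_train = np.random.randn(200, 32), np.random.randint(0, 2, 200)
X_test = np.random.randn(20, 32)

svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

# Model-agnostic Shapley value estimation against a background sample
background = shap.sample(X_train, 50)
explainer = shap.KernelExplainer(svm.predict_proba, background)
shap_values = explainer.shap_values(X_test)   # local attributions per instance

# Global view: mean |SHAP| per feature as a bar plot
shap.summary_plot(shap_values, X_test, plot_type="bar")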


C.3 Results and Discussion

The outcome of this study is presented in this section from two different aspects:
mental workload classification, and the explanation of the trained classifier model followed
by the explanation of a single decision. For each aspect, the results are discussed in
the corresponding subsections.

C.3.1 Mental Workload Classification


For mental workload classification, the analysed dataset initially contained 65507
instances, of which 36630 and 28877 instances were labelled as low and high mental
workload, respectively. As the complete dataset was substantially large, the
instances labelled low were randomly down-sampled to match the number of instances
labelled high, leaving a final dataset of 57754 instances. Both the RF and SVM
classifiers were trained and validated with 5-fold cross-validation. After the
training phase, the performance metrics were calculated on the hold-out dataset,
where the total number of observations was 11550 and low mental workload was
considered the positive class. Table C.2 presents the performance metrics of the
mental workload classifiers. The performance metrics were selected based on
the balanced characteristics of the dataset. From the summary, it was observed that
the SVM with RBF kernel produced the highest classification accuracy. On the other
hand, the linear kernel turned out to be the least successful in classifying mental
workload, which signifies the non-linear characteristics of the EEG signals.

Table C.2: Performance summary of mental workload classification using the RF and
SVM classifier models on the holdout test set. © 2021 IEEE.

Classifier   Accuracy   Precision   Recall   F1 score
RF           88.59%     0.9995      0.7723   0.8713
SVM          89.45%     0.9831      0.7876   0.8746

C.3.2 Global and Local Explanation


Following the approaches described in Sections C.2.3 and C.2.5, the explanations
are generated using SHAP. At first, a summary plot is drawn using built-in
functions, which illustrates the prime features driving the model's decisions in terms
of Shapley values. Furthermore, the mutual information values between the spectral
features and the autoencoder extracted features are illustrated using a Chord diagram to
generate the global explanation, as shown in Figure C.1. For a single instance, the local
explanation is generated using the SHAP values contributing to the decision (Figure
C.2). Here, a similar association between the autoencoder and spectral feature groups
can also be shown, as illustrated in the global explanation.
Currently, several explainability tools are available to generate explanations for
models' decisions, in other terms, at the local scope. LIME is one such method
for producing local explanations. However, SHAP has been used here to enhance the
interpretability of the mental workload classifier model, since it has the capability to
produce both local and global explanations, which aligns with the objective of this
study. The difficulty for end users of understanding the Shapley values associated with the
autoencoder extracted features was overcome using the spectral
feature groups of EEG signals. Mutual information values, in naive terms, the relevance
between the spectral features and the autoencoder features, were calculated, followed by
the representation of Chord diagrams to facilitate the domain experts.

Figure C.1: Global explanation of mental workload classifier model with SHAP values
with bar plot (left) and mutual information illustrated with Chord diagrams for six
spectral feature groups (right). © 2021 IEEE.

Figure C.2: Example of a local explanation with SHAP. © 2021 IEEE.

C.4 Conclusion

The contribution presented in this article is twofold: (1) the proposal and illustration of
a novel approach of using mutual information to explain EEG features extracted by a
convolutional autoencoder; to our knowledge, this approach is the only procedure for
explaining autoencoder extracted features; (2) the demonstration of explaining drivers'
mental workload classification at the local and global scope, based on autoencoder
extracted EEG features, using SHAP and mutual information. In broader terms, this
constitutes explaining EEG signal classification, which can be further adopted in other
domains utilising EEG signals.
The experimental results of this study have been encouraging, but there is
space for improvements and further research. In terms of deep learning techniques,
other architectures, such as Recurrent Neural Networks (RNN), could be investigated as a
combined alternative to the working sequence of autoencoder and RF or SVM
classifier. As regards explainability, parameters similar to mutual
information could be exploited to explain the features in a more understandable form
incorporating domain knowledge. Moreover, the quality and form of the explanations, both
at the feature and decision level, could be improved through a validation phase that
involves experts and end users.

Acknowledgement. This article is based on the study performed as a part
of the project BrainSafeDrive, co-funded by the Vetenskapsrådet - The Swedish
Research Council and the Ministero dell'Istruzione dell'Università e della Ricerca
della Repubblica Italiana under the Italy-Sweden Cooperation Program.

Bibliography

Antwarg, L., Miller, R. M., Shapira, B., & Rokach, L. (2021). Explaining Anomalies
Detected by Autoencoders using Shapley Additive Explanations. Expert
Systems with Applications, 186, 115736.
Barua, S., Ahmed, M. U., Ahlstrom, C., Begum, S., & Funk, P. (2018). Automated
EEG Artifact Handling With Application in Driver Monitoring. IEEE
Journal of Biomedical and Health Informatics, 22 (5), 1350–1361.
Bergamasco, L., Saha, S., Bovolo, F., & Bruzzone, L. (2020). An Explainable
Convolutional Autoencoder Model for Unsupervised Change Detection. The
International Archives of the Photogrammetry, Remote Sensing and Spatial
Information Sciences, XLIII-B2-2020, 1513–1519.
Borghini, G., Astolfi, L., Vecchiato, G., Mattia, D., & Babiloni, F. (2014). Measuring
Neurophysiological Signals in Aircraft Pilots and Car Drivers for the
Assessment of Mental Workload, Fatigue and Drowsiness. Neuroscience &
Biobehavioral Reviews, 44, 58–75.
Corcoran, A. W., Alday, P. M., Schlesewsky, M., & Bornkessel-Schlesewsky, I. (2018).
Toward a Reliable, Automated Method of Individual Alpha Frequency (IAF)
Quantification. Psychophysiology, 55 (7), e13064.
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.).
John Wiley & Sons, Inc.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2018). EEG-Based
Mental Workload Neurometric to Evaluate the Impact of Different Traffic
and Road Conditions in Real Driving Settings. Frontiers in Human
Neuroscience, 12, 509.
Elul, R. (1969). Gaussian Behavior of the Electroencephalogram: Changes during
Performance of Mental Task. Science, 164 (3877), 328–331.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi,
D. (2019). A Survey of Methods for Explaining Black Box Models. ACM
Computing Surveys, 51 (5), 1–42.
Kim, J.-Y., & Cho, S.-B. (2021). Explainable Prediction of Electric Energy Demand
using a Deep Autoencoder with Interpretable Latent Space. Expert Systems
with Applications, 186, 115842.


Leonardi, G., Montani, S., & Striani, M. (2020). Deep Feature Extraction for
Representing and Classifying Time Series Cases: Towards an Interpretable
Approach in Haemodialysis. Proceedings of the Thirty-Third International
Florida Artificial Intelligence Research Society Conference (FLAIRS),
417–420.
Leyli abadi, M., & Boubezoul, A. (2021). Deep Neural Networks for Classification of
Riding Patterns: With a Focus on Explainability. Proceedings of the European
Symposium on Artificial Neural Networks, Computational Intelligence and
Machine Learning (ESANN), 481–486.
Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model
Predictions. Proceedings of the 31st International Conference on Neural
Information Processing Systems (NeurIPS), 4768–4777.
Luo, X., Li, X., Wang, Z., & Liang, J. (2019). Discriminant Autoencoder for Feature
Extraction in Fault Diagnosis. Chemometrics and Intelligent Laboratory
Systems, 192, 103814.
Noshad, Z., Bouyer, A., & Noshad, M. (2021). Mutual Information-based
Recommender System using Autoencoder. Applied Soft Computing, 109,
107547.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”:
Explaining the Predictions of Any Classifier. Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), 1135–1144.
Serradilla, O., Zugasti, E., Ramirez de Okariz, J., Rodriguez, J., & Zurutuza, U.
(2021). Adaptable and Explainable Predictive Maintenance: Semi-Supervised
Deep Learning for Anomaly Detection and Diagnosis in Press Machine Data.
Applied Sciences, 11 (16), 7376.
Shankaranarayana, S. M., & Runje, D. (2019). ALIME: Autoencoder Based
Approach for Local Interpretability. In H. Yin, D. Camacho, P. Tino, A. J.
Tallón-Ballesteros, R. Menezes, & R. Allmendinger (Eds.), Intelligent Data
Engineering and Automated Learning – IDEAL 2019 (pp. 454–463). Springer
International Publishing.
Tintarev, N., Rostami, S., & Smyth, B. (2018). Knowing the Unknown: Visualising
Consumption Blind-spots in Recommender Systems. Proceedings of the 33rd
Annual ACM Symposium on Applied Computing (SAC), 1396–1399.
Wu, S.-L., Tung, H.-Y., & Hsu, Y.-L. (2020). Deep Learning for Automatic Quality
Grading of Mangoes: Methods and Insights. 2020 19th IEEE International
Conference on Machine Learning and Applications (ICMLA), 446–453.

Paper D

Interpretable Machine Learning for Modelling and Explaining Car Drivers' Behaviour: An Exploratory Analysis on Heterogeneous Data†

Islam, M. R., Ahmed, M. U. & Begum, S.

Abstract
Understanding individual car drivers' behavioural variations and
heterogeneity is a significant aspect of developing car simulator technologies,
which are widely used in transport safety. This study also characterizes the
heterogeneity in drivers' behaviour in terms of risk and hurry, using
both real-time on-track and in-simulator driving performance features.
Machine learning (ML) interpretability has become increasingly crucial for
identifying accurate and relevant structural relationships between spatial
events and the factors that explain drivers' behaviour while it is being classified,
and for evaluating the explanations of these classifications. However, the high predictive
power of ML algorithms ignores the characteristics of non-stationary domain
relationships in spatiotemporal data (e.g., dependence, heterogeneity),
which can lead to incorrect interpretations and poor management decisions.
This study addresses this critical issue of 'interpretability' in ML-based
modelling of the structural relationships between events and the corresponding
features of car drivers' behavioural variations. In this work, an
exploratory experiment is described that comprises simulator and real
driving concurrently, with the goal of enhancing simulator technologies.
Initially, several analytic techniques were explored on the heterogeneous data
to assess simulator bias in drivers' behaviour. Afterwards, five
different ML classifier models were developed to classify risk and hurry in
drivers' behaviour in real and simulator driving. Furthermore, two different
† © 2023 by SCITEPRESS (CC BY-NC-ND 4.0). Reprinted, with permission, from Islam,

M. R., Ahmed, M. U., & Begum, S. (2023). Interpretable Machine Learning for Modelling and
Explaining Car Drivers’ Behaviour: An Exploratory Analysis on Heterogeneous Data. Proceedings
of the 15th International Conference on Agents and Artificial Intelligence (ICAART), 392–404.


feature attribution-based explanation models were developed to explain
the decisions from the classifiers. According to the results and observations,
among the classifiers, Gradient Boosted Decision Trees performed best with
a classification accuracy of 98.62%. After quantitative evaluation, among
the feature attribution methods, the explanation from Shapley Additive
Explanations (SHAP) was found to be more accurate. The use of different
metrics for evaluating explanation methods and their outcomes lays the path
toward further research in enhancing feature attribution methods.
Keywords: Artificial Intelligence · Driving Behaviour · Feature Attribution · Evaluation · Explainable Artificial Intelligence · Interpretability · Road Safety.

D.1 Introduction

Artificial Intelligence (AI) and Machine Learning (ML) models are the basis of
intelligent systems and are continuously gaining popularity across diverse domains. The
prime reason behind the models' growing popularity is the outstanding and accurate
computation of features and the predictions based on these features. Among the AI/ML
facilitated domains, the transportation domain notably uses different models
within the framework of driving simulators. Driving simulators are increasingly
adopted in different countries for diverse objectives, e.g., driver training, road safety,
etc. (Sætren et al., 2019).
In conjunction with the increasing demands for explanations of AI/ML model decisions
in other domains, the need for explanation is also rising for
automated actions in simulators. Different fields from other
domains have already benefited from eXplainable AI (XAI) research, e.g., anomaly
detection (Antwarg et al., 2021), predictive maintenance (Serradilla et al., 2021) and
image processing (Wu et al., 2020). Conversely, road safety related simulator
development and enhancement have been less exploited in XAI research. Although
a few studies are available in the literature, explaining the riding
patterns of motorbikes (Leyli abadi & Boubezoul, 2021), explaining drivers' fatigue
prediction (Zhou et al., 2022), etc., research studies on drivers' behaviour are scarce
in terms of XAI. In addition, research on the evaluation of explanations for the
predictions or decisions of an AI/ML model is also in a nascent state.
This research study was conducted in response to the need for research to enhance
simulation technologies and the complementary requirement for the development of
explanation models. The main objectives of the work presented in this
paper can be outlined as follows:
• Explore the variation of drivers’ behaviour in the simulator and track driving
to enhance the simulator technologies.
• Develop classifiers for drivers’ behaviour in terms of risk and hurry while
driving.
• Explain the decisions of drivers’ behaviour classifiers and evaluate the
explanations.


The remaining sections of this paper are organised as follows: Section D.2
introduces the materials and methodologies used in this study. The results and
corresponding discussions on the findings are presented in Section D.3. Finally,
Section D.4 contains the concluding remarks and directions for future research works.

D.2 Materials and Methods

This section contains a detailed description of the experimental protocol, data


collection, feature extraction, development of classifiers and explanation generation
at local and global scope.

Figure D.1: The experimental route for simulation and track tests. A detailed
description is presented in Section D.2.1. © 2023 by SCITEPRESS (CC BY-NC-ND
4.0).

D.2.1 Experimental Protocol


The experiment to collect data for this study was conducted under the framework of
the European Union's Horizon 2020 project SIMUSAFE1 (SIMUlation of behavioural
aspects for SAFEr transport). Sixteen drivers, both male and female, were recruited to
participate in the study. They were selected from two
age groups, 18-24 and 50+ years, representing inexperienced and experienced drivers
respectively. The participants were selected in such a way as to obtain a homogeneous
experimental group in terms of age, sex and driving experience. The participants were
properly instructed about the experiments through information meetings. Informed
consent and authorisation to use the acquired data in the research were obtained
from each participant on paper. Throughout the experimental process, the General
Data Protection Regulation (GDPR) (Voigt & Von Dem Bussche, 2017) was strictly
followed.
The experimental protocol was outlined in accordance with the aim of the project
SIMUSAFE: to improve driving simulator and traffic simulation technology to safely
assess risk perception and decision-making of road users. To partially achieve the
1 https://www.cordis.europa.eu/project/id/723386


Figure D.2: The car simulator developed with DriverSeat 650 ST was used for
conducting the simulation tests. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

aim, the experiment was planned with simulator and track driving tests. In both
the simulation and track tests, participant drivers were required to drive along an
identical route for seven laps under different variables. This design further facilitated
the analysis of varying behaviour while driving on track and in simulation. The route of
the experiment is illustrated in Figure D.1. For the track test, the route was prepared
with proper road markings, signals, etc. at an old airport in Kraków, Poland. In the
simulation tests, a modified variant of the DriverSeat 650 ST (Figure D.2) simulation
cockpit was used. As annotated in Figure D.1, each participant started the lap
from point A, drove straight up to the roundabout at point B, took the third exit
of the roundabout, drove up to point C to take a right turn, drove straight up to
point D, then took a U-turn and came back to point C for a left turn, and then
drove through points B (roundabout), E (right turn) and C (left turn), finishing at
point F after a left curve. For the simulation test, a similar route was designed
virtually, where the participants drove following the same protocol. In both tests, a
participant drove through the route for seven laps with different scenarios containing
varied environmental and driver variables, as outlined in Table D.1. The scenarios
associated with the laps were designed in consultation with psychologists and
domain experts.

D.2.2 Data Collection


During the whole protocol, vehicular signals, physiological signals, psychological data
and videos were recorded for each participant. In this study, only the vehicular
and physiological signals, specifically EEG, have been exploited. All the data were
properly anonymized to comply with the GDPR. The data collection methods and
materials are briefly described in the following sections.


Table D.1: Associated scenarios for the laps of the experimental simulator and track
driving with varying driving conditions. © 2023 by SCITEPRESS (CC BY-NC-ND
4.0).

Events (all laps): Roundabout, Left Turn, Intersection with no Traffic Lights.

Lap  Traffic  Habituation  Hurry  Frustration  Surprise  Scenario
1    No       Low          No     No           No        Drive along the route.
2    No       Low          No     No           No        Drive along the route.
3    No       High         No     No           No        Drive along the route.
4    Yes      High         No     No           No        Drive along the route.
5    No       High         Yes    No           No        Drive along the route and finish as quickly as possible.
6    Yes      High         Yes    Yes          No        Drive along the route and finish as quickly as possible.
7    No       High         No     No           Yes       Drive along the route.

D.2.2.1 Vehicular Signal


The vehicular signals were acquired as numeric descriptive information using
onboard instruments accessed via the vehicle Controller Area Network (CAN)
and an Inertial Measurement Unit (IMU). The signals contained information on
parameters such as vehicle speed, acceleration, steering wheel angle, accelerator and
brake pedal positions, Global Positioning System (GPS) coordinates, yaw, roll, pitch,
etc. For the track tests, the signals were directly acquired from the vehicle unit and for
the simulations, the measurements were recorded from the simulation framework. In
both cases, the recording frequency was 15Hz.

D.2.2.2 Biometric Signal


During both tests, i.e., simulation and track, the biometric signals in terms of EEG
were recorded using the SAGA 32+ Systems2 (TMSi, The Netherlands). Sixteen
EEG channels (Fp1, Fpz, Fp2, F7, F3, Fz, F4, F8, P7, P3, Pz, P4, P8, O1, Oz
and O2), placed according to the 10–20 International System with a Brainwave EEG
head cap, were collected with a sampling frequency of 256Hz, grounded to the Cz
site. During the experiments, raw EEG data were recorded and afterwards digitally
filtered using a band-pass filter (2 − 70Hz) in the TMSi SAGA Interface with FieldTrip
(Oostenveld et al., 2010) integration. Finally, the ARTE (Automated aRTifacts handling
in EEG) (Barua et al., 2018) algorithm was used to remove the artefacts from
the band-pass filtered signals. This step was necessary because artefacts, e.g.,
eye blinks, could affect the frequency bands correlated to the target measurements.
However, this method allows cleaning the EEG signal without losing data and without
requiring additional sensors, e.g., electro-oculographic sensors.

D.2.2.3 Event Extraction


The presented work within the framework of the SIMUSAFE project focused on risk
perception, handling and hurry of drivers in urban manoeuvres that expose higher
2 https://www.tmsi.com/products/saga-for-eeg


levels of risk. In risky situations, prime events were short-listed by experts, including
roundabouts, left turns, extensive braking/acceleration, etc. As per the experts'
opinion, the events were defined based on the road infrastructure. To label the
acquired data, all the GPS coordinates were plotted and overlaid on the experimental
track to identify the specific GPS coordinates where an event could occur. Figure
D.3 illustrates the event extraction from GPS coordinates using an overlaid scatter
plot. Considering the GPS coordinates within the red rectangles in Figure D.3 and
consulting with domain experts and psychologists, the data points were complemented
with corresponding events. Figure D.4 illustrates the recorded GPS coordinates of a
single lap, categorised into events in different colours on the basis of road infrastructure.
The extracted events are further discussed in Section D.3.1.

Figure D.3: Event extraction using GPS coordinates. Red rectangles mark the
significant areas of events, e.g., roundabout, left turn, signal with pedestrian crossing
etc. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

D.2.3 Dataset Preparation


The dataset for the presented work contains two separate sets of features and two
different labels, i.e., risk and hurry. The features were extracted from the data
collected in the simulation and track tests. The feature extraction was
performed in two folds after the events of interest were extracted using the
experts' annotations on the raw data, i.e., the specific timestamps of the
events' start and end. Based on the experts' annotations, for both vehicular and EEG
signals, the raw data was chunked into epochs of 2 seconds using a moving window with
a shift of 0.125 seconds to preserve the stationarity condition of the time-series data.
Firstly, the vehicular features were extracted. In the second step, EEG features in
the frequency domain were extracted and synchronised with the vehicular features
on the basis of the timestamps of the data recording. Finally, the dataset was prepared by
combining the extracted features with the events and the experts' annotated labels.
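
The timestamp-based synchronisation of the two feature streams can be sketched with a nearest-timestamp join, for example with pandas, as shown below; the column names, timestamps and the 100 ms tolerance are illustrative assumptions, not the project's actual settings.

import pandas as pd

# Each row represents one 2 s epoch keyed by its start timestamp
veh = pd.DataFrame({"timestamp": pd.to_datetime(["2020-01-01 10:00:00.000",
                                                 "2020-01-01 10:00:00.125"]),
                    "avg_speed": [31.2, 32.0]})
eeg = pd.DataFrame({"timestamp": pd.to_datetime(["2020-01-01 10:00:00.010",
                                                 "2020-01-01 10:00:00.130"]),
                    "frontal_theta": [4.1, 4.3]})

# Align each vehicular epoch with the nearest EEG epoch within 100 ms
merged = pd.merge_asof(veh.sort_values("timestamp"),
                       eeg.sort_values("timestamp"),
                       on="timestamp", direction="nearest",
                       tolerance=pd.Timedelta("100ms"))
print(merged)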
The vehicular feature sets were populated using the signals from vehicle CAN
and IMU. The major features extracted from the vehicle CAN are speed, accelerator


Figure D.4: GPS coordinates of a single lap driving colour-coded with respect to
different road structures. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

pedal position and steering wheel angle. The average and standard deviation of
these measures were calculated within the start and end time of the events
annotated by the experts. These features were gathered in the feature list, including
the maximum value for speed only, resulting in 7 features. From the IMU, the parameters
for angular and linear acceleration were considered and 9 features were calculated.
All the features extracted from the vehicular signals are listed in Table D.2.

Table D.2: List of features extracted from vehicular signals. © 2023 by SCITEPRESS
(CC BY-NC-ND 4.0).

Feature Name Count Source


Max. Speed
Avg. Speed
Std. Dev. Speed
Avg. Accelerator Pedal Pos. 07 CAN
Std. Dev. Accelerator Pedal Pos.
Avg. Steering Angle
Std. Dev. Steering Angle
Yaw
Yaw Rate
Roll
Roll Rate
Pitch 09 IMU
Pitch Rate
Lateral Acceleration
Longitudinal Acceleration
Vertical Acceleration
Avg.- Average, Max.- Maximum, Pos.- Position, Std.
Dev.- Standard Deviation.


From the curated EEG signals, 14 frequency domain features were extracted
from the power spectral density values. At first, the Individual Alpha Frequency
(IAF) (Corcoran et al., 2018) values were estimated as the peak of the general alpha
rhythm frequency (8 − 12Hz). Eventually, the average PSD within the theta band
[IAF −6, IAF −2], alpha band [IAF −2, IAF +2] and beta band [IAF +2, IAF +18]
was calculated over all the aforementioned EEG channels. Next, the channels were
partitioned on the basis of frontal and parietal locations on the scalp. For the alpha and
beta bands, the frontal and parietal parts were further divided into two segments, upper
and lower. For each of the segments, the average value of the frequency band
was considered as a feature, thus obtaining a total of fourteen biometric features.
Table D.3 presents the list of the extracted biometric features that have been further
deployed in the classification tasks.

Table D.3: List of biometric features considering different frequency bands of EEG
signal. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

Feature Name Count Source


Frontal Theta
Parietal Theta
Frontal Alpha
Lower Frontal Alpha
Upper Frontal Alpha
Parietal Alpha
Lower Parietal Alpha
14 EEG
Upper Parietal Alpha
Frontal Beta
Lower Frontal Beta
Upper Frontal Beta
Parietal Beta
Lower Parietal Beta
Upper Parietal Beta

Summarising, a total of 30 features were extracted from the vehicular and
biometric data recorded in the simulation and track tests. Among those, 16
features were extracted from the vehicle CAN & IMU sensors and 14 features
were extracted from the EEG signals. In addition to the libraries mentioned in the
respective sections, the Python libraries NumPy and Pandas were also employed for
data preparation.
After the feature extraction, the data points were clustered into the various events
described in Section D.2.2.3. For each event, the data points were labelled with the
associated risk and hurry based on the laps of the experimental protocol (Table
D.1) and the psychologists' assessment. Each instance was labelled with yes or no for
risk and hurry depending on their presence in the behaviour of the corresponding
participant. The procedure produced 1771 data instances with varied numbers of
instances for the different labels of risk and hurry. Initially, the dataset was found to
be largely imbalanced. To enhance the further analysis, the instances of the minority
class for both risk and hurry were upsampled using SMOTE (Chawla et al., 2002).
Table D.4 presents the summary of the dataset.
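
A minimal sketch of the upsampling step with the imbalanced-learn implementation of SMOTE is shown below; the synthetic data only mimics the class ratio of the risk labels and does not reproduce the study's dataset.

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data roughly mimicking the risk labels (1226 vs 545)
X, y = make_classification(n_samples=1771, n_features=30,
                           weights=[0.69, 0.31], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))  # minority class upsampled to match the majority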


Table D.4: Summary of the datasets from the simulator and track experiments for risk
and hurry classification. The values represent the number of instances for corresponding
labels of the classification tasks before applying SMOTE. © 2023 by SCITEPRESS
(CC BY-NC-ND 4.0).

Classification   Label   Simulation   Track   Total
Risk             Yes     330          215     545
Risk             No      696          530     1226
Hurry            Yes     201          19      220
Hurry            No      825          726     1551
Total Instances          1026         745     1771

D.2.4 Classifier and Explanation Models


This section briefly describes the models invoked in the presented work. Prior to the
discussion of the models, the utilised dataset is formulated theoretically. The
data prepared as described in Section D.2.3 is D, comprising a feature set X and
labels Y, i.e., D = (X, Y). Each instance xi ∈ X, where i = 1, ..., n, contains features
fj ∈ F, where j = 1, ..., m. The label yi ∈ Y is associated with the corresponding
instance xi ∈ X and varies with the classification task, i.e., risk or hurry. For
all the tasks, D is split into Dtrain and Dtest at a ratio of 80:20.

D.2.4.1 Classifier Models


The intended task is to classify risk and hurry separately, which sets the context
for a classification model c(xi). In all cases, c(xi) is trained using the instances of
Xtrain ⊂ X to predict the labels ŷi. The parameter tuning of c(xi) was performed
by comparing ŷi and yi ∈ Ytrain ⊂ Y.
The candidates for c(xi) were selected considering the performance of
modelling car drivers' actions using different AI/ML models with a similar feature set
in a previous work (Islam et al., 2020). Initially, four different classifiers were
tested to classify risk and hurry, namely Logistic Regression (LR),
Multilayer Perceptron (MLP), Random Forest (RF) and Support Vector Machine
(SVM). In addition to these models, Gradient Boosted Decision Trees (GBDT) were
also tested for the described classification tasks. GBDT was introduced
in this study as an ensemble model, which complements the use of different types of
AI/ML models. The training parameters for all the models were tuned using grid
search and 5-fold cross-validation. All the parameters for the selected models that were
tested in the grid search are presented in Table D.5, with the chosen parameters for the
classifiers highlighted in the summary table. The Python Scikit Learn (Pedregosa et al.,
2011) library was invoked for training, validating and testing the classifier models.
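
The grid search with 5-fold cross-validation can be sketched with scikit-learn as below for the GBDT model. The grid is a subset of Table D.5 (the 'loss' values are left out because their accepted names depend on the scikit-learn version), and the synthetic data merely stands in for the 30-feature dataset.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data with 30 features, split 80:20 as in the study
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"n_estimators": [100, 200, 300, 400, 500],
              "learning_rate": [1e-3, 1e-2, 1e-1, 1],
              "max_depth": [1, 3, 5, 7, 9]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))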

D.2.4.2 Explanation Models


Literature indicates that feature attribution methods are common choices for tabular data
(Liu et al., 2021; Islam et al., 2022). A feature attribution method can be denoted


Table D.5: Parameters used in tuning different AI/ML models for classifying risk and
hurry in driving behaviour with 5-fold cross validation. The parameters used for final
training are highlighted in bold font. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

Gradient Boosted Decision Trees (GBDT)
  Estimators: [100, 200, 300, 400, 500]
  Learning Rate: [1e−3, 1e−2, 1e−1, 1]
  Max. Depth: [1, 3, 5, 7, 9]
  Loss: [deviance, exponential]

Logistic Regression (LR)
  C: [1e−4, 1e−3, 1e−2, 1e−1, 1, 1e1, 1e2, 1e3, 1e4]
  Penalty: [l1, l2]
  Solver: [liblinear]

Multilayer Perceptron (MLP)
  Hidden layers: [(32, 16, 8, 4), (32, 16, 4), (16, 8, 4)]
  Activation: [identity, logistic, tanh, relu]
  Alpha: [1e−4, 1e−3, 1e−2]
  Solver: [adam, lbfgs, sgd]

Random Forest (RF)
  Estimators: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
  Criterion: [gini, entropy]
  Max. Features: [2^0, 2^1, 2^2, 2^3, 2^4, 2^5, 2^6, 2^7]

Support Vector Machine (SVM)
  C: [1, 1e1, 1e2, 1e3]
  Gamma: [1e−5, 1e−4, 1e−3, 1e−2, 1e−1, 1]
  Kernel: [linear, poly, rbf, sigmoid]

as f, which estimates the importance ω of each feature to the prediction. That is, for
a given classifier model c and a data point xi, f(c, xi) = ω ∈ Rm. Here, each ωj
refers to the relative importance of feature j for the prediction c(xi). Among the
feature attribution methods, Shapley Additive Explanations (SHAP) (Lundberg &
Lee, 2017) and Local Interpretable Model-Agnostic Explanation (LIME) (Ribeiro et al.,
2016) are exploited in this work, being popular choices in current research
works (Islam et al., 2022). Both explanation models were built for GBDT and
Dtest to generate local and global explanations. TreeExplainer was invoked for SHAP
to complement the characteristics of GBDT, and LIME was trained with the default
settings from the corresponding library.
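
A minimal sketch of building both explanation models for a trained GBDT is given below; the feature names, class names and synthetic data are placeholders, and the LIME settings shown are simply the library defaults.

import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)

# SHAP: TreeExplainer matches the tree-ensemble structure of GBDT
tree_explainer = shap.TreeExplainer(gbdt)
shap_values = tree_explainer.shap_values(X[:50])   # local attributions

# LIME: tabular explainer with default settings, explaining one instance
feature_names = [f"f{j}" for j in range(X.shape[1])]
lime_explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                      class_names=["No", "Yes"], mode="classification")
lime_exp = lime_explainer.explain_instance(X[0], gbdt.predict_proba, num_features=10)
print(lime_exp.as_list())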

D.2.5 Evaluation
The evaluation of the presented work has been performed in two folds: evaluating
the performance of the classification models in classifying risk and hurry in drivers'
behaviour, and evaluating the feature attribution using SHAP & LIME to explain
the classification. The metrics used for both evaluations are briefly described in the
following subsections.

D.2.5.1 Metrics for Classification Model


Considering the binary classification for both risk and hurry, the confusion matrix
(Figure D.5) has been used as the basis of the evaluation of the classifier models, c(x).
In both classification tasks, the presence of risk or hurry is considered the
positive label and absence is considered the negative label. In the confusion


matrix, True Positive (TP) and False Negative (FN) are the numbers of correct and
wrong predictions, respectively, for the positive class, i.e., Yes (1). On the other hand,
False Positive (FP) and True Negative (TN) are the numbers of wrong and correct
predictions, respectively, for the negative class, i.e., No (0).

Figure D.5: Confusion Matrix for both Risk and Hurry Classification. © 2023 by
SCITEPRESS (CC BY-NC-ND 4.0).

As described in Section D.2.3, the dataset was prepared as a balanced dataset.
Considering this, the metrics selected to evaluate the performance of c(x) were
Accuracy, Precision, Recall and F1 score, as prescribed in (Sokolova & Lapalme, 2009).

D.2.5.2 Metrics for Explanation Model


The performance of the explanation models was measured using three different
metrics: accuracy, the Normalized Discounted Cumulative Gain (nDCG) score
(Busa-Fekete et al., 2012) and Spearman's rank correlation coefficient (ρ) (Zar, 1972).
The accuracy score for an explanation model was computed as the percentage
of local predictions by the explanation model that match the classifier model, i.e.,
|c(x) ≡ f(x)| / |Xtest|. This metric reflects how closely the explanation models mimic the
predictions of the classifier models.
To assess the feature attribution, the orders of important features from the
explanation models and from GBDT were considered to calculate the nDCG score and ρ.
Both measures are used to compare the order of retrieved documents in information
retrieval. Specifically, the nDCG score produces a quantitative measure of the
relevance between two sets of ranks of some entities. Here, these scores were
used to evaluate the feature ranking by the explanation models in contrast with
the prediction model. For nDCG, the values were calculated for all the
instances together and for each instance individually, denoted as nDCGall and nDCGind
respectively in Table D.9. Similarly, ρ produces a comparable measure of the agreement
between two vectors of ranks, which was used in parallel to support the nDCG score.
Further details on the computation of these metrics can be found in the respective
articles (Zar, 1972; Busa-Fekete et al., 2012). In this work, the values were computed
using methods from the SciPy library for Python.
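
As an illustration of the two ranking metrics, the sketch below compares a classifier's normalised feature importances (used as the reference relevance) with an explainer's mean absolute attributions, using scipy and scikit-learn; the importance vectors are randomly generated placeholders.

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import ndcg_score

rng = np.random.default_rng(0)
gbdt_importance = rng.random(30)                                    # reference relevance
shap_importance = gbdt_importance + 0.1 * rng.standard_normal(30)   # explainer's scores

# nDCG compares the explainer's ranking against the classifier's relevance
ndcg = ndcg_score(gbdt_importance.reshape(1, -1), shap_importance.reshape(1, -1))

# Spearman's rho compares the two rank orders directly
rho, p = spearmanr(gbdt_importance, shap_importance)
print(f"nDCG = {ndcg:.4f}, rho = {rho:.4f} (p = {p:.3g})")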

D.3 Results and Discussion

The outcomes of the performed analysis, classification tasks and explanation
generation are presented and discussed in this section with tables and
illustrations. The illustrations were prepared using different methods of the
Matplotlib library of Python.

D.3.1 Exploratory Analysis


Aligning with the focus of the project SIMUSAFE, i.e., enhancing simulation
technologies to make the traffic environment safer, an exploratory analysis was
conducted. The outcome of the analysis was further utilised to develop training
simulators for road users with more intelligent agents, which is out of the scope of the
work presented in this paper. However, the insights from the analysis were
used to build intuition for the classification tasks and explanations.
The first step of the analysis assessed the variation of the vehicular
features between the simulation and track datasets over the laps, which represent
different road scenarios, interchangeably termed events, as described in Table
D.1. Mostly, mean values were compared and two-sided Wilcoxon signed-rank tests
(Wilcoxon, 1992) were performed. In the significance tests, the null hypothesis
H0 was considered as “there is no difference between the observations of the two
measurements”. Subsequently, the alternate hypothesis H1 was derived as “the
observations of the two measurements are not equal” and the level of significance
was set to 0.05. The first comparison was done on the driving velocity. Figure
D.6 illustrates the average driving velocity in different laps for simulation and track
driving. The standard deviations are also shown with the respective error bars
in the plot. For both tests, it was observed that the average velocity increased in laps
5 – 7, which aligned with the experimental protocol. From the two-sided Wilcoxon
signed-rank test, a statistically significant difference was observed between simulation
and track driving (t = 0.0, p = 0.0156); thus, the alternate hypothesis H1 was
accepted. The analysis of the accelerator pedal position (Figure D.7) produced
a similar trend across the laps for both tests, and the statistical test had an identical
outcome.
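
The test itself is available in scipy, as sketched below with illustrative per-lap mean velocities (not the measured values); with seven paired laps that all differ in the same direction, the exact two-sided p-value is 2/2^7 ≈ 0.0156, which is consistent with the statistic reported above.

import numpy as np
from scipy.stats import wilcoxon

# Illustrative paired observations: mean velocity per lap (laps 1-7)
# in the simulator and on the track for the same group of drivers.
sim_velocity   = np.array([34.1, 35.0, 35.8, 36.2, 41.5, 42.3, 38.9])
track_velocity = np.array([30.2, 31.1, 31.5, 32.0, 37.8, 38.4, 35.6])

stat, p = wilcoxon(sim_velocity, track_velocity, alternative="two-sided")
print(f"t = {stat}, p = {p:.4f}")  # reject H0 at the 0.05 level when p < 0.05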

Figure D.6: Average driving velocity in different laps. The two-sided Wilcoxon
signed-rank test demonstrates a significant difference in the simulator and track driving
with t = 0.0, p = 0.0156. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

From both the analysis of driving velocity and accelerator pedal position, it was
evident that drivers tend to drive at a higher velocity and press the accelerator
pedal more in simulation tests than in track tests. This is plausibly an effect
of simulator bias. In naive terms, drivers do not experience the motion of the
vehicle or perceive the environment properly, e.g., the vibration of the vehicle,
202
Paper D

Figure D.7: Average accelerator pedal position across all the laps and the two-sided
Wilcoxon signed-rank test demonstrate a significant difference in the simulator and
track driving with t = 0.0, p = 0.0156. © 2023 by SCITEPRESS (CC BY-NC-ND
4.0).

the effect of road structures, etc. The differences in driving behaviour have
been properly addressed with the corresponding experts, and reducing the simulation
biases in future studies is a work in progress. Moreover, while deploying ML
algorithms to classify drivers' behaviour, these characteristics of non-stationary
spatiotemporal data might lead to incorrect interpretations. To correctly assess the
effects and contributions of the heterogeneous features, two different XAI methods
were evaluated and are presented in Section D.3.3.

(a) Lap 1. (b) Lap 2. (c) Lap 3.

(d) Lap 4. (e) Lap 5. (f) Lap 6.

Figure D.8: GPS coordinates with varying driving velocity for a random participant
in laps 1 – 6. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

The driving velocity in each lap was also analysed based on different road
structures using scatter plots and heatmaps as illustrated in Figure D.8. In this
Table D.6: Performance measures of risky behaviour classification with the AI/ML models trained on the holdout test set of different
datasets. The best values for each metric and each dataset are highlighted in bold font. (Positive Class – Risk, Negative Class – No Risk ).
© 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
           Simulation Dataset              Track Dataset                   Combined Dataset
Metrics    GBDT  LR   MLP  RF   SVM        GBDT  LR   MLP  RF   SVM        GBDT  LR   MLP  RF   SVM
TP 105 82 86 23 100 106 88 103 56 105 229 186 226 233 228
FN 15 38 34 97 20 0 18 3 50 1 8 51 11 4 9
FP 16 26 42 0 14 3 24 5 0 4 30 62 45 26 23
TN 112 102 86 128 114 109 88 107 112 108 199 167 184 203 206
Precision 0.868 0.759 0.672 1.0 0.877 0.972 0.786 0.954 1.0 0.963 0.884 0.75 0.834 0.900 0.908
Recall 0.875 0.683 0.717 0.192 0.833 1.0 0.830 0.972 0.528 0.991 0.966 0.785 0.954 0.983 0.962
F1 score 0.871 0.719 0.694 0.322 0.855 0.986 0.807 0.963 0.691 0.977 0.923 0.767 0.89 0.940 0.934

Accuracy 87.50 74.19 69.36 60.89 86.29 98.62 80.73 96.33 77.06 97.71 91.85 75.75 87.98 93.56 93.13

Table D.7: Performance measures of hurry classification with the AI/ML models trained on the holdout test set of different datasets. The
best values for each metric and each dataset are highlighted in bold font. (Positive Class – Hurry, Negative Class – No Hurry). © 2023 by
SCITEPRESS (CC BY-NC-ND 4.0).
           Simulation Dataset              Track Dataset                   Combined Dataset
Metrics    GBDT  LR   MLP  RF   SVM        GBDT  LR   MLP  RF   SVM        GBDT  LR   MLP  RF   SVM
TP 92 90 61 110 84 70 66 56 81 68 145 130 137 143 149
FN 18 20 49 0 26 11 15 25 0 13 25 40 33 27 21
FP 8 22 25 90 10 13 25 18 59 9 24 75 41 31 33
TN 91 77 74 9 89 65 53 60 19 69 174 123 157 167 165
Precision 0.920 0.804 0.709 0.550 0.894 0.843 0.725 0.757 0.579 0.883 0.858 0.634 0.770 0.822 0.819
Recall 0.836 0.818 0.555 1.0 0.764 0.864 0.815 0.691 1.0 0.840 0.853 0.765 0.806 0.841 0.876
F1 score 0.876 0.811 0.622 0.710 0.824 0.854 0.767 0.723 0.733 0.861 0.855 0.693 0.787 0.831 0.847
Accuracy 87.56 79.90 64.59 56.94 82.78 84.91 74.84 72.96 62.89 86.16 86.69 68.75 79.89 84.23 85.33

analysis, the seventh lap was excluded because the presence of the surprise event
reduced the data available from driving the full lap. The pattern of driving velocity in laps 1
– 3 (Figures D.8a – D.8c) was found to be identical. The variation increased in laps 4
– 6 (Figures D.8d – D.8f), when several variables were added to the lap scenarios. The
illustrated driving patterns were cross-checked with the psychologists' assessments of the
participants and their conclusive drivers' rules of behaviour. For example, on a left
turn, the behaviour of drivers can be stated as – “if the road is one carriageway, then
you have to gradually move on the left and look for cars coming from the opposite
direction before turning left”. In all the sub-figures of Figure D.8, it can be observed
that, at the left turn near longitude 500 and latitude 750, the driver slowed down to
examine oncoming vehicles and moved towards the left before the turn, as the road was a
single carriageway by design. Another major observation can be found in lap 6 at
the lower middle of the circuit, near longitude 550 and latitude 725 (Figure D.8f).
There was a signal with a pedestrian crossing and the driving velocity was close to
zero, which indicates that the stop signal was lit or a pedestrian was crossing and
the driver responded to the signal. Thus, drivers' behaviour at different events in
terms of road infrastructure was analysed and the observations were put forward
to the respective experts for enhancing the quality of the agents in future simulators.

D.3.2 Classification
The classification of drivers' behaviour was done in two folds: risk and hurry. It
is arguable that hurried driving can induce risk; on the contrary, hurriedness is
often observed among drivers who drive safely. Driving safely refers to specific
behaviours, an example of which is stated in Section D.3.1. Based on the drivers' rules of
behaviour proposed by the experts, classifying risk and hurry are considered separate
tasks. The performance of the trained models on the holdout datasets for risk and
hurry classification is presented in Tables D.6 and D.7, respectively. In both tasks,
GBDT apparently excelled over the other models. However, for all the datasets in both
tasks, the simpler models among those investigated produced better performance. The
use of precision and recall was justified by the nature of the classification tasks, which
concentrate the measures on classifying the positive class. In this work, the
positive class was set to be the presence of risk and hurry in drivers' behaviour,
which is more important to classify than their absence. One notable behaviour
was observed for RF: it performed poorly when used on the simulation and
track datasets separately, but on the combined dataset it produced the best result for risk
classification. In the case of hurry classification, its behaviour was quite different.
Due to this fluctuation in performance across different datasets and tasks, RF
was not further utilized to develop the explanation models.
Table D.8 presents the best classifier for both risk and hurry classification across
the three datasets. It is observed that overall GBDT performed better across the
combinations, which led to its use in the explanation generation. Moreover, to
accumulate all the characteristics of the data in the explanation model, only the
combined dataset has been used further.


Table D.8: Summary of model performances in terms of accuracy across different
datasets and classification tasks. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

Dataset      Risk             Hurry
Simulation   GBDT (87.50%)    GBDT (87.56%)
Track        GBDT (98.62%)    SVM (86.16%)
Combined     RF (93.56%)      GBDT (86.69%)

D.3.3 Explanation
Considering the prediction performance of GBDT across the datasets and classification
tasks, the explanation models SHAP and LIME were built to explain individual
predictions, i.e., local explanations. While explaining a single prediction
from c, both models mimic the inference mechanism of c to predict the instance within
their own framework. The prediction performance of the explanation models was measured
with the local accuracy described in Section D.2.5.2 and the values are presented in
Table D.9. It was observed that for both classification tasks, SHAP achieved higher
accuracy than LIME. Moreover, LIME performed very poorly in local predictions
for risk classification. However, both explanation models performed comparatively
poorly in terms of hurry classification.

Table D.9: Pairwise comparison of performance metrics for SHAP and LIME on
combined Xtest (holdout test set) for risk and hurry. For all the metrics, higher values
are better and highlighted in bold font. All the values for ρ are statistically significant
since p < 0.05. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

Metrics      Risk                     Hurry
             SHAP       LIME          SHAP       LIME
Accuracy     92.59%     52.98%        84.32%     70.06%
nDCG_all     0.9561     0.8758        0.9588     0.9183
nDCG_ind     0.8717     0.8589        0.8671     0.8524
ρ            0.7664     0.5310        0.7059     0.4772
p            7.91e−7    2.53e−3       1.31e−5    7.67e−3
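
As a hedged illustration of how such local explanations can be obtained, the sketch below continues from the classifier sketch in Section D.3.2 and queries SHAP's TreeExplainer and LIME's LimeTabularExplainer for a single holdout instance. The local-accuracy check shown here, comparing the surrogate output with the classifier's own output, is an assumed simplification rather than the exact procedure of Section D.2.5.2.

import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer

feature_names = [f"f{j}" for j in range(X_train.shape[1])]  # placeholder names

# SHAP: additive attributions for tree ensembles; the base value plus the sum
# of the contributions reproduces the model's raw margin for the instance.
shap_explainer = shap.TreeExplainer(gbdt)
shap_values = shap_explainer.shap_values(X_test)
i = 0  # index of the holdout instance to explain
shap_margin = np.ravel(shap_explainer.expected_value)[0] + shap_values[i].sum()
print("SHAP margin:", shap_margin,
      "vs. GBDT margin:", gbdt.decision_function(X_test[i:i + 1])[0])

# LIME: a local linear surrogate fitted on perturbed samples around the instance.
lime_explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=["absent", "present"], mode="classification")
lime_exp = lime_explainer.explain_instance(
    X_test[i], gbdt.predict_proba, num_features=10)
print("LIME local prediction:", lime_exp.local_pred,
      "vs. GBDT probability:", gbdt.predict_proba(X_test[i:i + 1])[0, 1])
print(lime_exp.as_list())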

It is argued in the literature that the feature importance value of a feature obtained
from a classifier differs, in terms of weights, from the contribution of the same
feature in an additive feature attribution model (Letzgus et al., 2022). However,
normalizing the feature importances from GBDT and the contributions from SHAP
and LIME revealed several similarities in the order of features chosen by the
methods. For example, all three methods identified the same feature as the most
influential one in both tasks: vertical acceleration for risk and the standard deviation
of the accelerator pedal position for hurry classification (Figure D.9). In risk
classification, it is justified that vertical acceleration is the most contributing
feature, as it corresponds to the lifting of the front part of the vehicle due to sudden
acceleration. In this scenario, the vehicle often gets out of control, and the concerned
events include driving at roundabout exits with pedestrian crossings, manoeuvring after
a left turn, etc. In the other classification task, hurry, the standard deviation of the
accelerator pedal position corresponds to frequent pressing of the pedal with varying
intensity, which is plausibly an indication of hurry. Here, the concerned events are
similar to the events mentioned for risk.


Figure D.9: Feature importance values are extracted from GBDT, SHAP & LIME,
normalized and illustrated with horizontal bar charts for corresponding classification
tasks. The order of the features based on the importance values is presented in tables
on either side. Features with the same order across methods are highlighted in the
order tables. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
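
Continuing the same sketch, the normalisation and ordering illustrated in Figure D.9 could be approximated as follows; the L1 normalisation, the use of mean absolute SHAP values over the test set and the single-instance LIME weights are assumptions for illustration, not necessarily the choices made in the paper.

import numpy as np

gbdt_imp = gbdt.feature_importances_            # global impurity-based importances
shap_imp = np.abs(shap_values).mean(axis=0)     # mean |SHAP| value per feature
lime_map = dict(lime_exp.as_map()[1])           # {feature index: weight} for one instance
lime_imp = np.array([abs(lime_map.get(j, 0.0))
                     for j in range(X_train.shape[1])])

def normalise(v):
    # Scale importances so that they sum to one (assumed normalisation).
    total = v.sum()
    return v / total if total > 0 else v

orders = {name: np.argsort(-normalise(v))
          for name, v in (("GBDT", gbdt_imp), ("SHAP", shap_imp), ("LIME", lime_imp))}
for name, order in orders.items():
    print(name, "most influential features:", order[:5])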

Several similar ranks of the features based on their contributions from SHAP and LIME
motivated a comparison using the nDCG score, which computes the similarity of
retrieved information. In this work, the retrieved information is the order of the
features according to their importance values or contributions to the prediction. The
nDCG scores were computed for all the instances together, and also computed for
individual predictions and then averaged. The rank of the features based on the
normalized feature importance from the base model GBDT was used as the reference
while calculating the nDCG score, in order to assess how similar the explanation
models are to the classifier. As with local accuracy, SHAP produced better results than
LIME in terms of nDCG score. To investigate further, ρ was computed with the null
hypothesis that the feature ranks produced by the different methods are uncorrelated.
With the test results, the null hypothesis was rejected, as all the measurements were
statistically significant with p values lower than 0.05. All the values of the nDCG
score and ρ are reported in Table D.9. Another noteworthy aspect observed from the
metrics evaluating the explanation models is that SHAP produced better results for
risk classification, whereas LIME performed better for hurry classification. The
performance of SHAP complements the performance summary of the classification models
presented in Table D.8, where risk classification showed better performance than hurry
classification. It is also plausible that, if the local accuracy of an explanation
model is higher, the rankings of the attributed features are also more relevant, which
is evident in the corresponding nDCG scores and ρ values.
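
A possible way to reproduce these agreement metrics, continuing from the previous sketches, is shown below: scikit-learn's ndcg_score is used with the normalised GBDT importances as the reference relevance, and SciPy's spearmanr provides ρ with its p value. The exact relevance encoding used in the paper may differ.

from scipy.stats import spearmanr
from sklearn.metrics import ndcg_score

reference = normalise(gbdt_imp).reshape(1, -1)  # relevance taken from the base model
for name, imp in (("SHAP", shap_imp), ("LIME", lime_imp)):
    scores = normalise(imp).reshape(1, -1)
    rho, p_value = spearmanr(gbdt_imp, imp)
    print(f"{name}: nDCG = {ndcg_score(reference, scores):.4f}, "
          f"rho = {rho:.4f} (p = {p_value:.2e})")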

Figure D.10: Low fidelity prototype of proposed drivers’ behaviour monitoring system
for simulated driving. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

D.3.4 Proposed Interpretable System


Combining all the presented outcomes, a system is proposed for monitoring drivers’
behaviour in simulated driving. Figure D.10 illustrates a low-fidelity prototype of the
proposed system. The prototype consists of three segments, A, B and C, which also
represent the flow of operation of the system. In segment A, the participants and their
driven laps will be listed. Upon selecting a participant and a specific lap, the GPS
plot of the lap will be presented in segment B with a heatmap representing the driving
velocity. Moreover, the events defined by the road infrastructure will be marked with
green rectangles. The event rectangles will be coloured red and orange for the presence
of risk and hurry, respectively; for their concurrent presence, there will be a double
rectangle, as shown in the illustration. In the next step, if an event with risk or
hurry is clicked, segment C will present the features contributing to the specific
classification and their contributions in terms of SHAP values. In the prototype, an
explanation for the selected risky event is shown. Users can also set the number of
contributing features to display using the control in the top right corner of
segment C. This system can be utilized to analyse drivers’ behaviour and correct
driving styles, ensuring a safer road environment for all users. The information shown
in segment C contains features from both the vehicle and the EEG signals that, according
to the literature, are relevant to the risky and hurried behaviour of drivers. An expert
from the corresponding domain can relate the changes in feature values and their effect
on the prediction, and convey specific instructions to modify the drivers’ behaviour
to make their driving safer.
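
As a small illustration of what segment C could render, the snippet below, again continuing from the earlier sketches, selects the top-N features of the chosen event by absolute SHAP value and formats them for display; the feature names and the value of N are placeholders.

import numpy as np

def top_contributions(shap_row, names, n=5):
    # Rank the features of one explained instance by absolute SHAP value.
    order = np.argsort(-np.abs(shap_row))[:n]
    return [(names[j], float(shap_row[j])) for j in order]

for name, value in top_contributions(shap_values[i], feature_names, n=5):
    direction = "towards" if value > 0 else "against"
    print(f"{name}: {value:+.4f} (pushes the prediction {direction} the positive class)")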

D.4 Conclusion and Future Works

The work presented in this paper can be summarised in three aspects: i) a comparative
analysis of car drivers’ behaviour in simulator and track driving for different
traffic situations, ii) the development of classifier models to detect risk or hurry in
drivers’ behaviour and iii) the explanation of the risk and hurry classifications with
feature attribution techniques, together with a proposed system for drivers’ behaviour
monitoring in simulated driving. The first outcome is a novel analysis that includes
experimentation with both simulation and track driving. The second and third outcomes
can be utilised concurrently in enhancing simulator techniques to train road users
for a safer traffic environment through the functional development of the proposed
drivers’ behaviour monitoring system.
The outcomes of this study are encouraging with respect to explanation methods, which
require further research. The lack of prescribed evaluation metrics in the literature
led to the use of metrics borrowed from different concepts. However, the results showed
promising possibilities to enhance and modify these metrics in future work on the
evaluation of explanation methods. Another possible research direction would be
to improve the feature attribution methods to produce more insightful explanations.

Acknowledgements. This study was performed as a part of the project SIMUSAFE funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement N. 723386.

Bibliography

Antwarg, L., Miller, R. M., Shapira, B., & Rokach, L. (2021). Explaining Anomalies
Detected by Autoencoders using Shapley Additive Explanations. Expert
Systems with Applications, 186, 115736.
Barua, S., Ahmed, M. U., Ahlstrom, C., Begum, S., & Funk, P. (2018). Automated
EEG Artifact Handling With Application in Driver Monitoring. IEEE
Journal of Biomedical and Health Informatics, 22 (5), 1350–1361.
Busa-Fekete, R., Szarvas, G., Élteto, T., & Kégl, B. (2012). An Apple-to-apple
Comparison of Learning-to-rank Algorithms in terms of Normalized
Discounted Cumulative Gain. Proceedings of the Workshop on Preference
Learning: Problems and Applications in AI co-located with the 20th European
Conference on Artificial Intelligence (ECAI), 242.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial
Intelligence Research, 16, 321–357.
Corcoran, A. W., Alday, P. M., Schlesewsky, M., & Bornkessel-Schlesewsky, I. (2018).
Toward a Reliable, Automated Method of Individual Alpha Frequency (IAF)
Quantification. Psychophysiology, 55 (7), e13064.

Letzgus, S., Wagner, P., Lederer, J., Samek, W., Muller, K.-R., & Montavon, G.
(2022). Toward Explainable Artificial Intelligence for Regression Models: A
Methodological Perspective. IEEE Signal Processing Magazine, 39 (4), 40–58.
Leyli abadi, M., & Boubezoul, A. (2021). Deep Neural Networks for Classification of
Riding Patterns: With a Focus on Explainability. Proceedings of the European
Symposium on Artificial Neural Networks, Computational Intelligence and
Machine Learning (ESANN), 481–486.
Liu, Y., Khandagale, S., Khandagale, S., White, C., & Neiswanger, W. (2021).
Synthetic Benchmarks for Scientific Research in Explainable Machine
Learning. In J. Vanschoren & S. Yeung (Eds.), Proceedings of the Neural
Information Processing Systems - Track on Datasets and Benchmarks
(NeurIPS Datasets and Benchmarks).
Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model
Predictions. Proceedings of the 31st International Conference on Neural
Information Processing Systems (NeurIPS), 4768–4777.
Oostenveld, R., Fries, P., Maris, E., & Schoffelen, J.-M. (2010). FieldTrip: Open
Source Software for Advanced Analysis of MEG, EEG, and Invasive
Electrophysiological Data. Computational Intelligence and Neuroscience,
2011, e156869.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011).
Scikit-learn: Machine Learning in Python. Journal of Machine Learning
Research, 12 (85), 2825–2830.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”:
Explaining the Predictions of Any Classifier. Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), 1135–1144.
Sætren, G. B., Lindheim, C., Skogstad, M. R., Andreas Pedersen, P., Robertsen,
R., Lødemel, S., & Haukeberg, P. J. (2019). Simulator versus Traditional
Training: A Comparative Study of Night Driving Training. Proceedings of the
Human Factors and Ergonomics Society Annual Meeting, 63 (1), 1669–1673.
Serradilla, O., Zugasti, E., Ramirez de Okariz, J., Rodriguez, J., & Zurutuza, U.
(2021). Adaptable and Explainable Predictive Maintenance: Semi-Supervised
Deep Learning for Anomaly Detection and Diagnosis in Press Machine Data.
Applied Sciences, 11 (16), 7376.
Sokolova, M., & Lapalme, G. (2009). A Systematic Analysis of Performance Measures
for Classification Tasks. Information Processing & Management, 45 (4),
427–437.
Voigt, P., & Von Dem Bussche, A. (2017). The EU General Data Protection
Regulation (GDPR) - A Practical Guide. Springer International Publishing.
Wilcoxon, F. (1992). Individual Comparisons by Ranking Methods. In S. Kotz &
N. L. Johnson (Eds.), Breakthroughs in Statistics (pp. 196–202). Springer
New York.
Wu, S.-L., Tung, H.-Y., & Hsu, Y.-L. (2020). Deep Learning for Automatic Quality
Grading of Mangoes: Methods and Insights. 2020 19th IEEE International
Conference on Machine Learning and Applications (ICMLA), 446–453.

Zar, J. H. (1972). Significance Testing of the Spearman Rank Correlation Coefficient. Journal of the American Statistical Association, 67 (339), 578–580.
Zhou, F., Alsaid, A., Blommer, M., Curry, R., Swaminathan, R., Kochhar,
D., Talamonti, W., & Tijerina, L. (2022). Predicting Driver Fatigue in
Monotonous Automated Driving with Explanation using GPBoost and
SHAP. International Journal of Human–Computer Interaction, 38 (8),
719–729.

Paper E

Investigating Additive Feature Attribution for Regression

Islam, M. R., Weber, R. O., Ahmed, M. U. & Begum, S.

Paper F

iXGB: Improving the Interpretability of XGBoost using Decision Rules and Counterfactuals

Islam, M. R., Ahmed, M. U., & Begum, S.
