Doctoral Dissertation No. 397
Mir Riyanul Islam
2024
Akademisk avhandling
To this end, this thesis work primarily developed explainable models for the application domains of
RS and ATFM. Particularly, explainable models are developed for assessing drivers' in-vehicle mental
workload and driving behaviour through classification and regression tasks. In addition, a novel method
is proposed for generating a hybrid feature set from vehicular and electroencephalography (EEG)
signals using mutual information (MI). The use of this feature set is successfully demonstrated to
reduce the efforts required for complex computations of EEG feature extraction. The concept of MI was
further utilized in generating human-understandable explanations of mental workload classification.
For the domain of ATFM, an explainable model for flight take-off time delay prediction from historical
flight data is developed and presented in this thesis. The gained insights through the development
and evaluation of the explainable applications for the two domains underscore the need for further
research on the advancement of XAI methods.
In this doctoral research, the explainable applications for the DSSs are developed with the additive
feature attribution (AFA) methods, a class of XAI methods that are popular in current XAI
research. Nevertheless, several sources in the literature assert that feature attribution methods
often yield inconsistent results that require proper evaluation. However, the existing body of
literature on evaluation techniques is still immature, offering numerous suggested approaches
without a standardized consensus on their optimal application in various scenarios. To
address this issue, comprehensive evaluation criteria are also developed for AFA methods as the
literature on XAI suggests. The proposed evaluation process considers the underlying characteristics
of the data and utilizes the additive form of Case-based Reasoning, namely AddCBR. The AddCBR
is proposed in this thesis and is demonstrated to complement the evaluation process as the baseline to
compare the feature attributions produced by the AFA methods. Apart from generating an explanation
with feature attribution, this thesis work also proposes iXGB – interpretable XGBoost. iXGB
generates decision rules and counterfactuals to support the output of an XGBoost model, thus
improving its interpretability. From the functional evaluation, iXGB demonstrates the potential to be
used for interpreting arbitrary tree-ensemble methods.
In essence, this doctoral thesis first contributes to the development of thoroughly evaluated explainable
models tailored for two distinct safety-critical domains. The aim is to augment transparency within
the corresponding DSSs. Additionally, the thesis introduces novel methods for generating more
comprehensible explanations in different forms, surpassing existing approaches. It also showcases a
robust evaluation approach for XAI methods.
ISBN 978-91-7485-626-2
ISSN 1651-4238
To my parents and family ...
Acknowledgements
This long journey of my doctoral studies would not have been possible without the blessings
of the Almighty and the guidance, inspiration, and help of numerous people.
I would like to thank my supervisors, Prof. Mobyen Uddin Ahmed and Prof.
Shahina Begum, for providing me with the opportunity to pursue my doctoral studies
and deliberately guiding me through the process. I am deeply grateful for your
indispensable support and supervision throughout the journey.
I am immensely obliged to Prof. Rosina Weber for imparting invaluable knowledge
and insights through her mentorship during the crucial phase of my doctoral studies.
My special thanks to Dr. Shaibal Barua, who has inspired me and from whom I have
learned much over the years of my doctoral studies. I would like to thank my colleagues,
Dr. Hamidur Rahman, Dr. Waleed Jmoona, Arnab Barua, and Md Rakibul Islam,
for their collaboration and moral support. I extend special gratitude to Dr. Shahriar
Hasan and Md Aquif Rahman for going above and beyond as colleagues and for their
unwavering support as friends and brothers during my doctoral studies.
I am thankful to my fellow doctoral students, colleagues and the administrative
staff at Mälardalen University, for their corresponding support. My sincere gratitude
goes to Prof. Sasikumar Punnekkat for his invaluable time and insightful feedback
in reviewing my doctoral research proposal and dissertation.
I would like to express my deep gratitude to the faculty examiner, Prof. Kerstin
Bach, and the grading committee members, Prof. Mark Sebastian Dougherty, Prof.
Fredrik Heintz, and Adj. Prof. Rafia Inam, for kindly accepting the invitation and
dedicating part of your valuable time to review the studies. It is truly my honour to
have you as the reviewers of this dissertation.
Most importantly, I would like to express my deepest and heartfelt gratitude to
my mother, Prof. Anjuman Ara Begum, my father, Mir Rashedul Islam, and my
sister, Rifa Zumana, for always standing by me and supporting me throughout this
journey from several thousand miles away. I especially acknowledge my mother, from
whom I got the inspiration to pursue my doctoral studies.
I am intensely grateful to my wife, Nuzat Naila Islam, for her constant support,
love, and encouragement. Thank you for bearing with me through the most
challenging phase of my life to date, as well as of our lives, for listening patiently,
and for being kind to me.
I would like to express my heartfelt gratitude to all the teachers who played
a pivotal role in guiding and educating me, shaping my academic journey from
pre-school through university, and incrementally preparing me for the degree of
doctorate. Also, I want to express my sincere gratitude to all my friends from home
and abroad who supported and encouraged me during my doctoral studies.
The research studies presented in this doctoral thesis have received funding
from the following projects: i) ARTIMATION 1 , under SESAR Joint Undertaking
(Grant Agreement No. 894238), ii) SIMUSAFE 2 (Grant Agreement No. 723386),
both funded by the European Union’s Horizon 2020 Research and Innovation
Programme, and iii) BrainSafeDrive 3 , co-funded by the Vetenskapsrådet - The
Swedish Research Council and the Ministero dell’Istruzione dell’Università e della
Ricerca della Repubblica Italiana, under the Italy-Sweden Cooperation Program. I
extend my sincere gratitude to all the collaborators from these projects, and it has
been a privilege for me to be part of different research communities.
1 https://www.artimation.eu
2 https://www.cordis.europa.eu/project/id/723386
3 https://www.brainsafedrive.brainsigns.com
Abstract
the literature that assert that feature attribution methods often yield inconsistent
results that require proper evaluation. However, the existing body of literature
on evaluation techniques is still immature, offering numerous suggested approaches
without a standardized consensus on their optimal application in various scenarios.
To address this issue, comprehensive evaluation criteria are also developed for
AFA methods as the literature on XAI suggests. The proposed evaluation process
considers the underlying characteristics of the data and utilizes the additive form of
Case-based Reasoning, namely AddCBR. The AddCBR is proposed in this thesis and
is demonstrated to complement the evaluation process as the baseline to compare
the feature attributions produced by the AFA methods. Apart from generating
an explanation with feature attribution, this thesis work also proposes the iXGB –
interpretable XGBoost. iXGB generates decision rules and counterfactuals to support
the output of an XGBoost model, thus improving its interpretability. From the
functional evaluation, iXGB demonstrates the potential to be used for interpreting
arbitrary tree-ensemble methods.
In essence, this doctoral thesis first contributes to the development of thoroughly
evaluated explainable models tailored for two distinct safety-critical domains. The
aim is to augment transparency within the corresponding DSSs. Additionally, the
thesis introduces novel methods for generating more comprehensible explanations in
different forms, surpassing existing approaches. It also showcases a robust evaluation
approach for XAI methods.
Sammanfattning
Artificiell intelligens (AI) är erkänt som en avancerad teknik som hjälper till att fatta
beslut med hög noggrannhet och precision. Många AI-modeller betraktas dock som
svarta lådor på grund av att de bygger på komplexa slutledningsmekanismer. Hur och
varför dessa AI-modeller når fram till ett beslut är ofta inte begripligt för mänskliga
användare, vilket leder till oro för att deras beslut inte är godtagbara. Tidigare
studier har visat att avsaknaden av tillhörande förklaringar i en för människor
begriplig form gör besluten oacceptabla för slutanvändarna. Forskningsområdet
förklarlig AI (XAI) erbjuder ett brett utbud av metoder med det gemensamma
temat att undersöka hur AI-modeller når fram till ett beslut eller förklarar det.
Dessa förklaringsmetoder syftar till att öka transparensen i beslutsstödsystem
(DSS), vilket är särskilt viktigt inom säkerhetskritiska områden som vägsäkerhet
och flygtrafikflödeshantering. Trots den pågående utvecklingen befinner sig DSS
fortfarande i en utvecklingsfas för säkerhetskritiska tillämpningar. Förbättrad
transparens, som underlättas av XAI, framstår som en viktig faktor för att göra dessa
system praktiskt användbara i verkliga tillämpningar, och för att hantera acceptans-
och förtroendefrågor. Dessutom är det mindre troligt att certifieringsmyndigheterna
godkänner systemen för allmän användning efter det nuvarande mandatet Rätt till
förklaring från Europeiska kommissionen och liknande direktiv från organisationer
över hela världen. Denna önskan att genomsyra de rådande systemen med
förklaringar banar väg för forskningsstudier om XAI som är koncentrerade till
beslutsstödsystem.
För detta ändamål har denna avhandling främst utvecklat förklarbara modeller
för tillämpningsområdena vägsäkerhet och flygtrafikflödeshantering. I synnerhet
utvecklas förklarbara modeller för att bedöma förarnas mentala arbetsbelastning i
fordonet och körbeteende genom klassificerings- och regressionsuppgifter. Dessutom
föreslås en ny metod för att generera en hybridfunktionsuppsättning från fordons- och
elektroencefalografi (EEG) med hjälp av ömsesidig information (MI). Användningen
av denna funktionsuppsättning har framgångsrikt demonstrerats för att minska de
insatser som krävs för komplexa beräkningar av EEG-funktionsextraktion. Begreppet
MI användes vidare för att generera förklaringar av klassificeringen av mental
arbetsbelastning som är begripliga för människor. För flygtrafikflödeshantering
utvecklas och presenteras i denna avhandling en förklaringsmodell för förutsägelse
av tidsfördröjning vid start av flyg från historiska flygdata. De insikter som erhållits
genom utvecklingen och utvärderingen av de förklarbara tillämpningarna för de
två domänerna understryker behovet av ytterligare forskning om utvecklingen av
XAI-metoder.
I denna doktorsavhandling utvecklas de förklarbara applikationerna för DSS med
hjälp av additive feature attribution (AFA) metoder, en klass av XAI-metoder som är
populära inom aktuell XAI-forskning. Det finns dock flera källor i litteraturen som
hävdar att funktionsattributionsmetoder ofta ger inkonsekventa resultat som behöver
utvärderas på ett trovärdigt sätt. Den befintliga litteraturen om utvärderingstekniker
är dock fortfarande omogen och erbjuder många föreslagna tillvägagångssätt utan
ett standardiserat samförstånd om deras optimala tillämpning i olika scenarier.
För att ta itu med detta problem har omfattande utvärderingskriterier även
utvecklats för AFA-metoder som litteraturen om XAI föreslår. Den föreslagna
utvärderingsprocessen tar hänsyn till de underliggande egenskaperna hos data
och använder den additiva formen av case-based reasoning, nämligen AddCBR.
AddCBR föreslås i denna avhandling och demonstreras för att komplettera
utvärderingsprocessen för att jämföra de funktionsattributioner som produceras av
AFA-metoderna. Förutom att generera en förklaring med funktionstillskrivning
föreslår denna avhandling också iXGB – interpretable XGBoost. iXGB genererar
beslutsregler och kontrafakta för att stödja utdata från en XGBoost-modell och
därmed förbättra dess tolkningsbarhet. Den funktionella utvärderingen visar att
iXGB har potential att användas för att tolka godtyckliga träd-ensemble-metoder.
Sammanfattningsvis bidrar denna doktorsavhandling initialt till utvecklingen
av idealiskt utvärderade förklarbara modeller skräddarsydda för två distinkta
säkerhetskritiska domäner. Syftet är att öka transparensen inom de motsvarande
beslutsstödsystem. Dessutom introducerar avhandlingen nya metoder för att
generera mer begripliga förklaringar i olika former, vilket överträffar befintliga
tillvägagångssätt. Den visar också en robust utvärderingsmetod för XAI-metoder.
List of Publications
A1. Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., & Di Flumeri, G.
(2019). Deep Learning for Automatic EEG Feature Extraction: An Application
in Drivers’ Mental Workload Classification. In L. Longo & M. C. Leva
(Eds.), Human Mental Workload: Models and Applications. H-WORKLOAD
2019. Communications in Computer and Information Science (pp. 121–135).
Springer Nature Switzerland.
A2. Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., Aricò, P., Borghini, G.,
& Di Flumeri, G. (2020). A Novel Mutual Information Based Feature Set for
Drivers’ Mental Workload Evaluation using Machine Learning. Brain Sciences,
10 (8), 551.
B. Islam, M. R., Ahmed, M. U., Barua, S., & Begum, S. (2022). A Systematic
Review of Explainable Artificial Intelligence in Terms of Different Application
Domains and Tasks. Applied Sciences, 12 (3), 1353.
C. Islam, M. R., Ahmed, M. U., & Begum, S. (2021). Local and Global
Interpretability using Mutual Information in Explainable Artificial Intelligence.
Proceedings of the 8th International Conference on Soft Computing & Machine
Intelligence (ISCMI 2021), 191–195.
D. Islam, M. R., Ahmed, M. U., & Begum, S. (2023). Interpretable Machine
Learning for Modelling and Explaining Car Drivers’ Behaviour: An Exploratory
Analysis on Heterogeneous Data. Proceedings of the 15th International
Conference on Agents and Artificial Intelligence (ICAART 2023), 392–404.
E. Islam, M. R., Weber, R. O., Ahmed, M. U., & Begum, S. (2023).
Investigating Additive Feature Attribution for Regression [Under Review].
Artificial Intelligence.
F. Islam, M. R., Ahmed, M. U., & Begum, S. (2024). iXGB: Improving the
Interpretability of XGBoost using Decision Rules and Counterfactuals [Under
Review]. 16th International Conference on Agents and Artificial Intelligence
(ICAART 2024).
† This thesis is a comprehensive summary of the listed papers that are referenced in the text
Publications not included in the Thesis –
Journal
• Degas, A., Islam, M. R., Hurter, C., Barua, S., Rahman, H., Poudel, M.,
Ruscio, D., Ahmed, M. U., Begum, S., Rahman, M. A., Bonelli, S., Cartocci,
G., Di Flumeri, G., Borghini, G., Babiloni, F., & Aricó, P. (2022). A Survey
on Artificial Intelligence (AI) and eXplainable AI in Air Traffic Management:
Current Trends and Development with Future Research Trajectory. Applied
Sciences, 12 (3), 1295.
• Hurter, C., Degas, A., Guibert, A., Durand, N., Ferreira, A., Cavagnetto, N.,
Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., Bonelli, S., Cartocci,
G., Di Flumeri, G., Borghini, G., Babiloni, F., & Aricó, P. (2022). Usage of
More Transparent and Explainable Conflict Resolution Algorithm: Air Traffic
Controller Feedback. Transportation Research Procedia, 66, 270–278.
• Ahmed, M. U., Islam, M. R., Barua, S., Hök, B., Jonforsen, E., & Begum,
S. (2021). Study on Human Subjects – Influence of Stress and Alcohol in
Simulated Traffic Situations. Open Research Europe, 1, 83.
Conference/Workshop
• Gorospe, J., Hasan, S., Islam, M. R., Gómez, A. A., Girs, S., &
Uhlemann, E. (2023). Analyzing Inter-Vehicle Collision Predictions during
Emergency Braking with Automated Vehicles. Proceedings of the 19th
International Conference on Wireless and Mobile Computing, Networking and
Communications (WiMob 2023), 411–418.
• Jmoona, W., Ahmed, M. U., Islam, M. R., Barua, S., Begum, S., Ferreira, A.,
& Cavagnetto, N. (2023). Explaining the Unexplainable: Role of XAI for Flight
Take-Off Time Delay Prediction. In I. Maglogiannis, L. Iliadis, J. MacIntyre,
& M. Dominguez (Eds.), Artificial Intelligence Applications and Innovations.
AIAI 2023. IFIP Advances in Information and Communication Technology
(pp. 81–93). Springer Nature Switzerland.
• Ahmed, M. U., Barua, S., Begum, S., Islam, M. R., & Weber, R. O. (2022).
When a CBR in Hand Better than Twins in the Bush. In P. Reuss & J.
Schönborn (Eds.), Proceedings of the 4th Workshop on XCBR: Case-based
Reasoning for the Explanation of Intelligent Systems (XCBR) co-located with
the 30th International Conference on Case-Based Reasoning (ICCBR 2022)
(pp. 141–152). CEUR.
• Islam, M. R., Barua, S., Begum, S., & Ahmed, M. U. (2019). Hypothyroid
Disease Diagnosis with Causal Explanation using Case-based Reasoning and
Domain-specific Ontology. In S. Kapetanakis & H. Borck (Eds.), Proceedings
of the Workshop on CBR in the Health Sciences (WHS) co-located with the 27th
International Conference on Case-Based Reasoning (ICCBR 2019) (pp. 87–97).
CEUR.
Contents
PART I Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Research Goal and Objectives . . . . . . . . . . . . . . . . . . . . . 5
1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Mapping of Research Questions, Contributions and Papers 9
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A2.2 Background and Related Works . . . . . . . . . . . . . . . . . . . . 93
A2.2.1 Assessment of Drivers’ Mental Workload . . . . . . . . . . 94
A2.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 96
A2.3.1 Experimental Protocol . . . . . . . . . . . . . . . . . . . . 96
A2.3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . 97
A2.3.3 Mutual Information Based Feature Extraction . . . . . . . 102
A2.3.4 Prediction and Classification Models . . . . . . . . . . . . 105
A2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A2.4.1 Quantification of Drivers’ Mental Workload . . . . . . . . 107
A2.4.2 Drivers’ Mental Workload and Event Classification . . . . 107
A2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
A2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
C.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 185
C.3.1 Mental Workload Classification . . . . . . . . . . . . . . . 185
C.3.2 Global and Local Explanation . . . . . . . . . . . . . . . . 185
C.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
E.4.7 Evaluation of Additive Feature Attribution Methods . . . 235
E.5 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . 240
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Appendix E.A Description of the Flight Delay Dataset . . . . . . . . . 248
Appendix E.B Selection of Optimal Number of Clusters . . . . . . . . . 250
Appendix E.C Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . 251
List of Figures
A1.1 The experimental circuit is about 2.5 kilometres long along Bologna
roads. ©Springer Nature Switzerland AG 2019. . . . . . . . . . . . . . . 77
A1.2 Steps in the traditional feature extraction technique. ©Springer Nature
Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A1.3 Network architecture of the CNN-AE for feature extraction. ©Springer
Nature Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . 79
A1.4 Variation in classification accuracy with respect to the change of
threshold on feature importance values. ©Springer Nature Switzerland
AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A1.5 MWL classification results in terms of Sensitivity and Specificity.
©Springer Nature Switzerland AG 2019. . . . . . . . . . . . . . . . . . 82
A1.6 AUC-ROC curves for different classifiers with features extracted by
traditional methods and CNN-AE where models were trained using
10-fold cross validation. ©Springer Nature Switzerland AG 2019. . . . . 82
A1.7 AUC-ROC curves for different classifiers with features extracted
by traditional methods and CNN-AE where models were trained
using leave-one-out (participant) cross validation. ©Springer Nature
Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B.4 SLR methodology stages following the guidelines from Kitchenham and
Charters (2007). ©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . 130
B.5 Flow diagram of the research article selection process adapted from the
PRISMA flow chart by Moher et al. (2009). ©2022 by Islam et al. (CC
BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
B.6 Word cloud of the (a) author-defined keywords and (b) keywords
extracted from the abstracts through natural language processing.
©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 136
B.7 Number of publications proposing new methods of XAI from different
countries of the world and the top 10 countries based on the publication
count. ©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . 137
B.8 Chord diagram (Tintarev et al., 2018) presenting the number of selected
articles published on the XAI methods and evaluation metrics from
different application domains for the corresponding tasks. ©2022 by
Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . 140
B.9 Number of the selected articles published from different application
domains and clustered on the basis of AI/ML model type, stage, scope,
and form of explanations. ©2022 by Islam et al. (CC BY 4.0). . . . . . 141
B.10 Venn diagram with the number of articles using different forms of data
to assess the functional validity of the proposed XAI methodologies.
©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 142
B.11 Distribution of the selected articles based on the stage, scope, and form
of explanations. ©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . 145
B.12 Different forms of explanations. ©2022 by Islam et al. (CC BY 4.0). . . 147
B.13 UpSet plot presenting the distribution of different methods of evaluating
the explainable systems. ©2022 by Islam et al. (CC BY 4.0). . . . . . . 153
B.14 Different methods of evaluating explanations, which were presented in
the selected articles with the number of studies given in parentheses.
©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 154
D.1 The experimental route for simulation and track tests. A detailed
description is presented in Section D.2.1. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
D.2 The car simulator developed with DriverSeat 650 ST was used for
conducting the simulation tests. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
D.3 Event extraction using GPS coordinates. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
D.4 GPS coordinates of a single lap driving colour-coded with respect to
different road structures. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . . . . . 197
D.5 Confusion Matrix for both Risk and Hurry Classification. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 201
D.6 Average driving velocity in different laps. The two-sided Wilcoxon
signed-rank test demonstrates a significant difference in the simulator
and track driving with t = 0.0, p = 0.0156. ©2023 by SCITEPRESS
(CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
D.7 Average accelerator pedal position across all the laps and the two-sided
Wilcoxon signed-rank test demonstrate a significant difference in the
simulator and track driving with t = 0.0, p = 0.0156. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 203
D.8 GPS coordinates with varying driving velocity for a random participant
in laps 1 – 6. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . 203
D.9 Feature importance values are extracted from GBDT, SHAP & LIME,
normalized and illustrated with horizontal bar charts for corresponding
classification tasks. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . . 207
D.10 Low fidelity prototype of proposed drivers’ behaviour monitoring system
for simulated driving. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . 208
F.2 Overview of the mechanism of the proposed iXGB. . . . . . . . . . . . . 264
F.3 Prediction Performance of XGBoost in terms of MAE for flight delay
prediction with different numbers of features ranked by XGBoost feature
importance from two different subsets of the data. . . . . . . . . . . . . 267
F.4 Comparison of prediction performance of iXGB and LIME in terms of
MAE with three different datasets. . . . . . . . . . . . . . . . . . . . . . 269
List of Tables
A1.1 Traffic flow intensity in the experimental area during a day retrieved
from General Plan of Urban Traffic of Bologna, Italy. ©Springer Nature
Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A1.2 Number of features selected from different techniques. ©Springer
Nature Switzerland AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . 78
A1.3 Parameters used in different classifiers. ©Springer Nature Switzerland
AG 2019. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A1.4 Average performance measures of classifiers applied on traditionally
extracted features. ©Springer Nature Switzerland AG 2019. . . . . . . 81
A1.5 Average performance measures of classifiers applied on features
extracted by CNN-AE. ©Springer Nature Switzerland AG 2019. . . . . 81
A2.7 Performance summary of classifying Low and High MWL with LgR,
MLP, SVM and RF classifier models using EEG and MI-based feature
on the holdout test set. ©2020 by Islam et al. (CC BY 4.0). . . . . . . 111
A2.8 Performance summary of classifying Car and Pedestrian events with
LgR, MLP, SVM and RF classifier models using EEG and MI-based
feature on the holdout test set. ©2020 by Islam et al. (CC BY 4.0). . . 112
B.1 Inclusion and exclusion criteria for the selection of research articles.
©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . 132
B.2 Questions for checking the validity of the selected articles. ©2022 by
Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.3 List of prominent features extracted from the selected articles. ©2022
by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . 135
B.4 List of references to selected articles published on the methods of XAI
from different application domains for the corresponding tasks. ©2022
by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . . . . . . . . . 138
B.5 Different models used to solve the primary task of classification or
regression and their study count. ©2022 by Islam et al. (CC BY 4.0). . 142
B.6 Methods for explainability, stage and scope of explainability, forms of
explanations and the type of models used for performing the primary
tasks. ©2022 by Islam et al. (CC BY 4.0). . . . . . . . . . . . . . . . . 149
D.1 Associated scenarios for the laps of the experimental simulator and track
driving with varying driving conditions. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
D.2 List of features extracted from vehicular signals. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 197
D.3 List of biometric features considering different frequency bands of EEG
signal. ©2023 by SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . 198
D.4 Summary of the datasets from the simulator and track experiments for
risk and hurry classification. ©2023 by SCITEPRESS (CC BY-NC-ND
4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
D.5 Parameters used in tuning different AI/ML models for classifying risk
and hurry in driving behaviour with 5-fold cross validation. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 200
D.6 Performance measures of risky behaviour classification with the AI/ML
models trained on the holdout test set of different datasets. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 204
D.7 Performance measures of hurry classification with the AI/ML models
trained on the holdout test set of different datasets. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 204
D.8 Summary of model performances in terms of accuracy across different
datasets and classification tasks. ©2023 by SCITEPRESS (CC
BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
D.9 Pairwise comparison of performance metrics for SHAP and LIME on
combined Xtest (holdout test set) for risk and hurry. ©2023 by
SCITEPRESS (CC BY-NC-ND 4.0). . . . . . . . . . . . . . . . . . . . . 206
E.1 Methods, metrics or axioms used for evaluating XAI methods with
references to the works in which they were proposed or employed. . . . . 220
E.2 Summary of the generated synthetic datasets for evaluation. . . . . . . . 227
E.3 MAE and standard deviation (σAE ) of XGBoost and AddCBR
predicting flight delay. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
E.4 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their importance from
XGBoost and AddCBR. . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
E.5 The maximum (maxnDCG ), average (µnDCG ), and standard deviation
(σnDCG ) of nDCG scores for the feature ranking from SHAP and LIME
for all the test instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
E.6 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their contributions from
SHAP and LIME. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
E.A.1 List of features in the flight delay dataset and their associated identifiers
used to refer to the features in the article. . . . . . . . . . . . . . . . . . 248
E.C.1 MAE and standard deviation (σAE ) of XGBoost and AddCBR
predicting flight TOT delay with the two- and eight-cluster datasets. . . 252
E.C.2 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their importance from
XGBoost and AddCBR for the two-cluster dataset. . . . . . . . . . . . . 254
E.C.3 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their importance from
XGBoost and AddCBR for the eight-cluster dataset. . . . . . . . . . . . 254
E.C.4 The maximum (maxnDCG ), average (µnDCG ), and standard deviation
(σnDCG ) of nDCG scores for the feature ranking from SHAP and LIME
for all the test instances from the two-cluster dataset. . . . . . . . . . . 255
E.C.5 The maximum (maxnDCG ), average (µnDCG ), and standard deviation
(σnDCG ) of nDCG scores for the feature ranking from SHAP and LIME
for all the test instances from the eight-cluster dataset. . . . . . . . . . . 255
E.C.6 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their contributions from
SHAP and LIME for the two-cluster dataset. . . . . . . . . . . . . . . . 257
E.C.7 Average impact on prediction measured in percentage for the change in
values of top and bottom five features based on their contributions from
SHAP and LIME for the eight-cluster dataset. . . . . . . . . . . . . . . . 257
F.1 Summary of the datasets used for evaluating the performance of iXGB. 267
F.2 Coverage scores of the rules extracted from iXGB and LIME. . . . . . . 270
F.3 Set of counterfactuals generated using iXGB from the Auto MPG dataset. . . . 271
F.4 Set of counterfactuals generated using iXGB from the Boston Housing
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
F.5 Set of counterfactuals generated using iXGB from the Flight Delay dataset. . . 272
List of Abbreviations & Acronyms
MI Mutual Information
ML Machine Learning
MLP Multi-layer Perceptron
MSE Mean Squared Error
MWL Mental Workload
nDCG Normalised Discounted Cumulative Gain
NN Neural Networks
RBF Radial Basis Function
RC Research Contribution
RF Random Forest
ROC Receiver Operating Characteristic
RQ Research Question
RS Road Safety
SHAP Shapley Additive Explanations
SLR Systematic Literature Review
SVM Support Vector Machine
TOT Take-off Time
XAI Explainable Artificial Intelligence
XGBoost Extreme Gradient Boosting
Part I
Thesis
Chapter 1
Introduction
This chapter introduces the research context and presents the problem
formulation, objectives, research questions and research contributions of the
doctoral research study, followed by the outline of this thesis.
After decades, researchers have proposed XAI as the third wave of AI, which
can overcome the barrier of explainability and enable end users to understand
and effectively manage the emerging generation of AI systems (Gunning & Aha,
2019). Nevertheless, some AI or ML models are still opaque, unintuitive, and
incomprehensible to humans regarding their inference mechanisms (Ribeiro et al.,
2016; Guidotti et al., 2019; Mueller et al., 2019). These models, such as Deep
Learning (DL) models, Support Vector Machines (SVM), etc., are often termed
black box models since it is not clear to end users how the models reach
a decision, i.e., the inference mechanism is not explicit. These black box models
construct the prevailing systems for assisting humans in various decision-making tasks
because of their commendable performance in decision accuracy. On the contrary,
they lack interpretability and/or explainability since the inference mechanisms of
these models are not transparent to the end-users for decision-making tasks. Thus,
the term transparency has evolved, referring to the opposite characteristic of black
box models, i.e., an understanding of the mechanism by which a model works
to assist end users in decision-making (Adadi & Berrada, 2018; Lipton, 2018;
Barredo Arrieta et al., 2020).
Decision Support Systems (DSS) comprise an area in Information Systems, which
focuses on supporting and improving the decision-making process by humans (Arnott
& Pervan, 2014). There are different types of DSS, e.g., expert systems, analytic
systems, recommender systems, etc., that are used in various application domains
(Arnott & Pervan, 2014). DSS are among the most prominent applications of
traditional AI models (Negnevitsky, 2004). In addition, the conventional belief in
DSS literature is that when decision-makers are presented with enhanced processing
capabilities, they are likely to employ them for a more in-depth analysis of problems,
resulting in improved decision-making (Todd & Benbasat, 1992). Thus, the research
on developing explainable models in the DSS is emerging with the goal of enhancing
the transparency of the decision-making process. Adding explainability to a DSS
refers to incorporating details of the inference mechanism of the models that
present the initial decision to humans. Traditional AI-based DSS face the
challenge of being black box systems, where it is hard to understand the reasoning
behind a specific decision or output. Without explanations, users might face
challenges in recognizing and dismissing inaccurate recommendations from a system,
and they may hesitate to accept valid advice from the system, even when it is
accurate (Lacave & Díez, 2002, 2004; Martens & Provost, 2014). XAI investigates
different methodologies to address this challenge by generating an interpretable and
human-understandable explanation for the decisions of AI-based models. This would
provide practitioners with the operational awareness to make the final decision, thus
mitigating acceptability issues. Researchers are currently investigating DSS built with
traditional AI models in several safety-critical domains, such as Road Safety
(RS) and Air Traffic Flow Management (ATFM). In these domains, the availability
of a transparent and interpretable AI system would enable better-informed decisions,
increased trust, and reduced liability. This also aligns with the running hypothesis
of XAI research, that is, to build more transparent, interpretable, or explainable
systems so that the users will be better equipped to understand and therefore trust
the intelligent agents (Mercado et al., 2016; Miller, 2019).
To generalize the formulated problems, set the specific objectives for this doctoral
research, and upon completion, to facilitate the validation of the research with
concrete outcomes, the following RQs are asserted:
RQ1: How are the XAI methods implemented and evaluated across various
application domains?
The RQs outlined in Section 1.3 are addressed with several research contributions
(RC) in this thesis. Here, the contributions are outlined with brief descriptions, along
with mentions of their dissemination in the included research papers:
Due to the current surge in XAI research, a huge number of publications
regularly address different aspects of explainability. Through a proper
literature review, the noteworthy developments in XAI are summarised
as a crib sheet for initiating research work on XAI methodologies with
quick references. This contribution addresses RQ1 and includes specific
contributions from different perspectives:
The RC1 has been disseminated in two of the included papers in this
thesis. Paper B contains a systematic literature review of the exploited
XAI methods across different domains and applications, which corresponds
to RC1.1. The RC1.2 is asserted particularly for the developments in the
approaches of evaluating XAI research upon realizing that there exists a
need for a consensus on the evaluation methods for XAI. Paper B presents
the evaluation approaches for XAI in different domains and applications.
Particularly, the approaches for evaluating XAI methods through quantitative
experiments are discussed in the background section of the Paper E.
RC2: Development of applications of XAI for RS and ATFM domains.
DSS developed with AI models are already prevailing in the safety-critical
domains of RS and ATFM. However, these domains require further attention
in terms of XAI research to enhance the transparency and acceptability of
the prevailing systems. To address the issue, the following contributions are
made to enhance the transparency of the systems developed with explainable
methodologies for the domains of RS and ATFM:
The first three contributions are for the domain of RS. Papers A1 and A2
both disseminate the RC2.1, i.e., the construction of a feature set that is
comprehensible to humans for interpreting drivers’ MWL assessment model.
Though RC2.1 is not directly connected to the methodologies of XAI, it
contributes to data transparency, and the outcome of the corresponding
studies influenced the use of the developed methodology in RC2.2; this
contribution is disseminated in Papers A1, A2, and C. The RC2.3 is about the
development of explainable driving behaviour classification models, and their
evaluations are presented in Paper D. The RC2.4 is asserted for the domain of
ATFM, which concerns the development of an explainable flight TOT delay
prediction model. The development of the explainable model is described in
Section 4.2.1 of this thesis. In addition, the flight TOT delay prediction model
is used in the process of developing an evaluation approach for XAI methods
and generating rule-based and counterfactual explanations, that are presented
in Papers E and F, respectively.
In brief, concerning the RQs of this doctoral research, RC2.2, RC2.3, and
RC2.4 correspond to RQ2.2 in terms of RS and ATFM. In addition, RC2.3
and RC2.4 correspond to RQ2.3. Finally, RQ2.1 is addressed jointly by RC2.1 and
RC2.2.
RC3: Advancement of the XAI research field.
The research field of XAI is continuously growing, and diverse methods are
evolving regularly. Still, the literature indicates that certain aspects need
further attention to advance XAI research, especially the evaluation of XAI
methods. To this end, the following contributions are asserted to advance the
emerging field of XAI research:
RC3.1: Methods for generating explanation in various forms.
RC3.2: Approach for evaluating XAI methods.
Two novel methods for generating explanations are developed in this doctoral
research, which are accorded as the RC3.1. The first method is the
additive form of Case-based Reasoning (CBR), namely AddCBR. The second
method is developed to interpret the decision of Extreme Gradient Boosting
(XGBoost) models. The method is named Interpretable XGBoost (iXGB).
The development of the methods AddCBR and iXGB is described in Papers
E and F, respectively. The final contribution of this doctoral
thesis, RC3.2, concerns the development of an approach to evaluate XAI
methods using a synthetic dataset that captures the intrinsic behaviour of the
original data. The evaluation method is described in Paper E. As a whole,
the RC3 encapsulates the contributions to the advancements of the research
field of XAI and thus addresses specifically RQ2.2 and RQ2.3.
• Part I – Thesis
This part provides a comprehensive overview of the thesis. It contains the
introduction to the thesis in Chapter 1, background and related work in Chapter
2, the materials and methods used in this doctoral study in Chapter 3, and the
description of the developed applications, their evaluation methods and results
Figure 1.1: Generalized mapping of the RQs, RCs, and the included papers. For
a concise presentation, the titles of the papers and the RCs are presented with
abbreviations, and the sequence of the papers is rearranged to minimize the overlapping
of the links.
Chapter 2
Background and Related Works
This chapter contains the foundational insights into the research domain
and presents a critical review of the existing literature within the specific
applications addressed in this thesis.
Figure 2.2: Different scopes of adding explainability to black box models illustrated
with an example decision tree. The nodes in black colour refer to decision nodes for
a single feature. The green and red edges in the tree refer to positive and negative
outcomes, respectively, for satisfying the conditions.
help to ensure that explainable AI is both accurate and effective in supporting human
decision-making by comparing two XAI methods that belong to the category of AFA
methods, namely, SHAP (Lundberg & Lee, 2017) and LIME (Ribeiro et al., 2016).
In addition, this research also develops a novel method of generating a benchmark
dataset to evaluate the XAI methods.
Road safety is largely related to drivers' behaviour and characteristics, as more
than 90% of traffic injuries occur due to drivers' errors while driving (Sam et al.,
2016); driving itself is a combination of several dynamic and complex activities, including
simultaneous visual, cognitive and spatial tasks (H. Kim et al., 2018). Fastenmeier
and Gstalter (2007) defined driving as a human-machine system that continuously
changes with the environment. The components of the environments are traffic flow
(high or low), road layout (straight, junctions, roundabout or curves), road design
(motorways, city or rural), weather (rainy, snowy or windy), time of day (morning,
midday or evening), etc. These components define the overall complexity of the
driving task. To increase drivers’ vigilance during driving, different policy-making
authorities worldwide have published mandates for compulsory installation of safety
features in newly produced automotive vehicles from the year 2022 (European
Commission, 2019). These mandates demand extensive research on the development
of intelligent systems for road safety that are transparent and acceptable to humans.
Air Traffic Flow Management (ATFM) is a branch of Air Traffic Management (ATM),
a vast and complex domain (Erdi, 2008) encompassing all activities carried out to
ensure the safety and fluidity of air traffic. Specifically, ATM aims at efficiently
managing and maximising the use of the different resources available to it, for
example, the airspace and its subdivisions such as the sectors, the air routes, the
airports, and the runways, by the users of these resources (e.g., aircraft and airlines),
in any timeframe of their use, i.e., in the taxi phase at the airport or in any flight
phase, simplified by the triplet climb, cruise, and descent, while ensuring flight
safety (Allignol et al., 2012).
More specifically, ATFM largely deals with balancing the demand and capacity of air
traffic by modifying the airspace (Degas et al., 2022; Hurter et al., 2022). Generally,
the airspace is divided into several sectors by the respective flying authorities, and
the airspaces have a limited capacity to handle the number of flights flying over
the airspace. The demand for airspace usage is put forward by different aircraft
carriers. The ATFM practitioners analyse the historical data of flown flights and
maintain the balance of demand and capacity by considering the propagation of
flight take-off time (TOT) delays. ATFM delay costs airlines, on average, approximately
100 Euros per minute (Cook & Tanner, 2015). According to the Federal Aviation
Administration1 (FAA) report2 in 2019, the estimated cost due to delay, considering
airlines, passengers, lost demand, and indirect costs, was thirty-three billion dollars.
This high cost justifies the increased interest in predicting TOT delays (Dalmau et
al., 2021).
increasing the trust of ATFM practitioners in the AI-based DSS that motivates the
development and evaluation of an explainable flight delay prediction model within
the scope of this doctoral research study.
Chapter 3
Materials and Methods
This doctoral research has been conducted as exploratory research under the
framework of three scientific research projects: two for the domain of Road Safety
(RS) and one for the domain of Air Traffic Flow Management (ATFM). The first
project SIMUSAFE 1 (SIMUlation of behavioural aspects for SAFEr transport),
was funded by the European Union’s Horizon 2020 Research and Innovation
Programme. This project was a collaboration among several institutions across
Europe and aimed to improve driving simulators and traffic simulation technology
with Machine Learning (ML) to safely assess risk perception and decision-making
of four principal types of road users: car drivers, motorcyclists, cyclists and
pedestrians. In this research study, the behaviours of car drivers are analysed
only; other types of road users were analysed by other research partners. The
second project, BrainSafeDrive 2 , developed a technology to detect drivers' in-drive
mental state from neuro-physiological signals to improve road safety.
It was initiated as a collaboration between the Sapienza University of
Rome & BrainSigns s.r.l.3 from Italy and the Mälardalen University (MDU)
from Sweden co-funded by the Vetenskapsrådet - The Swedish Research Council4
and the Ministero dell’Istruzione dell’Università e della Ricerca della Repubblica
Italiana5 under the Italy-Sweden Cooperation Program. Both projects, SIMUSAFE
and BrainSafeDrive, addressed the domain of RS. Lastly, the third project, ARTIMATION 6 (Transparent
Artificial Intelligence and Automation to Air Traffic Management Systems), was a
collaborative project conducted to provide a transparent and explainable
model through visualisation, data-driven storytelling and immersive analytics. This
project was carried out under the SESAR Joint Undertaking within the European Union's
Horizon 2020 Research and Innovation Programme (Grant Agreement No. 894238).
1 https://www.cordis.europa.eu/project/id/723386
2 https://www.brainsafedrive.brainsigns.com
3 https://www.brainsigns.com
4 https://www.vr.se
5 https://www.mur.gov.it
6 https://www.artimation.eu
The data for the thesis work was acquired as part of all the projects mentioned
earlier. The data utilised for the studies on applications of XAI in RS was acquired
from BrainSafeDrive and SIMUSAFE, which contained heterogeneous forms of data
from driving experiments. For ATFM, the aviation dataset was acquired within the
framework of the project ARTIMATION.
Parameter                              Value
Number of instances                    7,613,584
Number of aircraft                     18,214
Number of flights                      609,202
Maximum number of flights per day      152
Minimum number of flights per day      1
Average number of flights per day      12
Minimum delay (minutes)                0
Maximum delay (minutes)                68
Average delay (minutes)                15
This doctoral thesis presents the outcome of research conducted at different levels,
such as literature review, data collection and the development of new methodologies,
through several exploratory research studies (Yeation et al., 1995), i.e., all the studies
have been conducted in three stages: exploration, generation and evaluation. In addition,
the inductive approach (Young et al., 2020) has been followed to present the outcome of
the studies, which includes three steps: observation, generalization, and paradigm.
The levels concerning the exploratory research works and literature review are briefly
described in the following paragraphs, and Figure 3.1 illustrates the connections
among the significant aspects, including data collection.
The exploratory research works performed are outlined below; together, they aim to
enhance the transparency of DSS through the utilisation of XAI methods:
• In the initial stage of the doctoral study, the traditional technique of feature
extraction from EEG signals was automated with the use of a convolutional
neural network autoencoder (CNN-AE). This exploration of the CNN-AE reduced
the complex and computationally expensive signal processing and manual
calculations required for EEG feature extraction when assessing drivers' mental
workload.
• A novel hybrid template for vehicular parameters and EEG to measure drivers’
in-vehicle mental workload was derived using mutual information (MI). Upon
recording the EEG signal once, the use of the template can further be utilised to
9 https://www.aptiv.com
10 https://www.eurocontrol.int/dashboard/rnd-data-archive
estimate the features of the EEG signal from the concurrent vehicular signal.
This would reduce the complexity of recording drivers' EEG repeatedly while
driving (a minimal sketch of this MI-based estimation is given after this list).
• An explainable mental workload assessment model for car drivers was developed
with SHAP. Here, MI was used to relate the auto-encoded features with the
traditional features of EEG and presented with chord diagrams along with the
visual explanation from SHAP.
• Explainable models for classifying drivers’ risky and hurried behaviours were
developed. The explanation was generated with two different Additive Feature
Attribution (AFA) methods: SHAP and LIME. Here, the feature attributions
were evaluated with quantitative metrics that originate from other domains but
serve a function similar to that of AFA methods.
• Prediction models developed with XGBoost perform better than traditional
decision trees, but they lack interpretability. A novel method was therefore
explored that devises a single decision tree to represent the inference of the
several trees in an XGBoost model, thereby retaining interpretability.
• Evaluation of explainable models is an important topic of XAI research
because of the scarcity of standard metrics and methods to assess their
performance. To support the evaluation of explainable models, a method of
generating benchmark datasets is developed in this doctoral research.
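To make the hybrid-template idea above more concrete, the following is a minimal sketch of estimating mutual information between vehicular features and an EEG band-power feature with scikit-learn. The data, feature names, and dimensions are hypothetical illustrations and do not reproduce the exact procedure of Papers A2 and C.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Hypothetical feature matrices: one row per driving window.
rng = np.random.default_rng(42)
vehicular = rng.normal(size=(300, 6))  # e.g. speed, steering and pedal statistics
# A hypothetical EEG band-power feature partially driven by the first vehicular feature.
eeg_theta = 0.8 * vehicular[:, 0] + rng.normal(scale=0.3, size=300)

# Estimate mutual information between each vehicular feature and the EEG feature.
mi = mutual_info_regression(vehicular, eeg_theta, random_state=0)
ranking = np.argsort(mi)[::-1]
print("Vehicular features ranked by MI with the EEG feature:", ranking)
print("MI scores:", np.round(mi[ranking], 3))
```

Vehicular features with high MI scores are the natural candidates for estimating EEG-derived quantities from concurrently recorded vehicular signals.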
The dissemination of the exploratory research works and their connection to the
specific research questions and contributions are discussed in Section 1.4.
3.3 Methods
This section presents the summaries of the classifier or regression models, XAI
methods and evaluation metrics invoked across different experimental studies
presented in this thesis. Implementation details are presented in the respective papers
included in this thesis.
similar properties within the test dataset (Larose, 2004). Furthermore, k-NN is
deemed a universally consistent classifier (Luxburg & Schölkopf, 2011), employing the
Euclidean distance metric to identify the k closest neighbours in the dataset for each
instance. Given its reliance on a distance function, explaining the nearest-neighbour
model during predictions is straightforward. However, explaining the inherent
knowledge acquired by the model can be challenging.
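As a minimal illustration of this nearest-neighbour reasoning (a sketch on synthetic data, not the configuration used in the included papers), the k closest training instances can themselves be inspected as an example-based justification for a prediction:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic tabular data standing in for the extracted feature sets.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-NN with the Euclidean distance metric, as described above.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# Retrieve the five nearest training instances for the first test instance.
distances, neighbour_idx = knn.kneighbors(X_test[:1])
print(knn.predict(X_test[:1]), neighbour_idx, distances)
```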
3.3.2.1 SHAP
SHAP – Shapley Additive Explanations (Lundberg & Lee, 2017) – is an explainability
tool built on a mathematical technique derived from the Shapley values proposed by
Shapley (1953) in cooperative game theory. Shapley values
are a mechanism to fairly assign impact to features that might not have an equal
influence on the predictions. To generate additive explanations for predictions from
black-box models, the concept of Shapley value was incorporated. In delay prediction,
to explain the decisions from the model (i.e., predictions), SHAP calculates the
contribution of each feature to the prediction of the model. SHAP is available
as a tool11 for Python, which generates explanations for text and image data with
its Explainer implementation. For tabular data, KernelExplainer is model-agnostic,
whereas TreeExplainer supports tree-based models, both single trees and ensembles.
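As an illustration, a minimal usage sketch of SHAP with TreeExplainer on synthetic tabular data is given below; the data, model, and parameters are assumptions made for the example and do not correspond to the models developed in this thesis. The last lines check the additive property, i.e., that the expected value plus the per-feature attributions approximately reconstructs each prediction.

```python
import numpy as np
import pandas as pd
import shap
import xgboost

# Synthetic tabular data standing in for, e.g., flight delay features.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f1", "f2", "f3", "f4"])
y = 3 * X["f1"] - 2 * X["f2"] + rng.normal(scale=0.1, size=500)

model = xgboost.XGBRegressor(n_estimators=200).fit(X, y)

# TreeExplainer computes Shapley-value attributions for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Additivity check: expected value + per-feature contributions ~ model prediction.
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print(np.abs(reconstructed - model.predict(X)).max())
```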
3.3.2.2 LIME
LIME stands for Local Interpretable Model-agnostic Explanations (Ribeiro et al.,
2016). It is a tool that uses an interpretable model to approximate each individual
prediction made by any black box ML model. LIME uses a three-step process to
determine the specific contributions of the chosen features: perturbing the original
data points, feeding them to the black-box model, and then observing the related
predictions. LIME is available as a package12 for Python, which is used to generate
explanations for tabular, image and text data.
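A minimal sketch of this perturb, predict, and fit procedure on tabular data follows; the data and the stand-in black-box model are illustrative assumptions rather than the setups used in the included papers.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic tabular data and an arbitrary black-box regressor.
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    training_data=X,
    feature_names=[f"f{i}" for i in range(5)],
    mode="regression",
)

# Perturb around one instance, query the black box, and fit a local linear surrogate.
explanation = explainer.explain_instance(X[0], model.predict, num_features=5)
print(explanation.as_list())  # per-feature contributions of the local surrogate
```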
3.3.2.3 DALEX
Model-Agnostic Language for Exploration and Explanations, in short, DALEX is
a Python library built upon the software for explainable ML proposed by Biecek
(2018). The main goal of the DALEX tool is to create a level of abstraction around
a model that makes it easier to explore and explain the model. Explanation deals
with two uncertainty levels: model level and explanation level. The underlying idea
is to capture the contribution of a feature to the model’s prediction by computing the
shift in the expected value of the prediction while fixing the values of other features.
In this study, for flight TOT delay prediction, DALEX has been used as a Python
package13 to generate an interactive Breakdown plot, which detects local interactions
of user-selected features.
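And a corresponding illustrative sketch for DALEX, with a placeholder model and data frame; predict_parts with the break-down-interactions type produces the values behind the interactive Breakdown plot:

import numpy as np
import pandas as pd
import dalex as dx
from sklearn.ensemble import RandomForestRegressor

# Placeholder data frame of features and a continuous target (e.g., delay in minutes).
X = pd.DataFrame(np.random.rand(500, 4), columns=["f1", "f2", "f3", "f4"])
y = X["f1"] * 3.0 - X["f2"] * 1.5 + np.random.randn(500)
model = RandomForestRegressor(n_estimators=100).fit(X, y)

# The Explainer wraps model, data and target behind one level of abstraction.
explainer = dx.Explainer(model, X, y, label="placeholder delay model")

# Break Down with interactions attributes the shift in the expected prediction
# to individual features for a single observation.
bd = explainer.predict_parts(X.iloc[[0]], type="break_down_interactions")
print(bd.result)     # tabular contributions
# bd.plot()          # interactive plot in a notebook environment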
11 https://shap.readthedocs.io
12 https://github.com/marcotcr/lime
13 https://pypi.org/project/dalex
Chapter 4
Explainable Artificial Intelligence for Decision Support Systems
This thesis work partially contributes to the enhancement of DSS in the domain of
RS by developing two distinct applications that feature explainable models. The first
application is developed for assessing drivers’ in-vehicle Mental Workload (MWL),
while the second application focuses on monitoring drivers’ driving behaviour in
terms of risk and hurry. In the training of these models, a combination of
Electroencephalography (EEG) signals and vehicular data is employed. Novel
approaches have been devised to maximise the utility of EEG signals and extract
features that are comprehensible to humans. This includes the development of a Deep
Learning (DL) based feature extraction method for EEG signals. Additionally, a
hybrid template has been developed to effectively combine vehicular signals with EEG
data, leading to a more comprehensive and interpretable model for RS applications.
For both applications, the explainable models are developed with popular XAI
methods and comparatively evaluated using quantitative measures.
However, the manual technique of extracting features from EEG signals is complex, laborious and computationally expensive. To address these challenges, this study harnessed the computational power of DL, specifically a Convolutional Neural Network based Autoencoder (CNN-AE), to extract features from EEG signals automatically.
The architecture of the CNN-AE is elaborately described in Section A1.3.3 of
Paper A1. The design of the AE has been evaluated by comparing the AE-extracted
features with manually extracted features for classifying drivers’ MWL into high and
low classes. In particular, to support the effectiveness of these features, four different
classifiers have been employed, including Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbours (kNN), and Multilayer Perceptron (MLP). The
performances of the classifiers have been measured using accuracy, balanced accuracy
and F1 score, which are summarised in Table 4.1. In Section A1.4 of Paper A1,
evaluation results with additional metrics are also reported. The results demonstrate
improvement in the performance of the classifier models when utilising AE-extracted
features compared to manually extracted features, underlining the potential of DL
techniques in MWL assessment. Particularly, SVM has achieved 87.00% classification
accuracy when trained with AE-extracted features, whereas the highest accuracy with manually extracted features is 70.83%, achieved by the MLP classifier.
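The reported metrics can be computed with standard scikit-learn routines; the following sketch is illustrative only, with a placeholder feature matrix and untuned classifiers standing in for the tuned models of Paper A1.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Placeholder feature matrix (e.g., AE-extracted EEG features) and binary MWL labels.
X = np.random.rand(400, 124)
y = np.random.randint(0, 2, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "kNN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=500),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
}
for name, clf in classifiers.items():
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name,
          accuracy_score(y_te, y_pred),
          balanced_accuracy_score(y_te, y_pred),
          f1_score(y_te, y_pred))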
Table 4.1: Performance of different models for classifying drivers’ MWL utilising
AE-extracted features compared to manually extracted features. For all the measures,
higher values are better.
Metric              Feature Extraction   kNN      MLP      RF       SVM
Accuracy            AE                   0.7737   0.8504   0.8049   0.8700
                    Manual               0.6420   0.7083   0.6414   0.5388
Balanced Accuracy   AE                   0.7737   0.8504   0.8049   0.8700
                    Manual               0.6420   0.7083   0.6414   0.5388
F1 score            AE                   0.7912   0.8527   0.8197   0.8730
                    Manual               0.6486   0.7151   0.6442   0.5146
EEG and vehicular data can be represented by the vectors e and v, respectively, from the corresponding entropy spaces, and m represents a single instance of I(E, V ), the MI shared by e and v, which provides the template to generate the feature set.
This template enables the replication of EEG features and the generation of hybrid
feature sets by introducing new vehicular data repeatedly. A detailed theoretical
description of the MI-based template generation process is presented in Section A2.3.3
of Paper A2.
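Purely as an illustration of the mutual-information quantity underlying the template (the actual template-generation procedure is defined in Section A2.3.3 of Paper A2), pairwise MI between hypothetical EEG-derived and vehicular features can be estimated with scikit-learn:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Placeholder feature matrices: rows are epochs, columns are features.
eeg_features = np.random.rand(300, 8)        # e.g., band-power features
vehicular_features = np.random.rand(300, 5)  # e.g., speed, steering, pedal position

# Estimate MI between every EEG/vehicular feature pair; the resulting matrix
# plays the role of the shared-information template m described in the text.
mi_template = np.zeros((eeg_features.shape[1], vehicular_features.shape[1]))
for j in range(vehicular_features.shape[1]):
    mi_template[:, j] = mutual_info_regression(eeg_features, vehicular_features[:, j])
print(mi_template.shape)   # (8, 5) pairwise MI estimates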
Figure 4.2: Illustration of shared information between EEG and vehicular signal
spaces (Islam et al., 2020).
To evaluate the effectiveness of this hybrid feature set, it has been utilised
in drivers’ MWL quantification and classification. MLP, RF, and SVM have
been employed to develop classifiers and predictors, with appropriate regression
models used as needed, i.e., Linear Regression (LnR) for quantification and Logistic
Regression (LgR) for classification. The performance of the predictors in MWL quantification is found to be similar for both EEG- and MI-based features. In event classification, however, MI-based features outperform EEG-based features, with the exception of LgR. Sections A2.4 and A2.5 of Paper A2 present
and discuss detailed evaluation results for MWL classification and quantification,
including event classification.
Figure 4.3: Decision Tree for detecting risky driving behaviour. The leaf nodes refer to risk and no risk in driving behaviour, which are coloured with the darkest shade of blue and orange, respectively. All other nodes are decision nodes containing the conditions on corresponding features for splitting the decision paths.
Figure 4.4: Decision Tree for detecting hurried driving behaviour. The leaf nodes refer to hurry and no hurry in driving behaviour, which are coloured with the darkest shade of orange and blue, respectively. All other nodes are decision nodes containing the conditions on corresponding features for splitting the decision paths.
Apart from the studies presented in Paper D, a separate experiment has been
conducted to assess the influential features in drivers’ behaviour classification for risk
and hurriedness. In this experiment, additional features from Electrocardiography
(ECG) and Galvanic Skin Response (GSR) signals have been incorporated in accordance with the requirements of the project SIMUSAFE. In practice, rule-based
explanations are generated to investigate the features that lead to particular
classifications of risky and hurried driving behaviours. The rule-based explanations
are extracted from separate Decision Trees (DT) trained for each of the classification
tasks. The DTs for classifying risky and hurried driving behaviours are illustrated in
Figures 4.3 and 4.4, respectively.
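As an illustrative sketch only (with synthetic data; the feature names below merely echo those appearing in Figures 4.3 and 4.4), such rule-based explanations can be extracted from a trained DT as follows:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["avg_acce_pedal_pos", "std_steer_angle", "avg_speed", "gsr_phasic", "hr"]

# Synthetic stand-in for the labelled driving-behaviour data (1 = Risky, 0 = No Risk).
X = np.random.rand(66, len(feature_names)) * 100
y = (X[:, 0] > 60).astype(int)

# Entropy-based splits, as in the trees shown in Figures 4.3 and 4.4.
dt = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0).fit(X, y)

# Each root-to-leaf path is a human-readable decision rule.
print(export_text(dt, feature_names=feature_names))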
The explainable models for predicting flight TOT delay are developed with popular XAI methods, i.e., LIME
(Ribeiro et al., 2016) and SHAP (Lundberg & Lee, 2017). Another explainability
tool, namely Model-Agnostic Language for Exploration and Explanations (DALEX)
(Biecek, 2018), is exploited alongside SHAP and LIME for comparative analysis of
the output from the XAI methods.
Figure 4.6: Performances of ETFMS, GBDT, RF, and XGBoost for flight TOT delay
prediction in terms of MAE in minutes measured at different time intervals to EOBT
in minutes. For MAE, a lower value is better, and the plots for ETFMS and GBDT
are considered as a reference from experimentation performed by Dalmau et al. (2021).
In this study, for the developed explainable model for flight TOT delay prediction,
the data acquisition in Steps 1 and 2 from Figure 4.5 are described in Section 3.1.2.
Two different regression models have been built using RF and Extreme Gradient
Boosting (XGBoost) in Step 3. The built models have been quantitatively evaluated
by comparing their performances with the Enhanced Tactical Flow Management
System (ETFMS) and GBDT developed for the same task by Dalmau et al. (2021).
The developed models in this study outperform the reference models as illustrated
in Figure 4.6. However, XGBoost has been chosen over RF while developing
Table 4.2: Local accuracy in terms of MAE and nDCG values for SHAP and LIME
while explaining flight TOT delay prediction. The result is presented for all the test
instances and the top 100,000 instances where the XGBoost model predicted the lowest
error. For MAE, a lower value is better, and for nDCG, a higher value is better. The
best values are highlighted with bold fonts.
XAI methods to explain the flight TOT prediction in Step 4. The explainable
methods using SHAP and LIME have been evaluated quantitatively, where SHAP
outperforms LIME. The result of the quantitative evaluation is summarised in Table
4.2. DALEX has not been included in the quantitative evaluation since it only
produces visualisation with its internal values. Finally, in Step 5, three different
explanations have been generated using SHAP (Figure 4.7), LIME (Figure 4.8) and
DALEX (Figure 4.9). These explanations have been evaluated through a user survey
conducted among practising and student ATCOs. The survey protocol and detailed
description of the entire research study for developing and evaluating the explainable
flight TOT delay prediction model are disseminated in a co-authored work (Jmoona
et al., 2023).
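Purely as an illustrative sketch of the two measures (the exact evaluation protocol is specified in Paper E and the co-authored work; all numbers below are placeholders), local accuracy can be checked by reconstructing predictions from attributions, and nDCG can compare an attribution ranking against a reference ranking:

import numpy as np
from sklearn.metrics import mean_absolute_error, ndcg_score

# Placeholder quantities for a handful of test instances.
predictions = np.array([12.0, 35.0, 7.5])           # model outputs (minutes of delay)
base_value = 15.0                                    # explainer's expected value
attributions = np.array([[-2.0, -1.5,  0.5],         # per-feature contributions
                         [10.0,  6.0,  4.0],
                         [-5.0, -1.0, -1.5]])

# Local accuracy: base value + attributions should reproduce the prediction.
reconstructed = base_value + attributions.sum(axis=1)
print("MAE:", mean_absolute_error(predictions, reconstructed))

# Ranking agreement: compare |attribution| rankings against a reference ranking
# (e.g., from a baseline such as AddCBR).
reference_importance = np.abs(np.array([[2.1, 1.4, 0.4],
                                        [9.5, 6.2, 3.8],
                                        [4.8, 1.1, 1.6]]))
print("nDCG:", ndcg_score(reference_importance, np.abs(attributions)))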
Figure 4.7: Explanation for a single instance of flight TOT delay prediction with
feature contributions extracted from SHAP.
Advancement of XAI research covers a substantial part of the goal of this doctoral
thesis. To attain this, two different methods of generating explanations for AI or
ML models’ decisions are developed. Furthermore, a robust approach is proposed to
evaluate the AFA methods using a synthetic dataset that captures the underlying behaviour of the data.
Figure 4.8: Explanation for a single instance of flight TOT delay prediction with
feature contributions extracted from LIME.
Figure 4.9: Explanation for a single instance of flight TOT delay prediction with
feature contributions extracted from DALEX.
All of these proposed methods are briefly discussed in the
following sections and referred to the corresponding papers included in this thesis.
The proposed approach evaluates the XAI methods producing AFA using synthetic datasets, where AddCBR is used as the baseline to evaluate the XAI methods comparatively.
Figure 4.10: Schematic diagram of evaluating XAI methods for AFA using synthetic
dataset.
The steps shown in Figure 4.10 are briefly described below, and they are presented
in their entirety in Paper E.
Chapter 5
Summary of the Included Papers
This chapter presents the summaries of the included papers, the authors’
contributions, and the significant findings.
The papers included in this thesis comprise three journal papers and four
peer-reviewed conference papers. Five of these papers have already been published,
while Papers E and F are under review for publishing in a journal and a conference,
respectively. The subsequent sections present the summary and key findings of the
included papers with the title, authors’ contributions, and publication details. In
addition, the presented contributions in the corresponding papers are mapped with
the research contributions (RC) of this doctoral research, which are described in
Section 1.4.
5.1 Paper A1
Title. Deep Learning for Automatic EEG Feature Extraction: An Application in Drivers' Mental Workload Classification (Islam et al., 2019).
Authors. Islam, M. R., Barua, S., Ahmed, M. U., Begum, S. & Di Flumeri, G.
Authors’ Contributions. Islam is the main author of the paper. He developed the
methodology, executed the implementation, analyzed the results, and wrote the paper.
Barua contributed to the study design and experiments. Ahmed and Begum guided
the study and manuscript preparation. Di Flumeri helped acquire the data and
provided feedback on the methodology as an expert in physiological signal processing.
Summary. The study presented in this paper was initially motivated by the
need to classify drivers’ mental workload (MWL) intended for applications in
Road Safety (RS) using physiological measures, particularly Electroencephalography
(EEG) signals that are considered a suitable measure for MWL. The study was
further influenced by the urge to automate the feature extraction techniques from
EEG signals, reducing manual methods. The paper explored the use of Deep
Learning (DL) algorithms for automatic feature extraction from the EEG signals
to classify drivers’ MWL. It presents a comparative study on DL-based feature
extraction techniques, specifically the Convolutional Neural Network Autoencoder
(CNN-AE), with traditional manual methods. The results demonstrate that
the CNN-AE approach outperforms traditional methods in terms of classification
accuracy. Particularly, four different models – Support Vector Machine (SVM),
k-Nearest Neighbours (kNN), Random Forest (RF) and Multi-layer Perceptron
(MLP) were used to classify MWL in combination with both CNN-AE and traditional
feature extraction methods. The key results of the evaluation experiments reveal
that the highest value for the Area Under the Receiver Operating Characteristic
Curve (AUC-ROC) reached 0.94 while using features extracted by CNN-AE with
an SVM classifier. In contrast, traditional feature extraction methods yielded a
maximum AUC-ROC of 0.78 with an MLP classifier. Thus, this study highlights the
potential of DL techniques in easing the EEG feature extraction techniques and their
application in real-time scenarios to classify MWL, with implications for monitoring
human participants in various safety-critical domains.
5.2 Paper A2
Title. A Novel Mutual Information Based Feature Set for Drivers’ Mental Workload
Evaluation using Machine Learning (Islam et al., 2020).
Authors. Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., Aricò, P., Borghini,
G. & Di Flumeri, G.
Authors’ Contributions. Islam is the main author of the paper. He developed the
methodology, performed the formal analysis, executed the implementation, analyzed
the results, and prepared the original draft of the paper. Barua contributed to
conceptualising the methodology and participated in the discussion while writing
the paper. Ahmed and Begum supervised the study and provided feedback on the
manuscript. Aricò, Borghini, and Di Flumeri acquired and curated the recorded EEG
signals used in this study and reviewed the paper as experts in physiological signal
processing.
Summary. Using EEG signals in MWL assessment while driving requires frequent
use of invasive recording equipment on drivers. Moreover, the features extracted from
the EEG signals are less interpretable for general users. To mitigate these issues,
this paper is motivated to develop a novel methodology for creating a feature set by
fusing EEG and vehicular signals together and utilizing the feature set in assessing
drivers’ MWL. The findings of this study include significant changes in MWL due to
different driving environments and patterns reflected in vehicular signals. With these
vehicular signals recorded live while driving and the predefined template containing
the Mutual Information (MI) between EEG and vehicular signals, a hybrid feature set
is generated for drivers’ MWL quantification and classification. The study compared
the performance of different Machine Learning (ML) algorithms, such as Linear and
Logistic Regression, MLP, RF and SVM, in corresponding tasks of MWL assessment
and classifying events. In these tasks, both MI- and EEG-based feature sets were
used. The results of MWL assessment tasks demonstrate that the performances of
the ML models are similar while using MI- and EEG-based feature sets. However,
the result of event classification is better while using the MI-based features. In contrast, the outcome of a statistical analysis on the performance of classification
tasks suggests that the SVM classifier with MI-based features performed significantly
better in both tasks compared to the other classifiers. This indicates that using
MI-based features can be a viable alternative to EEG-based features for evaluating
MWL and classifying events in driving scenarios.
5.3 Paper B
Authors’ Contributions. Islam, being the paper’s main author, planned and
conducted the literature review, prepared the original manuscript, and wrote the
discussion section by consulting with the co-authors. Ahmed and Barua suggested and
scrutinized several articles to include in the review study. Begum provided feedback
while preparing the manuscript.
Summary. The research on XAI has emerged with various studies exploring the
philosophy and methodologies of explaining AI models. Despite this, there remains
a noticeable dearth of secondary studies focused on the application domains and
tasks, serving as an entry point for researchers from diverse fields to integrate
XAI methods. To fill this gap, this paper presents a systematic literature review
of recent developments in XAI methods and evaluation metrics across various
application domains and tasks. The analysis covers 137 articles identified from
prominent bibliographic databases, providing several key insights. The findings
reveal a predominant development of XAI methods for safety-critical domains like
healthcare, with comparatively less attention given to domains such as judiciary, road
safety, aviation, etc. Additionally, DL and ensemble models are more prevalent than
other AI or ML models. Visual explanations prove more acceptable to end-users,
while robust evaluation metrics for assessing explanation quality are still in the
developmental stage.
5.4 Paper C
Authors’ Contributions. Islam led the study and is the paper’s main author.
He developed the methodology, conducted the experiments, and prepared the original
manuscript. Ahmed and Begum supervised the study and provided feedback on the
manuscript preparation.
Summary. The motivation of the paper is to address the need for Explainable
Artificial Intelligence (XAI) in the context of mental workload classification using
EEG data. The authors propose a hybrid approach that utilizes MI to explain the
inference mechanism and decisions of AI or ML models. This approach involves
a convolutional autoencoder for feature extraction, a classification model, and the
use of MI to provide global and local interpretability. The study demonstrates the
application of this approach in classifying drivers’ mental workload using EEG data,
showing promising performance accuracy and the ability to explain the model’s
behaviour using MI and SHAP values. The paper highlights the potential of the
proposed approach in providing interpretable explanations for the model’s decisions
and the need for further research to explore other DL architectures and improve the
quality of explanations.
5.5 Paper D
Title. Interpretable Machine Learning for Modelling and Explaining Car Drivers’
Behaviour: An Exploratory Analysis on Heterogeneous Data (Islam et al., 2023).
Authors’ Contributions. As the main author of the paper, Islam led the study,
developed the methodology, conducted the experiments, and prepared the original
manuscript. Ahmed and Begum provided general supervision and feedback on the
paper writing.
Summary. The paper presents a study that explores the variation of drivers’
behaviour in a simulator and track driving to enhance simulator technologies, which
are widely used in the domain of RS. The study includes a comparative analysis of
car drivers’ behaviour in a simulator and track driving for different traffic situations.
The outcome of the comparative analysis identifies biases and differences in driving
behaviours between the two driving environments. Five different ML classifier models
(i.e., Gradient Boosted Decision Trees (GBDT), Logistic Regression, MLP, RF and
SVM) are developed to classify risk and hurry in drivers’ behaviour. The results
demonstrate that among the classifiers, GBDT performed best with a classification
accuracy of 98.62%. The study also develops explanation models based on additive
feature attribution (AFA) to explain the decisions made by the classifier models.
These explanations provide insights into the factors and features contributing to risky
or hurried driving behaviour, allowing for a better understanding of the underlying
causes. Lastly, the study proposes a system for drivers’ behaviour monitoring in
simulated driving. This system includes features, e.g., Global Positioning System
(GPS) plots, heatmaps, and event markers to visualize driving behaviour. It also
provides explanations for specific risky or hurried events, allowing for targeted
feedback and instruction to modify drivers’ behaviour and create a safer road
environment. Overall, this study contributes to the enhancement of simulator
technologies by identifying biases and differences in driving behaviour. It also
provides a framework for developing driver monitoring systems that can detect and
classify risky or hurried driving behaviour, as well as explain the underlying factors
contributing to these behaviours, thus allowing for targeted interventions and training
to improve RS.
5.6 Paper E
Authors’ Contributions. Islam led the study and is the main author of the paper.
He designed the study, performed implementation, analyzed the results, prepared the
illustrations, and wrote the whole paper. Weber guided in designing the study, result
analysis and presentation, and manuscript preparation. Ahmed and Begum provided
general supervision and feedback on the manuscript.
Summary. The literature on XAI has produced studies showing that explainable
methods for feature attribution produce inconsistent results. This inconsistency in
explanations makes the evaluation of XAI methods crucial, but the existing body
of literature on evaluation techniques is still immature, offering multiple proposed techniques while lacking a consensus on the best approach for each circumstance.
Moreover, there is a lack of widely accepted evaluation methods for explaining
the decisions of AI algorithms. This paper investigates an approach to creating
synthetic data that can be used to evaluate methods that explain the decisions of
AI algorithms. From a real-world dataset, the proposed approach describes how to
create synthetic data that preserves the patterns of the original data and enables
comprehensive evaluation of XAI methods. Particularly, the proposed approach is
described for the explainable methods that produce AFA to describe the contribution
of individual features in decision-making. The application of the proposed approach
is illustrated in predicting flight take-off time (TOT) delays. The results of primary
and sensitivity analysis show that the performances of the AFA methods align
with previous literature for regression tasks. Additionally, the additive form of
Case-based Reasoning (CBR), namely AddCBR, is derived. Evaluations in the
paper demonstrate that AddCBR serves as a suitable benchmark for evaluating AFA
methods. In its entirety, this paper contributes to the advancement of evaluation
techniques for XAI methods and provides insights into the performance of AFA
methods using the proposed synthetic data approach.
5.7 Paper F
Title. iXGB: Improving the Interpretability of XGBoost using Decision Rules and
Counterfactuals.
Status. Under review for publication at the 16th International Conference on Agents
and Artificial Intelligence (ICAART), 2024.
iXGB aims to improve the interpretability
of XGBoost by generating a set of rules from its internal structure and the original
data characteristics. The approach also includes generating counterfactuals to aid in
understanding the operational relevance of the rules. The paper presents experiments
on both real and benchmark datasets, demonstrating reasonable interpretability of
iXGB without using surrogate methods.
Chapter 6
Discussions, Conclusion and Future Works
This chapter discusses the findings in the context of the research questions,
summarises the study’s main findings, presents the limitations of this study,
and suggests areas for future research.
This doctoral research has been conducted with the aim of advancing the research for
Explainable Artificial Intelligence (XAI) and enhancing the transparency of Decision
Support Systems (DSS). During the study, several explainable models are developed
and evaluated, which are intended for different expert systems, in a broader sense,
DSS, for the selected domains of Road Safety (RS) and Air Traffic Flow Management
(ATFM) to further the developments of DSS. The following sections present the
discussions on the findings of the doctoral study, including the answers to the RQs
defined in Section 1.3, followed by the challenges and limitations of this study and the concluding statements with directions for future research.
6.1 Discussions
Transparency needs to be considered in the design of DSS, allowing users to tailor the level of transparency based on their
specific requirements. Achieving transparency by design involves integrating clear
explanations of AI models, thus enhancing trust in the intelligent systems (Miller,
2019). In this thesis, transparency is achieved through a series of exploratory research
studies, ensuring that the outcomes are not only effective but also comprehensible to
users in the domains of RS and ATFM. These domains are chosen for this doctoral
research since the studies have been supported by the research projects mentioned
in Chapter 3.
Initially, within the frameworks of the projects SIMUSAFE and BrainSafeDrive
from the domain of RS, the study has revolved around assessing drivers’ in-vehicle
state and behaviours. The initial research works address the need to classify
drivers’ mental workload (MWL) with a focus on applications in RS. It leveraged
physiological measures, particularly Electroencephalography (EEG) signals, which
are deemed suitable for MWL assessment. A key motivation is to automate the
feature extraction process from EEG signals, reducing reliance on manual methods.
Given the challenges of using EEG signals for MWL assessment, including the invasive
nature of recording equipment and less interpretable features, the study introduces
a novel methodology. This method involves creating a feature set by combining
EEG and vehicular signals, aiming to enhance the interpretability and efficiency of
assessing drivers’ MWL from the perspective of both features and decisions. The
project ARTIMATION dealt with the ATFM as a specific application domain in
Aviation. The study for ATFM explores the use of XAI methods in explaining flight
take-off time (TOT) delay to Air Traffic Controllers (ATCO) predicted by ML-based
predictive models. Here, three post-hoc explanation methods are employed to explain
the models’ predictions. Quantitative and user evaluations are conducted to assess
the acceptability and usability of the XAI methods in explaining the predictions to
ATCOs as suggested in the literature (Liu et al., 2021; Troncoso-García et al., 2023).
After conducting a series of studies involving the development of explainable
and interpretable models for RS and ATFM with different XAI methods from
the literature, various inconsistencies in the existing XAI methods and evaluation
approaches have been observed. These inconsistencies are also evident in the
literature, such as the limitations of using XAI methods for regression problems,
whereas they are designed for classification problems (Letzgus et al., 2022). By
addressing these inconsistencies, the research in XAI is advanced through the
later experiments in this doctoral study. Particularly, novel methods for Additive
Feature Attribution (AFA), rule and counterfactual explanations are proposed. The
performances of the proposed models are presented in the corresponding papers that
demonstrate better output than the existing methods from the literature. In addition,
a robust method of evaluating AFA methods is also put forward to address the need
for a plausible consensus of evaluation approach in XAI (J. Zhou et al., 2021).
As a whole, the doctoral thesis produced explainable models for applications
of two different safety-critical domains. In addition, novel XAI methods and an
evaluation approach are developed, which contribute to the core body of research in
XAI. All the outcomes of this thesis are centred on enhancing transparency in DSS,
which is a key requirement that AI systems should meet in order to be trustworthy
to end-users (AI HLEG - European Commission, 2019).
To summarise the outcome of this doctoral research, the answers to the asserted
RQs are discussed in this section with references to the corresponding sections and
included papers. Additionally, the issues raised while conducting the presented
research studies of this thesis are discussed, followed by the limitations of this doctoral
research study.
By design, the RQ spans several aspects of DSS, such as data transparency and
XAI methods for generating explanations and their evaluation. In order to address
each aspect, particular sub-RQs have been formulated that collectively correspond
to the RC2 and RC3 of this thesis. These contributions are centred on the
development of explainable models that are both domain-specific and independent.
The following subsections present the discussion on each of the sub-RQs of RQ2.
Dataset Acquisition. The datasets exploited in this doctoral study are from
two different domains, which were acquired within the framework of three different
research projects. For each of the datasets, the acquisition procedure was different,
as described in Section 3.1. In addition, the nature and content of the raw data
were also diverse across the domains. For the data from RS, the features include
vehicular signals, physiological signals and annotations from domain experts. On
the other hand, the data for ATFM contained different features related to aviation
operations. For all the datasets, domain-specific knowledge regarding the features
was required to preprocess and exploit the acquired data for the experimental studies
presented in this thesis. Moreover, additional efforts were required to adapt to
different data repositories like IBM Cloud1 and EUROCONTROL Aviation Data
for Research Repository2 and their corresponding data formats. These challenges
have been resolved through consultation with the respective domain experts from
the collaborating institutes in the research projects. It is worth mentioning that,
during the doctoral studies, a data collection experiment was planned within the
framework of the project SIMUSAFE, the protocol was designed, and ethical approval
was received from respective authorities. However, the experiment was postponed
due to the Coronavirus pandemic. Nevertheless, the study protocol is disseminated
in a co-authored article (Ahmed et al., 2021).
Choice of XAI Methods. In the research studies of this thesis work, the
explanations are generated using AFA methods, i.e., LIME and SHAP. However,
there remains no consensus on appropriate evaluation metrics to evaluate the quality
of feature attributions from the adopted methods. This is because of the fact that
there is a lack of ground truth or ideal attribution values to evaluate the AFA methods
(Y. Zhou et al., 2022). As a consequence, the literature contains the use of different
metrics or methods for evaluating AFA methods. The gold features used by Ribeiro et al. (2016), i.e., the most important features used by the prediction models, were the closest form of ground truth. To mitigate the issue of ideal evaluation
1 https://www.ibm.com/cloud
2 https://www.eurocontrol.int/dashboard/rnd-data-archive
criteria, this thesis work devised the evaluation method presented in Paper E, which
relies on a synthetic dataset that captures the behaviour of the data. Here, the
behaviour of the data is hypothesized as the ground truth or benchmark that should
be followed in feature attribution.
6.1.4 Limitations
This doctoral thesis work is a combination of several studies. These studies are not
mutually exclusive in their entirety and in the nature of their applications. However,
the studies have several limitations that are briefly stated below.
The study presented in Paper A1 resulted in the development of an AE to extract
features from the EEG signals. The working principle of the AE is confined to
extracting EEG features only without any provision to extract features from other
physiological signals such as electrocardiography or galvanic skin response signals.
Moreover, the dataset acquired for the study is suitable for classification tasks by the
experimental design. Thus, the dataset only facilitated the evaluation of the features
extracted from AE on classification tasks.
Different ML algorithms were invoked to develop models for quantifying and
classifying drivers’ MWL and event classification in Paper A2. These models were
used to assess the effectiveness of the developed hybrid feature set by comparing their performances only against the other models developed within the study. The absence of a comparison with external baseline models remains a limitation of this study.
Explainable models are developed for drivers’ MWL assessment and driving
behaviour monitoring, which are presented in Papers C and D, respectively. The
explanations generated from the explainable models are quantitatively evaluated only
because of the unavailability of appropriate users of the developed systems. However,
it is suggested in the literature that user evaluation provides valuable insights into
evaluating explanation methods through factors such as expertise, understanding,
and personal preferences (B. Kim et al., 2018; Nauta et al., 2023). The limitations of
the studies developing explainable models for RS concern the lack of qualitative evaluation of the models.
Paper E presents AddCBR, which is created as a model functionally equivalent
to XGBoost. During the process, the feature importance values from XGBoost are
used as the weights for the CBR model before transforming it to the additive form.
In this step, XGBoost can be replaced by any other model that produces similar
importance values for the features. However, AddCBR cannot produce feature attributions for the decisions of models that train on an abstract representation
of features (e.g., NN). This remains a limitation of this thesis work.
The exploratory research studies presented in this thesis work advance the XAI
research by developing explainable applications for the DSS from two safety-critical
domains, i.e., RS and ATFM. Besides, novel methods of explaining AI models’
decisions are proposed, including a robust approach for evaluating the AFA methods.
In addition, for the domain of RS, reducing the resource usage for monitoring
drivers while driving and extracting human-understandable features are noteworthy
outcomes of this doctoral research. The findings of the presented studies highlight
the potential of XAI methods in creating transparent DSS for different safety-critical
domains other than RS and ATFM. Though the transparency of a DSS is subjective and depends on the end-users, it can be enhanced from the algorithmic point of view through
theoretically grounded evaluation approaches, which is demonstrated in this thesis
work.
This doctoral thesis is comprised of several studies from different perspectives
with a common goal of advancing the XAI research. Hence, the recommended future
works also revolve around the advancement of the developed XAI methods and
their evaluation approaches, which are outlined below.
• In Paper E, synthetic data is generated to capture the intrinsic behaviour of
the data, which is used to evaluate the AFA methods. During the process of
synthetic data generation, different behaviours in the data are identified by
the density-based clustering method. It would be interesting to investigate
other unsupervised methods (e.g., AE) or generative modelling methods (e.g.,
Generative Adversarial Networks) for identifying the different behaviours in the
data and generating the synthetic data.
• One of the limitations of this thesis concerns the initial feature weights for
creating AddCBR. In the reported study, AddCBR receives the feature weights
from a tree-based model that restricts its incorporation with data models of
different working principles. Potential future research could be to formulate
methods of extracting the initial feature weights from models other than the
tree-based models, both singular and ensembles.
• In this thesis, two different XAI methods are proposed: AddCBR for generating
AFA from the predictions of tree-based models and iXGB for generating rules
and counterfactuals from predictions of XGBoost. The potential of these
methods is demonstrated for regression tasks. Yet, these models need to be
implemented and evaluated for binary and multi-class classification tasks.
• Both the methods, AddCBR and iXGB, are functionally evaluated in this thesis.
Besides, it is established that their performances in explanation generation
are better than those of similar methods. In future works, formal analyses of
the designs and performances of the proposed methods would establish their
completeness.
Bibliography
Liu, Y., Khandagale, S., White, C., & Neiswanger, W. (2021).
Synthetic Benchmarks for Scientific Research in Explainable Machine
Learning. In J. Vanschoren & S. Yeung (Eds.), Proceedings of the Neural
Information Processing Systems - Track on Datasets and Benchmarks
(NeurIPS Datasets and Benchmarks).
Loyola-Gonzalez, O. (2019). Black-Box vs. White-Box: Understanding Their
Advantages and Weaknesses From a Practical Point of View. IEEE Access,
7, 154096–154113.
Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model
Predictions. Proceedings of the 31st International Conference on Neural
Information Processing Systems (NeurIPS), 4768–4777.
Luxburg, U. V., & Schölkopf, B. (2011). Statistical Learning Theory: Models,
Concepts, and Results. In Handbook of the History of Logic (pp. 651–706).
Elsevier.
Mainali, M., & Weber, R. O. (2023). What’s meant by Explainable Model: A
Scoping Review. Proceedings of the Workshop on XAI co-located with the
32nd International Joint Conference on Artificial Intelligence (IJCAI).
Man, X., & Chan, E. P. (2021). The Best Way to Select Features? Comparing MDA,
LIME, and SHAP. The Journal of Financial Data Science, 3 (1), 127–139.
Martens, D., & Provost, F. (2014). Explaining Data-Driven Document Classifications.
MIS Quarterly, 38 (1), 73–99.
Mase, J. M., Agrawal, U., Pekaslan, D., Mesgarpour, M., Chapman, P., Torres, M. T.,
& Figueredo, G. P. (2020). Capturing Uncertainty in Heavy Goods Vehicles
Driving Behaviour. 2020 IEEE 23rd International Conference on Intelligent
Transportation Systems (ITSC), 1–7.
Mercado, J. E., Rupp, M. A., Chen, J. Y. C., Barnes, M. J., Barber, D., & Procci,
K. (2016). Intelligent Agent Transparency in Human–Agent Teaming for
Multi-UxV Management. Human Factors, 58 (3), 401–415.
Miller, T. (2019). Explanation in Artificial Intelligence: Insights from the Social
Sciences. Artificial Intelligence, 267, 1–38.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
Mothilal, R. K., Sharma, A., & Tan, C. (2020). Explaining Machine Learning
Classifiers through Diverse Counterfactual Explanations. Proceedings of the
2020 Conference on Fairness, Accountability, and Transparency (FAT*),
607–617.
Mueller, S. T., Hoffman, R. R., Clancey, W. J., Emery, A. K., & Klein, G. (2019).
Explanation in Human-AI Systems: A Literature Meta-Review Synopsis of
Key Ideas and Publications and Bibliography for Explainable AI (tech. rep.).
Defense Advanced Research Projects Agency (DARPA). Arlington, VA,
USA.
Nauta, M., Trienes, J., Pathak, S., Nguyen, E., Peters, M., Schmitt, Y., Schlötterer,
J., Van Keulen, M., & Seifert, C. (2023). From Anecdotal Evidence to
Quantitative Evaluation Methods: A Systematic Review on Evaluating
Explainable AI. ACM Computing Surveys, 55 (13s), 1–42.
Negnevitsky, M. (2004). Artificial Intelligence: A Guide to Intelligent Systems
(2nd ed.). Addison-Wesley.
Solovey, E. T., Zec, M., Garcia Perez, E. A., Reimer, B., & Mehler, B. (2014).
Classifying Driver Workload using Physiological and Driving Performance
Data: Two Field Studies. Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems (CHI), 4057–4066.
Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic Attribution for Deep
Networks. Proceedings of the 34th International Conference on Machine
Learning (ICML), 70, 3319–3328.
Tintarev, N., Rostami, S., & Smyth, B. (2018). Knowing the Unknown: Visualising
Consumption Blind-spots in Recommender Systems. Proceedings of the 33rd
Annual ACM Symposium on Applied Computing (SAC), 1396–1399.
Todd, P., & Benbasat, I. (1992). The Use of Information in Decision Making: An
Experimental Investigation of the Impact of Computer-Based Decision Aids.
MIS Quarterly, 16 (3), 373.
Tran, T.-N., Pham, D.-T., Alam, S., & Duong, V. (2020). Taxi-speed Prediction
by Spatio-temporal Graph-based Trajectory Representation and Its
Application. Proceedings of International Conference for Research in Air
Transportation (ICART).
Troncoso-García, A. R., Martínez-Ballesteros, M., Martínez-Álvarez, F., & Troncoso,
A. (2023). A New Approach based on Association Rules to Add
Explainability to Time Series Forecasting Models. Information Fusion, 94,
169–180.
Tzallas, A., Tsipouras, M., & Fotiadis, D. (2009). Epileptic Seizure Detection in
EEGs Using Time-Frequency Analysis. IEEE Transactions on Information
Technology in Biomedicine, 13 (5), 703–710.
van der Waa, J., Robeer, M., van Diggelen, J., Brinkhuis, M., & Neerincx, M. (2018).
Contrastive Explanations with Local Foil Trees. Proceedings of the Workshop
on Human Interpretability in Machine Learning (WHI) co-located with the
35th International Conference on Machine Learning (ICML).
Vapnik, V. (1991). Principles of Risk Minimization for Learning Theory. In J. Moody,
S. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information
Processing Systems. Morgan-Kaufmann.
Vilone, G., & Longo, L. (2020). Explainable Artificial Intelligence: A Systematic
Review. ArXiv, (arXiv:2006.00093v4 [cs.AI]).
Wachter, S., Mittelstadt, B., & Russell, C. (2018). Counterfactual Explanations
without Opening the Black Box: Automated Decisions and the GDPR.
Harvard Journal of Law & Technology, 31 (2), 841–887.
Wei, Z., Wu, C., Wang, X., Supratak, A., Wang, P., & Guo, Y. (2018). Using Support
Vector Machine on EEG for Advertisement Impact Assessment. Frontiers in
Neuroscience, 12.
World Medical Association. (2001). World Medical Association Declaration of
Helsinki: Ethical Principles for Medical Research Involving Human Subjects.
Bulletin of the World Health Organization, 79 (4), 373.
Xu, F., Uszkoreit, H., Du, Y., Fan, W., Zhao, D., & Zhu, J. (2019). Explainable AI:
A Brief Survey on History, Research Areas, Approaches and Challenges. In
J. Tang, M.-Y. Kan, D. Zhao, S. Li, & H. Zan (Eds.), Natural Language
Processing and Chinese Computing (pp. 563–574). Springer International
Publishing.
Yang, F., Du, M., & Hu, X. (2019). Evaluating Explanation Without Ground Truth
in Interpretable Machine Learning. ArXiv, (arXiv:1907.06831v2 [cs.LG]).
Yang, M., & Kim, B. (2019). Benchmarking Attribution Methods with Relative
Feature Importance. ArXiv, (arXiv:1907.09701 [cs.LG]).
Yang, W., Li, J., Xiong, C., & Hoi, S. C. H. (2022). MACE: An
Efficient Model-Agnostic Framework for Counterfactual Explanation. ArXiv,
(arXiv:2205.15540v1 [cs.AI]).
Yeaton, W. H., Langenbrunner, J. C., Smyth, J. M., & Wortman, P. M.
(1995). Exploratory Research Synthesis: Methodological Considerations
for Addressing Limitations in Data Quality. Evaluation & the Health
Professions, 18 (3), 283–303.
Young, M., Varpio, L., Uijtdehaage, S., & Paradis, E. (2020). The Spectrum
of Inductive and Deductive Research Approaches Using Quantitative and
Qualitative Data. Academic Medicine, 95 (7), 1122–1122.
Yu, B., Guo, Z., Asian, S., Wang, H., & Chen, G. (2019). Flight Delay Prediction
for Commercial Air Transport: A Deep Learning Approach. Transportation
Research Part E: Logistics and Transportation Review, 125, 203–221.
Zar, J. H. (1972). Significance Testing of the Spearman Rank Correlation Coefficient.
Journal of the American Statistical Association, 67 (339), 578–580.
Zhang, Z., & Jung, C. (2021). GBDT-MO: Gradient-Boosted Decision Trees for
Multiple Outputs. IEEE Transactions on Neural Networks and Learning
Systems, 32 (7), 3156–3167.
Zhou, F., Alsaid, A., Blommer, M., Curry, R., Swaminathan, R., Kochhar,
D., Talamonti, W., & Tijerina, L. (2022). Predicting Driver Fatigue in
Monotonous Automated Driving with Explanation using GPBoost and
SHAP. International Journal of Human–Computer Interaction, 38 (8),
719–729.
Zhou, J., Gandomi, A. H., Chen, F., & Holzinger, A. (2021). Evaluating the Quality
of Machine Learning Explanations: A Survey on Methods and Metrics.
Electronics, 10 (5), 593.
Zhou, Y., Booth, S., Ribeiro, M. T., & Shah, J. (2022). Do Feature Attribution
Methods Correctly Attribute Features? Proceedings of the 36th AAAI
Conference on Artificial Intelligence, 36 (9), 9623–9633.
Part II
Included Papers
Paper A1
Abstract
In the pursuit of reducing traffic accidents, drivers’ mental workload (MWL)
has been considered as one of the vital aspects. To measure MWL in
different driving situations, Electroencephalography (EEG) of the drivers has
been studied intensely. However, in the literature, mostly, manual analytic
methods are applied to extract and select features from the EEG signals
to quantify drivers’ MWL. Nevertheless, the amount of time and effort required to perform the prevailing feature extraction techniques motivates the need for automated feature extraction techniques. This work investigates a deep learning (DL) algorithm to extract and select features from the EEG
signals during naturalistic driving situations. Here, to compare the DL-based and traditional feature extraction techniques, a number of classifiers
have been deployed. Results have shown that the highest value of area
under the curve of the receiver operating characteristic (AUC-ROC) is
0.94, achieved using the features extracted by convolutional neural network
autoencoder (CNN-AE) and support vector machine. In contrast, using the
features extracted by the traditional method, the highest value of AUC-ROC
is 0.78 with the multi-layer perceptron. Thus, the outcome of this study
shows that the automatic feature extraction techniques based on CNN-AE
can outperform the manual techniques in terms of classification accuracy.
† © Springer Nature Switzerland AG 2019. Reprinted, with permission, from Islam, M. R.,
Barua, S., Ahmed, M. U., Begum, S., & Di Flumeri, G. (2019). Deep Learning for Automatic
EEG Feature Extraction: An Application in Drivers’ Mental Workload Classification. In L. Longo
& M. C. Leva (Eds.), Human Mental Workload: Models and Applications. H-WORKLOAD
2019. Communications in Computer and Information Science (pp. 121–135). Springer Nature
Switzerland.
A1.1 Introduction
Drivers’ mental workload (MWL) plays a crucial role in driving performance. Due to excessive MWL, drivers undergo a complex state of fatigue which manifests as a lack of alertness and reduces performance (Kar et al., 2010). Consequently, drivers are prone to committing more mistakes due to increased MWL. It has been revealed that human error is the prime cause of around 72% of road accidents per year (Thomas
et al., 2013). So, increased MWL of drivers during driving can produce errors leading
to fatal accidents. Driving is a complex and dynamic activity involving secondary
tasks, i.e., simultaneous cognitive, visual and spatial tasks. Diverse secondary tasks
along with natural driving in addition to different road environments increase the
MWL of drivers which lead to errors in traffic situations (Kim et al., 2018). The
alarming number of traffic accidents due to increased MWL underscores the need for determining drivers’ MWL efficiently. Several research works have identified
mechanisms to measure drivers’ MWL while driving both in simulated and real
environments (Brookhuis & de Waard, 2010; Kar et al., 2010; Almahasneh et al.,
2015). Methods of measuring MWL can be clustered into three main classes: i)
subjective measures, i.e., NASA Task Load Index (NASA-TLX), workload profile
(WP) etc., ii) task performance measures, e.g., time to complete a task, reaction time
to secondary task etc. and iii) physiological measures, e.g., electroencephalography
(EEG), heart rate measures etc. (Moustafa et al., 2017). The latter, with respect to traditional subjective measures, are intrinsically objective and can be gathered along with the task without requiring any additional action from the user. Also, with respect to performance measures, physiological measures do not require secondary tasks and are generally able to predict a mental impairment, whereas performance generally degrades only when the user is already overloaded (Begum & Barua, 2013; Aricò et al., 2016; Aricò et al., 2017). Due to the vast availability of measuring technology, portability and the capability of indicating neural activation clearly, the main focus of this work is on physiological measures, specifically EEG. With the increase of data storage and computation power, data-driven machine learning (ML) techniques have become popular means of quantifying MWL from EEG
signals.
Relevant features extracted from the EEG signals are a sine qua non for quantifying MWL. Currently, feature extraction is done using theory-driven manual analytic methods that demand considerable time and effort (Tzallas et al., 2009; Ahmad
et al., 2014). The proposed work aims at exploring a novel deep learning model
for automated feature extraction from EEG signals to reduce the time, effort
and complexity. From the literature study, it has been found that several ML
techniques have been applied to extract features from EEG automatically, but a proper comparative study on traditional and automatic feature extraction methods has not been put forward. In this paper, a deep learning model, the convolutional
neural network autoencoder (CNN-AE) is proposed for automatic feature extraction.
These automated features are evaluated with several classification algorithms and
compared with manual feature extraction technique for comparative analysis and
feature optimisation.
The rest of the paper is organised as follows: the background of the research domain and several related works are described in Section A1.2. Section A1.3 contains a detailed description of the experimental setup, data collection, analysis, feature extraction and classification techniques. Results and discussions are provided in Sections A1.4 and A1.5, respectively. The conclusion, limitations and future directions of this work are discussed in Section A1.6.
deep convolutional neural network (CNN) for unsupervised feature learning from
EEG signals after applying data normalisation for preprocessing. To assess the
performance of their proposed model, several classification algorithms were used
to classify epilepsy patients. In several works, authors used stacked denoising
autoencoder (SDAE) (Yin & Zhang, 2016), long short-term memory (LSTM)
(Manawadu et al., 2018) and deep belief network (DBN) (Li et al., 2015) for feature
extraction after applying PSD for preprocessing. Guo et al. (2011) extracted features
by deploying a genetic algorithm for classifying epilepsy with a k-NN classifier. In this
approach, discrete wavelet transformation (DWT) was used for preprocessing of
raw EEG signals. Saha et al. (2018) investigated two different DL models, SDAE
and LSTM, for extracting features from EEG signals without any preprocessing.
Afterwards, MLP was used to classify cognitive load on the participants who were
asked to perform a learning task. Both Ayata et al. (2017) and Almogbel et al. (2018) used a CNN autoencoder (CNN-AE) for extracting features from EEG signals for classifying arousal and MWL among participants.
Evidently, feature extraction from EEG signals using CNN-AE has been a popular technique among researchers for classification tasks in the epilepsy and MWL domains. Moreover, several classification algorithms were further used to measure the effectiveness of the automatically extracted features. However, to our knowledge, none of these works presented a comparative study of manual feature extraction and automatic feature extraction using DL techniques to compare their performance in workload classification, particularly for driving situations.
Figure A1.1: The experimental circuit is about 2.5 kilometres long along Bologna
roads. The red and yellow line along the route indicates Hard and Easy segments of
the road, respectively. The green arrow in the bottom-right corner shows the direction
of driving from the starting and finishing point. © Springer Nature Switzerland AG
2019.
The HOUR factor comprised two conditions, Normal and Rush. This factor had been designed following the General Plan of Urban Traffic of Bologna, Italy. Table A1.1 reports the traffic flow intensity considered in designing the two experimental conditions of this study.
Table A1.1: Traffic flow intensity in the experimental area during a day retrieved
from General Plan of Urban Traffic of Bologna, Italy. © Springer Nature Switzerland
AG 2019.
Transits     Total 14h (6 ÷ 20)   Rush Hour Morning (1230–1330)   Rush Hour Afternoon (1630–1730)   Normal Hour 12h
Total        19385                2024                            2066                              15295
Frequency    –                    2024                            2066                              12746
EEG signals were recorded from electrodes placed on the scalp according to the 10–20 International System. The sampling frequency
was 256 Hz for recording EEG signals. All the electrodes were referenced to both the
earlobes and grounded to the Cz site. Impedance was kept below 20 kΩ. During the
experiment, no signal conditioning was done; all the EEG signal processing was done
offline. Events were recorded along with EEG signals to associate specific signals to
different road and hour conditions.
Raw EEG signals were cropped referencing the events recorded; three laps for
both Normal & Rush hours including Easy & Hard conditions. Furthermore, two
ROAD-HOUR driving situations, Easy-Normal and Hard-Rush, were selected for the
classification of MWL since literature suggests that these conditions demand low
and high MWL respectively (Di Flumeri et al., 2018). Data of all the laps driven
by the participants in the Easy-Normal and the Hard-Rush conditions were used for
further analysis. EEG signals were sliced into 2s (epoch length) segments by sliding
window technique with a stride of 0.125s, keeping an overlap of 1.875s between two
continuous epochs. The windowing technique was performed to obtain a higher
number of observations in comparison with the number of variables and respecting
the condition of stationarity of the EEG signals (Elul, 1969). Specific procedures of
EEGLAB toolbox (Delorme & Makeig, 2004) have been used for slicing the recorded
EEG signals. To remove different artefacts, i.e., ocular and muscle movements etc.
from the raw EEG signals, the ARTE algorithm by Barua et al. (2018) has been
used.
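The sliding-window segmentation described above can be reproduced outside EEGLAB with a few lines of NumPy. The sketch below is a minimal illustration under the stated parameters (256 Hz sampling, 2 s epochs, 0.125 s stride); the array shapes and the helper name epoch_sliding_window are assumptions, and the actual slicing in the paper was done with EEGLAB procedures.

```python
import numpy as np

def epoch_sliding_window(signal, fs=256, epoch_s=2.0, stride_s=0.125):
    """Slice a (channels x samples) array into overlapping epochs.

    With fs=256 Hz, epoch_s=2.0 and stride_s=0.125, consecutive epochs
    overlap by 1.875 s, as described in the text.
    """
    epoch_len = int(epoch_s * fs)   # 512 samples per epoch
    stride = int(stride_s * fs)     # 32 samples between epoch starts
    n_samples = signal.shape[-1]
    starts = range(0, n_samples - epoch_len + 1, stride)
    return np.stack([signal[..., s:s + epoch_len] for s in starts])

# Example with a synthetic 15-channel, 60 s recording.
eeg = np.random.randn(15, 256 * 60)
epochs = epoch_sliding_window(eeg)
print(epochs.shape)  # (465, 15, 512) for this example
```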
Welch’s method (Solomon, 1991) with a Blackman-Harris window function was applied
to epochs of the same length (2 s, 0.5 Hz frequency resolution). In particular, only
the theta band (5–8 Hz) over the EEG frontal channels and the alpha band (8–11 Hz)
over the EEG parietal channels were considered as variables for the mental workload
evaluation (Aricò et al., 2017). To define the EEG frequency bands of interest,
the IAF values were estimated with the algorithm developed by Corcoran et al. (2018).
Figure A1.2 illustrates the final feature vector generation for each of the observations
following the aforementioned sequence of steps.
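For reference, Welch's method with a Blackman-Harris window and a 0.5 Hz resolution on 2 s epochs maps directly onto scipy.signal.welch. The sketch below uses a random array in place of a real epoch; the theta limits shown are the fixed values quoted above, whereas the actual bands are defined relative to the estimated IAF.

```python
import numpy as np
from scipy.signal import welch

fs = 256                 # sampling frequency (Hz)
nperseg = 2 * fs         # 2 s segments -> 0.5 Hz frequency resolution
epoch = np.random.randn(nperseg)   # stand-in for one 2 s EEG epoch

freqs, psd = welch(epoch, fs=fs, window='blackmanharris', nperseg=nperseg)

# Band power over an illustrative theta band (5-8 Hz).
theta = (freqs >= 5) & (freqs <= 8)
theta_power = np.trapz(psd[theta], freqs[theta])
print(theta_power)
```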
Figure A1.2: Steps in the traditional feature extraction technique. © Springer Nature
Switzerland AG 2019.
Figure A1.3: Network architecture of the CNN-AE for feature extraction. © Springer
Nature Switzerland AG 2019.
Deep Learning Approach. The CNN-AE architecture used for automatic feature
extraction is shown in Figure A1.3. The whole network is divided into two parts, i)
encoder and ii) decoder. The encoder comprises a number of convolutional layers,
each associated with a pooling layer, and finds deep hidden features in the original signal. On the
other hand, the decoder uses several deconvolutional layers to reconstruct the signal from
these features. The quality of the signal reconstructed by the decoder is used to assess
the performance of the encoder. On the basis of this compression and reconstruction,
the whole model is trained. The encoder developed in this study consists of four
convolutional layers and four max-pooling layers. The decoder is designed in the inverse
order of the encoder; it contains five convolutional layers and four upsampling layers
facilitating depooling. Zero padding, batch normalisation and the ReLU activation
function were used in each of the layers. The developed CNN-AE utilised
RMSprop optimisation with a learning rate of 0.002 and binary cross-entropy as the
loss function. After a successful learning procedure, the CNN-AE extracted 284 features
from the experimental EEG signals.
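As a concrete illustration of such an architecture, the Keras sketch below mirrors the description above: four convolution and max-pooling stages in the encoder, five convolutions with four upsampling stages in the decoder, zero padding ('same'), batch normalisation and ReLU throughout, RMSprop with a learning rate of 0.002 and binary cross-entropy. The input shape, filter counts and kernel sizes are illustrative assumptions and do not reproduce the exact configuration that yielded 284 features.

```python
from tensorflow import keras
from tensorflow.keras import layers

def conv_block(x, filters):
    # Convolution with zero padding, batch normalisation and ReLU activation.
    x = layers.Conv1D(filters, kernel_size=3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation('relu')(x)

inp = layers.Input(shape=(512, 1))     # one 2 s epoch at 256 Hz (assumed input shape)

# Encoder: four convolutional layers, each followed by max pooling.
x = inp
for f in (16, 32, 64, 128):            # filter counts are assumptions
    x = conv_block(x, f)
    x = layers.MaxPooling1D(pool_size=2)(x)

# Decoder: four convolution + upsampling stages plus a final convolution.
for f in (128, 64, 32, 16):
    x = conv_block(x, f)
    x = layers.UpSampling1D(size=2)(x)
out = layers.Conv1D(1, kernel_size=3, padding='same', activation='sigmoid')(x)

autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.002),
                    loss='binary_crossentropy')
autoencoder.summary()
```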
Before classifying MWL, to further reduce the dimension of the feature set, feature
importance was calculated using an RF classifier. Different numbers of features were
selected from the 284 features, depending on different threshold values, and deployed for
classifying MWL with an SVM classifier on the training data set. It was observed that
the accuracy varied accordingly. Finally, by imposing a threshold of 0.003 on feature
importance, 124 relevant features were finalised, which reduced the feature set by more
than half while increasing accuracy. For both classifiers, the parameters given in
Table A1.3 were used. Figure A1.4 illustrates the change in accuracy for different
threshold values of feature importance used to select features for classification.
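A minimal scikit-learn sketch of this selection step is shown below; the data are synthetic stand-ins, and only the importance threshold of 0.003 is taken from the text (the number of retained features will differ on other data).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 284))    # stand-in for the 284 CNN-AE features
y = rng.integers(0, 2, size=1000)   # stand-in for low/high MWL labels

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

threshold = 0.003                   # feature-importance threshold used in the paper
selected = np.flatnonzero(rf.feature_importances_ >= threshold)
X_reduced = X[:, selected]
print(f"kept {selected.size} of {X.shape[1]} features")
```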
All the observations with relevant features from the EEG signals were divided into
training and testing sets considering 80% and 20% of the data, respectively. The
training set was used to train the model, and the testing set was used to validate the
accuracy of MWL classification. Several common classifiers, listed in Table A1.3, were
deployed to verify the effectiveness of the features obtained by the traditional method
and by the CNN-AE. For measuring classification performance, the average overall accuracy,
balanced classification rate (BCR, or balanced accuracy) and F1 score were calculated
for each classifier and each set of extracted features. Tables A1.4 and
A1.5 contain the performance measures of classification with traditionally
extracted features and CNN-AE-extracted features, respectively. It was observed
that the features extracted by the CNN-AE produced better performance measures for all
the classifiers. In particular, SVM classified MWL with the highest overall accuracy
of 87%.
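The evaluation protocol can be sketched as below, assuming synthetic stand-in data: an 80/20 split followed by the reported measures (overall accuracy, balanced accuracy as BCR, and F1 score) computed with scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 124))    # stand-in for the 124 selected features
y = rng.integers(0, 2, size=1000)   # stand-in for low/high MWL labels

# 80% training / 20% testing split of the observations, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

y_pred = SVC().fit(X_tr, y_tr).predict(X_te)
print("accuracy          :", accuracy_score(y_te, y_pred))
print("balanced accuracy :", balanced_accuracy_score(y_te, y_pred))
print("F1 score          :", f1_score(y_te, y_pred))
```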
Table A1.4: Average performance measures of classifiers applied on traditionally
extracted features. © Springer Nature Switzerland AG 2019.
on the cross validations are illustrated in Figures A1.6 and A1.7, where the SVM
classifier has the highest AUC in both. For 10-fold cross validation, all the
observations were divided into 10 segments. Afterwards, for each iteration, one
segment was used for testing a model built on the other segments as the training set. In
the leave-one-participant-out cross validation process, for each participant
of the experiment, the observations from that participant were used for testing
the model built on the observations from the other participants, considered as training
data. For both cross validations, the AUC values for CNN-AE-extracted features in
classification are notably higher than the values for traditionally extracted features.
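The leave-one-participant-out procedure corresponds to grouped cross-validation, for which scikit-learn provides LeaveOneGroupOut. The sketch below assumes nine participants and synthetic data, purely to show the mechanics of holding out one participant per fold.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(900, 124))          # stand-in feature matrix
y = rng.integers(0, 2, size=900)         # stand-in MWL labels
groups = np.repeat(np.arange(9), 100)    # participant id per observation (9 drivers assumed)

# Each fold tests on all observations of one held-out participant.
scores = cross_val_score(SVC(), X, y, groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())
```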
Figure A1.6: AUC-ROC curves for different classifiers with features extracted by
traditional methods and CNN-AE where models were trained using 10-fold cross
validation. © Springer Nature Switzerland AG 2019.
Figure A1.7: AUC-ROC curves for different classifiers with features extracted by
traditional methods and CNN-AE where models were trained using leave-one-out
(participant) cross validation. © Springer Nature Switzerland AG 2019.
A1.5 Discussion
In this study, traditional and CNN-AE-based EEG feature extraction methods were
comparatively investigated using four well-established classifiers: SVM, kNN, RF and
MLP. Of the two feature extraction techniques, the CNN-AE enabled the
classifiers to achieve higher classification accuracy and better values of the other performance measures.
Initially, the number of features extracted by the CNN-AE was substantially higher
than the number of features extracted through traditional methods, but with the feature selection
mechanism the feature set was approximately reduced to half, resulting in improved
accuracy measures for all classifiers. From the different performance measures
presented in Section A1.4, it was shown that SVM achieves higher accuracy
in classifying MWL from EEG signals irrespective of the feature extraction technique.
In the case of the classifier models for MWL classification used in related works, many
factors affect the performance of the model. Generally, if there is a clear
correlation between the characteristics of the data and the class labels, the deployed classifier
achieves higher prediction accuracy. However, in the case of MWL classification for drivers
driving in real life or in a simulator, the probability of noise being recorded with
the EEG signals is quite high due to eye movements, power-line signals, miscellaneous
interference, etc. In practice, these noises are termed artefacts. In traditional feature
extraction methods, removing these artefacts from the data, along with handling different inter-
and intra-individual variability, requires substantial manual effort and processing. Owing
to the characteristics of deep learning, its layers can discover hidden features in
the data that are responsible for the assigned labels. From the results of this study it can be
established that a CNN-AE, or a similar deep learning mechanism, can produce a feature set
from EEG signals that is equivalent to or better than a manually extracted feature set,
with less effort, setting aside the preprocessing and artefact handling tasks.
Primarily, the proposed CNN-AE produced an extensive set of features. An intuitive
investigation of feature selection with an RF classifier, imposing a threshold
on feature importance, produced a considerably shorter feature vector with higher classification accuracy.
A1.6 Conclusion
This paper presents a new hybrid approach for automatic feature extraction from
EEG signals, demonstrated on MWL classification. The main contribution of
this paper is threefold: i) a CNN is used to extract features
automatically from artefact-handled EEG signals, ii) RF is used for feature selection
and iii) several machine learning algorithms are used to classify drivers’ mental
workload on CNN-based feature sets. This new hybrid approach is compared with the
traditional feature extraction approach considering four machine learning classifiers,
i.e., SVM, kNN, RF and MLP. According to the outcomes of both the 10-fold and the
leave-one-participant-out cross validation, SVM outperforms the other classifiers with
CNN-AE-extracted features. One advantage of the CNN-AE for feature extraction is that
it works directly on the artefact-handled data sets, i.e., additional signal processing,
individual feature extraction, etc. are not needed, thus reducing the time spent on manual
work. More experimental work with large and heterogeneous data sets is planned as
future work to improve the performance of the proposed method and to extract features
directly from raw EEG signals. Moreover, classifying MWL in real time using the
proposed approach and suggesting external actions to mitigate road casualties is the
final goal of the planned research.
Bibliography
Ahmad, R. F., Malik, A. S., Kamel, N., Amin, H., Zafar, R., Qayyum, A., &
Reza, F. (2014). Discriminating the Different Human Brain States with
EEG Signals using Fractal Dimension- A Nonlinear Approach. 2014 IEEE
International Conference on Smart Instrumentation, Measurement and
Applications (ICSIMA), 1–5.
Almahasneh, H., Kamel, N., Walter, N., & Malik, A. S. (2015). EEG-based Brain
Functional Connectivity during Distracted Driving. 2015 IEEE International
Conference on Signal and Image Processing Applications (ICSIPA), 274–277.
Almogbel, M. A., Dang, A. H., & Kameyama, W. (2018). EEG-Signals Based
Cognitive Workload Detection of Vehicle Driver using Deep Learning.
2018 20th International Conference on Advanced Communication Technology
(ICACT), 256–259.
Aricò, P., Borghini, G., Di Flumeri, G., Colosimo, A., Pozzi, S., & Babiloni, F. (2016).
A Passive Brain–Computer Interface Application for the Mental Workload
Assessment on Professional Air Traffic Controllers during Realistic Air Traffic
Control Tasks. In D. Coyle (Ed.), Progress in Brain Research (pp. 295–328).
Elsevier.
Aricò, P., Borghini, G., Di Flumeri, G., Sciaraffa, N., Colosimo, A., &
Babiloni, F. (2017). Passive BCI in Operational Environments: Insights,
Recent Advances, and Future Trends. IEEE Transactions on Biomedical
Engineering, 64 (7), 1431–1436.
Ayata, D., Yaslan, Y., & Kamasak, M. (2017). Multi Channel Brain EEG Signals
based Emotional Arousal Classification with Unsupervised Feature Learning
using Autoencoders. 2017 25th Signal Processing and Communications
Applications Conference (SIU), 1–4.
Barua, S. (2019). Multivariate Data Analytics to Identify Driver’s Sleepiness,
Cognitive Load, and Stress (PhD Thesis). Mälardalen University.
Barua, S., Ahmed, M. U., Ahlstrom, C., Begum, S., & Funk, P. (2018). Automated
EEG Artifact Handling With Application in Driver Monitoring. IEEE
Journal of Biomedical and Health Informatics, 22 (5), 1350–1361.
Barua, S., Ahmed, M. U., & Begum, S. (2017). Classifying Drivers’ Cognitive Load
Using EEG Signals. In B. Blobel & W. Goossen (Eds.), Proceedings of the
14th International Conference on Wearable Micro and Nano Technologies for
Personalized Health (pHealth) (pp. 99–106). IOS Press.
Begum, S., & Barua, S. (2013). EEG Sensor Based Classification for Assessing
Psychological Stress. In B. Blobel, P. Pharow, & L. Parv (Eds.), Proceedings
of the 10th International Conference on Wearable Micro and Nano
Technologies for Personalized Health (pHealth) (pp. 83–88). IOS Press.
Begum, S., Barua, S., & Ahmed, M. U. (2017). In-Vehicle Stress Monitoring
Based on EEG Signal. International Journal of Engineering Research and
Applications, 07 (07), 55–71.
Brookhuis, K. A., & de Waard, D. (2010). Monitoring Drivers’ Mental Workload
in Driving Simulators using Physiological Measures. Accident Analysis &
Prevention, 42 (3), 898–903.
Charles, R. L., & Nixon, J. (2019). Measuring Mental Workload using Physiological
Measures: A Systematic Review. Applied Ergonomics, 74, 221–232.
Paxion, J., Galy, E., & Berthelon, C. (2014). Mental Workload and Driving. Frontiers
in Psychology, 5.
Saha, A., Minz, V., Bonela, S., Sreeja, S. R., Chowdhury, R., & Samanta, D. (2018).
Classification of EEG Signals for Cognitive Load Estimation Using Deep
Learning Architectures. In U. S. Tiwary (Ed.), Intelligent Human Computer
Interaction. IHCI 2018. Lecture Notes in Computer Science (pp. 59–68).
Springer International Publishing.
Sakai, M. (2013). Kernel Nonnegative Matrix Factorization with Constraint
Increasing the Discriminability of Two Classes for the EEG Feature
Extraction. 2013 International Conference on Signal-Image Technology &
Internet-Based Systems (SITIS), 966–970.
Sherwani, F., Shanta, S., Ibrahim, B. S. K. K., & Huq, M. S. (2016). Wavelet based
Feature Extraction for Classification of Motor Imagery Signals. 2016 IEEE
EMBS Conference on Biomedical Engineering and Sciences (IECBES),
360–364.
Solomon, O. M. (1991). PSD Computations using Welch’s Method (tech. rep.). Sandia
National Laboratories. Washington, DC, USA.
Tharwat, A. (2021). Classification Assessment Methods. Applied Computing and
Informatics, 17 (1), 168–192.
Thomas, P., Morris, A., Talbot, R., & Fagerlind, H. (2013). Identifying the Causes of
Road Crashes in Europe. Annals of Advances in Automotive Medicine, 57,
13–22.
Tzallas, A., Tsipouras, M., & Fotiadis, D. (2009). Epileptic Seizure Detection in
EEGs Using Time-Frequency Analysis. IEEE Transactions on Information
Technology in Biomedicine, 13 (5), 703–710.
Verwey, W. B. (2000). On-line Driver Workload Estimation. Effects of Road Situation
and Age on Secondary Task Measures. Ergonomics, 43 (2), 187–209.
Wen, T., & Zhang, Z. (2017). Effective and Extensible Feature Extraction
Method using Genetic Algorithm-based Frequency-domain Feature Search
for Epileptic EEG Multiclassification. Medicine, 96 (19), e6879.
Wen, T., & Zhang, Z. (2018). Deep Convolution Neural Network and
Autoencoders-Based Unsupervised Feature Learning of EEG Signals. IEEE
Access, 6, 25399–25410.
Yin, Z., & Zhang, J. (2016). Recognition of Cognitive Task Load levels using single
channel EEG and Stacked Denoising Autoencoder. Proceedings of the 35th
Chinese Control Conference (CCC), 3907–3912.
Zarjam, P., Epps, J., & Chen, F. (2011). Spectral EEG Features for Evaluating
Cognitive Load. 2011 Annual International Conference of the IEEE
Engineering in Medicine and Biology Society (EMBC), 3841–3844.
Zarjam, P., Epps, J., & Lovell, N. H. (2015). Beyond Subjective Self-Rating:
EEG Signal Classification of Cognitive Workload. IEEE Transactions on
Autonomous Mental Development, 7 (4), 301–310.
Paper A2
Islam, M. R., Barua, S., Ahmed, M. U., Begum, S., Aricò, P., Borghini, G.
& Di Flumeri, G.
Abstract
Analysis of physiological signals, more specifically electroencephalography,
is considered a very promising technique for obtaining objective measures for
mental workload evaluation; however, it requires a complex apparatus to
record and thus has poor usability for monitoring in-vehicle drivers’ mental
workload. This study proposes a methodology for constructing a novel mutual
information-based feature set from the fusion of electroencephalography
and vehicular signals acquired through a real driving experiment, and
deploys it in evaluating drivers’ mental workload. The mutual information
of the electroencephalography and vehicular signals was used as the prime
factor for the fusion of features. In order to assess the reliability of the
developed feature set, mental workload score prediction, classification and
event classification tasks were performed using different machine learning
models. Moreover, features extracted from electroencephalography were
used to compare the performance. In the prediction of the mental workload
score, expert-defined scores were used as the target values. For the classification
tasks, true labels were set from the contextual information of the experiment. An
extensive evaluation of every prediction task was carried out using different
validation methods. In predicting the mental workload score from the
proposed feature set, the lowest mean absolute error was 0.09, and in classifying
mental workload the highest accuracy was 94%. According to the outcome of
the study, it can be stated that the novel mutual information based features
† © 2020 by the Authors (CC BY 4.0). Reprinted from Islam, M. R., Barua, S., Ahmed, M. U.,
Begum, S., Aricò, P., Borghini, G., & Di Flumeri, G. (2020). A Novel Mutual Information Based
Feature Set for Drivers’ Mental Workload Evaluation Using Machine Learning. Brain Sciences,
10 (8), 551.
A2.1 Introduction
In this context, this study further investigated the possible association between
vehicular and EEG signals and their relationship with the MWL of drivers while
driving. In particular, the present work validates the fusion of the mentioned signals
with the aim of developing a feature set that can be used for in-vehicle drivers' MWL
evaluation, with a provision for reducing the complexity of repeatedly recording EEG
signals in the concerned tasks. The aims of this study can be outlined as:
while the data recorded during the second and third laps were taken into account for
the analysis. Figure A2.1 illustrates the overview of the experimental protocol.
Figure A2.1: Summary of the experimental protocol. The experiment was carried
out with two driving tasks, which were different in terms of traffic (Normal and Rush
hour), and performed in a randomized order. Each of the driving tasks was comprised
of three laps: The 1st lap was intended to make the driver habituated to the circuit, and
the other (2nd and 3rd) laps were used for analysis. Moreover, events were introduced
in the 3rd lap to assess the effect of different road scenarios when they are
absent and present, respectively. © 2020 by Islam et al. (CC BY 4.0).
With the availability of the vehicular data, its nature was investigated at the
group level with respect to different traffic situations, road conditions, presence of
events and type of events. Moreover, the change in the MWL of drivers was also studied
alongside, and prominent trends of change were observed. In the exploratory analysis,
comparisons of mean values and two-sided Wilcoxon signed-rank tests (Wilcoxon,
1992) were performed, considering the null hypothesis, H0: there is no difference
between the observations of the two measurements, and the alternate hypothesis, H1:
the observations of the two measurements are not equal, with a level of significance of
0.05. Figure A2.2 illustrates the change in drivers’ average MWL score and velocity
in different traffic-hour and road conditions along with the standard deviations. A
two-sided Wilcoxon signed-rank test was used to analyze the MWL of drivers on the Easy
and Hard segments of the track, to test if the change in segment had a significant effect
on the MWL. Drivers’ MWL while driving on the Easy segment was lower (0.42 ± 0.32)
compared to the Hard segment (0.51 ± 0.27); this difference was statistically
significant (t = 0.0, p = 0.012). Conversely, on the Easy segment of
the track, the participating drivers maintained an average velocity of 44.69 ± 14.21 kilometers
per hour (km/h), whereas the average velocity dropped to 37.81 ± 11.83 km/h on the
Hard segment. A two-sided Wilcoxon signed-rank test on the driving velocities of all
the participants for the Easy and Hard segments produced t = 0.0, p = 0.12, which
signifies the difference in velocity due to the different road segments. A similar trend of
increasing MWL was observed while drivers drove during Normal (0.40 ± 0.26) and
Rush (0.45 ± 0.34) hours. A two-sided Wilcoxon signed-rank test on drivers’ MWL
for driving during different hours produced t = 3.0, p = 0.036, signifying the change
in MWL. On the other hand, the average driving velocity during Normal hour was 42.39
± 13.70 km/h, which reduced to 40.98 ± 13.57 km/h in Rush hour. According to
the result of a two-sided Wilcoxon signed-rank test (t = 14.0, p = 0.575), there was
no significant difference between the driving velocities during Normal and Rush hours.
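The paired comparisons above can be reproduced with scipy.stats.wilcoxon; the sketch below uses made-up per-driver averages only to show the call, with the statistic and p-value interpreted exactly as in the text.

```python
import numpy as np
from scipy.stats import wilcoxon

# Illustrative paired per-driver averages (one value per participant and condition);
# the actual MWL scores and velocities come from the experiment.
mwl_easy = np.array([0.31, 0.45, 0.20, 0.55, 0.38, 0.42, 0.60, 0.35, 0.50])
mwl_hard = np.array([0.42, 0.55, 0.33, 0.61, 0.47, 0.52, 0.70, 0.44, 0.58])

stat, p = wilcoxon(mwl_easy, mwl_hard, alternative='two-sided')
print(f"t = {stat}, p = {p:.3f}")   # reject H0 at the 0.05 level when p < 0.05
```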
Figure A2.2: Average MWL score and velocity of nine participating drivers in different
(a) road segments and (b) traffic hours. The standard deviations are indicated, the
p-values obtained from the two-sided Wilcoxon signed-rank tests are presented, and
values significant at the 5% level are marked with asterisks (*). © 2020 by
Islam et al. (CC BY 4.0).
Two different events, a car and a pedestrian, were introduced during the 3rd
lap with a view to mimicking general road users and observing their effect on the drivers’ MWL and vehicle handling.
Figure A2.3: Average MWL score and velocity with standard deviation calculated
from the data of nine participating drivers with respect to events. Sub-figure (a)
illustrates the variation of MWL score and velocity with/without the presence of
events and Sub-figure (b) illustrates the effect of car and pedestrian on MWL score
and velocity. The p-values obtained from the two-sided Wilcoxon signed-rank tests are
presented, and values significant at the 5% level are marked with asterisks
(*). © 2020 by Islam et al. (CC BY 4.0).
Reduction in class uncertainty after having observed the variable vector x is called
the mutual information between X and Y, which is the same as the Kullback-Leibler divergence
between the joint density p(y, x) and its factored form p(y)p(x).
We derived a template for producing a feature set solely from the vehicular signal using
Corollary A2.3.1.1, which was derived from Theorem A2.3.1.
Corollary A2.3.1.1. Let E be a continuous random variable representing EEG
observations and V a continuous random variable representing vehicular signals,
drawn from a specific population distribution and representing the objective and the indirect
measure of MWL, respectively. The mutual information I(E, V) between variables E
and V represents the mutual dependency between them by quantifying the amount of
information they share collectively for estimating MWL, which can be derived using the
corresponding variable vectors e and v.
In association with Corollary A2.3.1.1, for better visualization, Figure A2.4
illustrates the concept of MI with respect to the variables used in this study. E,
for EEG, and V, for vehicular data, depict X and Y as described in Theorem
A2.3.1. The entropy values for the vehicular data and the EEG are represented by H(V) and
H(E). The joint entropy H(E, V) consists of the union of the entropy spaces, with the mutual
information I(E, V) in the intersecting space. Thus, H(E, V) = H(E) + H(V) −
I(E, V) is derived using set theory. e and v represent a single instance of EEG and
vehicular signal, respectively, and m represents a single instance of I(E, V), which
is the mutual information shared by the single instances e and v. Formally, I(E, V) is a
matrix of order p × q, where p and q are the numbers of vehicular and EEG features,
respectively. Each row of the matrix represents the shared information between a
single vehicular feature and every EEG feature. Furthermore, ||I(E, V)||, the norm
of each row of I(E, V), was calculated, which is a vector containing the collective
magnitude of the shared information between each vehicular feature and all EEG
features. The ||I(E, V)|| was further used to calculate the new MI-based feature vector
m′ entirely from vehicular features with Equation A2.8, where v′ is a
new instance vector of vehicular features.
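Since Equation A2.8 is not reproduced in this excerpt, the sketch below should be read as one plausible interpretation rather than the paper's exact formulation: it estimates the p × q MI matrix with scikit-learn's mutual_info_regression, takes the row norms ||I(E, V)||, and weights a new vehicular instance v′ by those norms to obtain m′. The data and the element-wise weighting are assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(3)
eeg = rng.normal(size=(2000, 45))   # stand-in EEG features (45 per epoch)
veh = rng.normal(size=(2000, 4))    # stand-in vehicular features (velocity, acc., lat./lon. acc.)

# I(E, V): p x q matrix of MI values (p = 4 vehicular, q = 45 EEG features).
mi = np.column_stack([mutual_info_regression(veh, eeg[:, j], random_state=0)
                      for j in range(eeg.shape[1])])

row_norm = np.linalg.norm(mi, axis=1)   # ||I(E, V)|| per vehicular feature

# Assumed reading of Equation A2.8: weight a new vehicular instance v' by the
# MI row norms to obtain the MI-based feature vector m'.
v_new = veh[0]
m_new = v_new * row_norm
print(m_new)
```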
Figure A2.4: Illustration of shared information between EEG and vehicular signal
spaces. © 2020 by Islam et al. (CC BY 4.0).
In the extraction of MI-based features from the data of this particular study, the
data were represented in vector form, i.e., e for EEG and v for vehicular data, which
belong to the domains E and V, respectively. Formally, E, V ∈ Rd, where d is
45 and 4, respectively, for this study. For this specific analysis, the EEG signal
was analyzed again. In fact, in the previous section of the study we employed a
well-established, even patented, approach (Aricò, Borghini, Di Flumeri, & Babiloni,
2017) to obtain the EEG-based MWL reference measurements (Di Flumeri et
al., 2015; Aricò, Borghini, Di Flumeri, Colosimo, Pozzi, et al., 2016; Di Flumeri
et al., 2018). In that case, a specific a priori hypothesis (only frontal Theta
and parietal Alpha features) and processing procedures (e.g., automatic artifact
correction/removal) were necessary for the reliability of the classification algorithm and
the possibility of employing it even online (Aricò, Borghini, Di Flumeri, Colosimo,
Bonelli, et al., 2016). In this second analysis, because of the absence of these
restrictions, we preferred to employ more complex artifact rejection algorithms
and to enlarge the feature domain: all the EEG channels throughout the scalp were
considered while extracting the features. At first, the raw EEG data were cleaned,
i.e., the artefacts were removed using ARTE (Automated aRTifacts handling in
EEG) (Barua et al., 2018), and subsequently 45 features were extracted from power
spectral density values. The IAF value was determined as the peak of the general
alpha rhythm frequency (8–12 Hz). Subsequently, the average power spectral density in the theta
band [IAF − 6, IAF − 2], the alpha band [IAF − 2, IAF + 2] and the beta band
[IAF + 2, IAF + 18] over all the EEG channels was calculated. Table A2.1 shows
the mapping between the features and the frequency rhythms. On the other hand, the
vehicular signal was resampled to the sampling frequency of the EEG signals in order
to synchronize and generate an equal number of data points for analysis. The steps of the
process are as follows: the vehicular signal at 10 Hz was first upsampled by 256.
After that, a zero-phase low-pass finite impulse response (FIR) filter was applied, and
then the signal was downsampled by 10. As a result, the resulting sample rate became
256 Hz, i.e., 256/10 times the original sample rate of 10 Hz. The vehicular feature set
contains the values of the velocity, acceleration, lateral acceleration and longitudinal acceleration
signals. Finally, the values of all the features gathered from the vehicular and EEG signals
were normalized with min–max feature scaling to the range 0 to 1, in order
to prevent the ML algorithms from picking up unimportant characteristics of the data
due to differences in the values of different features.
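The described resampling chain (upsample by 256, zero-phase FIR low-pass, downsample by 10) is what polyphase resampling performs in one step, so a sketch with scipy.signal.resample_poly and min–max scaling via scikit-learn is given below; the input series is a synthetic stand-in for a 10 Hz vehicular signal.

```python
import numpy as np
from scipy.signal import resample_poly
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(4)
velocity_10hz = rng.normal(loc=40, scale=5, size=600)   # stand-in: 60 s of velocity at 10 Hz

# Polyphase resampling: upsample by 256, zero-phase FIR low-pass, downsample by 10.
velocity_256hz = resample_poly(velocity_10hz, up=256, down=10)
print(velocity_256hz.shape)   # 600 * 256 / 10 = 15360 samples at 256 Hz

# Min-max scaling of the feature values to the range [0, 1].
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(velocity_256hz.reshape(-1, 1))
```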
Table A2.1: Mapping among different EEG channels, three significant frequency
rhythms and identifications (ID) of features. Each row represents the IDs of the features
extracted from specific frequency rhythms from the EEG channels mentioned in the
table head. © 2020 by Islam et al. (CC BY 4.0).
Rhythms     FPz  Fz  Pz  POz  Oz  AF3  AF4  F3  F4  P3  P4  P5  P6  O1  O2
theta (θ)     1   4   7   10  13   16   19  22  25  28  31  34  37  40  43
alpha (α)     2   5   8   11  14   17   20  23  26  29  32  35  38  41  44
beta (β)      3   6   9   12  15   18   21  24  27  30  33  36  39  42  45
Considering all of the available vehicular features and the features calculated
from the EEG signal, MI values were calculated using Equation A2.7. The associated
MI values, illustrated in Figure A2.5, demonstrate the shared knowledge between
the vehicular data and the EEG data. Though the MI values are not large, the two
signals still share some information, which motivates the use of MI values in the
further classification or quantification of MWL in this work. Finally, an MI-based
feature set was constructed using Equation A2.8. Table A2.2 lists the number
of features from the different feature sets which were considered in the further stages of this
study. Here, the prime concern of the study is to investigate the performance of
MI-based features in MWL assessment, and the EEG features are used as an established
objective reference measure.
Figure A2.5: Calculated MI values between EEG and vehicular signal. The columns
of the matrix correspond to 45 features extracted from EEG signals, and the rows
correspond to four vehicular features: Velocity (Velo), Acceleration (Acce), Lateral
Acceleration (LatA) and Longitudinal Acceleration (LonA). The colour bar below
illustrates the range of values for each pair of EEG and vehicular features, where dark
blue on the left corresponds to low mutual information and gradually higher mutual
information values towards the right are represented by yellow. © 2020 by Islam et al.
(CC BY 4.0).
Table A2.2: List of different feature sets and corresponding number of features used
for validating the proposed methodology. © 2020 by Islam et al. (CC BY 4.0).
the continuous value or target value. SVM-based regression and classification models
have very good generalization capability on multidimensional data and a dynamic
classification/prediction scheme, which makes them appropriate for the concerned
tasks. Moreover, the literature shows considerable use of SVM in the domain of EEG
signal analysis and MWL assessment (Saccá et al., 2018; Saha et al., 2018; Wei et
al., 2018). In this study, for all tasks, the SVM was configured with a Radial Basis
Function (RBF) kernel with degree 3. By trial and error, the regularization
parameter C was set to 1.0 and epsilon to 0.2 as the final model parameters.
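In scikit-learn terms this configuration corresponds roughly to the sketch below (note that the degree parameter only affects polynomial kernels there, so it is kept merely to mirror the stated settings); the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))              # stand-in MI-based features
y_score = rng.uniform(0, 1, size=500)      # stand-in expert-defined MWL scores
y_label = (y_score > 0.5).astype(int)      # stand-in low/high MWL labels

# RBF kernel, C = 1.0 and epsilon = 0.2, as stated in the text.
svr = SVR(kernel='rbf', degree=3, C=1.0, epsilon=0.2).fit(X, y_score)
svc = SVC(kernel='rbf', degree=3, C=1.0).fit(X, y_label)
```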
The trained ML models were further deployed in performing different tasks to
evaluate the MI-based features. The model parameters used to train different models
for respective tasks are summarised in Table A2.3.
Table A2.3: Parameters used in building different models for prediction and
classification tasks. © 2020 by Islam et al. (CC BY 4.0).
stratified sampling as a holdout test set. The rest of the data were further used for
training and validating the models using the two described validation methods, with
a view to flagging problems such as overfitting or selection bias.
The implementation of the proposed methodology and the presentation of the
results were done using the Python (van Rossum, 1995) and R (R Core Team, 2013)
environments. The Python libraries NumPy (Travis, 2015) and Pandas (McKinney,
2010) were used for preparing the data. The ML models were trained, validated and
tested using the scikit-learn (Pedregosa et al., 2011) library for Python. The plots
and graphs were drawn utilizing different methods of Matplotlib (Hunter, 2007).
Statistical tests were conducted mostly using methods from the SciPy (Virtanen et al.,
2020) library for Python and the pROC (Robin et al., 2011) package for R.
A2.4 Results
The outcome of the performed study is presented from the viewpoint of two different
tasks: prediction and classification. In the process, the developed prediction models
were evaluated using the Mean Absolute Error (MAE) and Mean Squared Error (MSE).
The evaluation of the developed MWL and event classifiers was done in terms
of confusion matrices, Receiver Operating Characteristic (ROC) curves, accuracy,
sensitivity and specificity. In addition to the mentioned performance measures,
balanced accuracy was also measured, since both of the classification tasks of this study
were binary and, due to the division of epochs from the signal recordings
and the duration of driving, the number of instances representing each class varied to
some extent.
Figure A2.6: The 10-fold Cross Validation (CV) score in terms of Mean Absolute
Error (MAE) for regression models: (a) Linear Regression (LnR) and (b) Multilayer
Perceptron (MLP), where the expert derived MWL scores were considered as true
values. For each of the models, two different sets of features were used (Table A2.2).
© 2020 by Islam et al. (CC BY 4.0).
Figure A2.7: The 10-fold CV score in terms of MAE for regression models: (a)
Random Forest (RF) and (b) Support Vector Machine (SVM), where the expert derived
MWL scores were considered as true values. For each of the models, two different sets
of features were used (Table A2.2). © 2020 by Islam et al. (CC BY 4.0).
H0 : µMI = µEEG    (A2.9)
H1 : µMI > µEEG    (A2.10)
Table A2.4: The 10-fold CV summary in terms of Mean Absolute Error and
Mean Squared Error for predicting MWL score using EEG and Mutual Information
(MI)-based features. © 2020 by Islam et al. (CC BY 4.0).
Model   Features            MAE                              MSE
                            Minimum  Maximum  Average        Minimum  Maximum  Average
LnR     EEG-based           0.11     0.22     0.16           0.02     0.07     0.04
        MI-based            0.09     0.23     0.16           0.02     0.07     0.04
MLP     EEG-based           0.09     0.22     0.17           0.02     0.07     0.04
        MI-based            0.10     0.22     0.16           0.02     0.06     0.04
RF      EEG-based           0.11     0.22     0.16           0.02     0.07     0.04
        MI-based            0.10     0.22     0.16           0.02     0.07     0.04
SVM     EEG-based           0.12     0.23     0.17           0.03     0.07     0.04
        MI-based            0.11     0.21     0.17           0.02     0.06     0.04
The results are summarised in Table A2.5, where it can be observed that, when
classifying MWL, only SVM achieved significantly higher performance when trained
with the MI-based features. On the other hand, all the classifiers performed better in
classifying events when trained with MI-based features than with EEG-based features.
                         Classifiers
Tasks                    LgR             MLP             RF              SVM
                         t      p        t      p        t      p        t      p
MWL Classification       0.0    0.994    9.0    0.896    0.0    0.994    36.0   0.006*
Event Classification     32.0   0.025*   36.0   0.006*   33.0   0.018*   36.0   0.006*
ROC curves with the associated Area Under the Curve (AUC) values for both of
the classification tasks are illustrated in Figure A2.8. The ROC curves were drawn
for the holdout test set. In both tasks, from an overall perspective, the RF
classifier outperformed the other classifiers with both feature sets in terms of AUC values.
Specifically, in MWL classification the accuracy was higher when using the EEG-based
feature set, but in event classification the MI-based feature set produced a higher AUC
value.
In addition to the AUC values calculated from the different performance metrics,
the 95% Confidence Interval (CI) of the true AUC and the Z and p values were obtained from
DeLong’s test (DeLong et al., 1988) for comparing AUC values. To conduct the test,
the null hypothesis was set as H0, and the alternative hypothesis was H1: the values of
AUC for classifiers trained on MI-based features are higher than the values of AUC
for classifiers trained on EEG-based features. Table A2.6 presents the results of
DeLong’s test, which are similar to the results obtained from the one-sided Wilcoxon
signed-rank test outlined in Table A2.5 in terms of rejecting the null hypothesis H0
at significance level 0.05.
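For reference, ROC curves and AUC values of the kind reported here can be computed on a holdout set as sketched below with scikit-learn; the data are synthetic stand-ins, and DeLong's test itself was carried out with the pROC package in R rather than in Python.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 4))      # stand-in MI-based features
y = rng.integers(0, 2, size=1000)   # stand-in binary labels (e.g., low/high MWL)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=6)

# Class probabilities on the holdout test set, then the ROC curve and its AUC.
probs = RandomForestClassifier(random_state=6).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, probs)
print("AUC =", auc(fpr, tpr))
```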
Figure A2.8: Receiver Operating Characteristic (ROC) curves for the best two
classifier models among Logistic Regression (LgR), MLP, SVM and RF. The classifiers
were deployed in two different binary classification tasks: (a) Low or High MWL and
(b) type of events – Car or Pedestrian. For each of the tasks, all the classifier models
were trained using a 10-fold cross validation approach. © 2020 by Islam et al. (CC BY
4.0).
Table A2.6: Summary of DeLong’s test (DeLong et al., 1988) to compare Area
Under the Curve (AUC) values at significance level 0.05 (5.00 × 10−2 ). The values
were summarised for LgR, MLP, RF and SVM classifiers in different classification tasks
on the holdout test set. The significant values, i.e., p < 0.05, are marked with (*). ©
2020 by Islam et al. (CC BY 4.0).
The test classification report for MWL classification is presented in Table A2.7. In
addition, Table A2.8 provides the classification report on the holdout test set,
which demonstrates improvements in classification accuracy when using
MI-based features. To assess the standalone performance of the classifiers trained with
MI-based features, the maximum accuracy achieved in each CV approach over
all the data splits was investigated. Figure A2.9 presents bar charts of
the maximum accuracy achieved by the different classifiers in classifying MWL and
events with MI-based features. It can be observed that, in 10-fold CV, the highest
accuracy was 92.15%, from the RF classifier, whereas in event classification SVM achieved
91.14%, the highest of all classifiers, when considering LOO-subject
CV.
Table A2.7: Performance summary of classifying Low and High MWL with LgR,
MLP, SVM and RF classifier models using EEG-based and MI-based features on the holdout
test set. In this task, the total number of observations was 1710, where low MWL was
considered as the positive class. The number of observations with positive and negative
class were 917 and 793, respectively. The highest accuracies obtained by using different
feature sets are marked with (*). © 2020 by Islam et al. (CC BY 4.0).
Figure A2.9: Maximum balanced accuracy in different CV method for MWL and
event classification using MI-based features by different classifier models: (a) 10-fold
CV and (b) Leave-One-Out (LOO)-subject CV. © 2020 by Islam et al. (CC BY 4.0).
A2.5 Discussion
An increase in secondary tasks, e.g., reaching for the mobile phone, interacting
with the mobile phone (touching the screen, dialing and texting), talking,
reading the screen, glancing at the phone momentarily and talking or listening
to a hands-free device, together with the primary task of driving, causes increased
MWL. According to state-of-the-art (SotA) approaches to measuring MWL,
Electroencephalography (EEG) has been proven to be a good parameter and is widely
used in research (Begum & Barua, 2013; Aricò, Borghini, Di Flumeri, Colosimo,
Pozzi, et al., 2016; Aricò, Borghini, Di Flumeri, Sciaraffa, et al., 2017), although it is
not feasible enough in terms of data acquisition, processing and decision making while
Table A2.8: Performance summary of classifying Car and Pedestrian events with LgR,
MLP, SVM and RF classifier models using EEG-based and MI-based features on the holdout
test set among 738 observations where events due to pedestrian were considered as
positive class. The number of observations with positive and negative class were 241
and 497 respectively. The highest accuracies obtained by using different feature sets
are marked with (*). © 2020 by Islam et al. (CC BY 4.0).
driving a car in a naturalistic environment. Therefore, the aim of this study is to perform
research and development to identify a methodology for constructing a novel mutual
information-based feature set from the fusion of electroencephalography and vehicular
signals and to deploy it in evaluating drivers’ mental workload. In this study, EEG
and vehicular signals were recorded through a driving experiment in real scenarios
that varied in two factors, “HOUR” and “ROAD” (Di Flumeri et al., 2018).
Here, two different events were also introduced to investigate their effects on drivers’
MWL. Since the experiment was conducted in a real environment, there might be
the presence or absence of other road users. The events leveraged the provision for
analyzing, uniformly for all participants, the effect of specific road users other than
the regular traffic on the road. According to the initial data analysis at the group level, it
was observed that different situations and road users affect the MWL of drivers and
their vehicle handling. The results of these observations (Section A2.3.2) confirmed
the experimental hypothesis, i.e., “the driving task in terms of road complexity as well
as events induced differences in driving behaviours and drivers’ experienced MWL”.
Statistical hypothesis tests were conducted on average driving velocity and drivers’
MWL, and significant (p < 0.05) differences were observed. The tests are described in
detail in Section A2.3.2.3. In addition, several comparative plots were drawn
to assess the effects visually, which are illustrated in Figures A2.2 and A2.3. In short,
the comparisons pointed out that MWL and vehicle handling both change when the
road condition or events on the road are altered. However, the effects of a change in
events on MWL and driving behaviours are stronger than those of a change in road condition.
These findings, together with the prior literature review on the advantages and
disadvantages of EEG features as a measure of MWL, formed the basis for further
analysis and increased the motivation to utilize mostly vehicular features in association with
EEG for evaluating the MWL of drivers.
To combine the EEG features and the vehicular features, the correlation between them was
calculated, and the assessed values of the correlation coefficients were negligible. On
the contrary, the prior investigations of the average driving velocity and MWL (Section
A2.3.2) showed changes when the driving environments were varied (Section A2.3.2.3).
Thus, the motivation for exploiting the MI between the EEG and vehicular signals developed
entirely from the low correlation coefficients and, conversely, the significant similarity in
the changes of MWL and the vehicular signals. On this basis, the novel concept of
utilizing MI was proposed. Here, the reference values of MI between two continuous
variables should be in the range [1, ∞] (Cover & Thomas, 2006). The MI was calculated
based on the relation between the EEG and vehicular features, where the average value
was found to be approximately 8.5, which is very low but not null. The data for
this study were recorded from a specific experiment with some specific participants,
which represented their brain activity and vehicle handling together for the respective
population distribution. However, the low MI values could also be due to the
small number of vehicular features. Despite the fact that the MI values were low,
in MWL evaluation the proposed features in some cases outperformed the established
objective measures. If there were more vehicular features, there could be a wider variety
of ways to mimic the handling of the vehicle by the participants. As a result, systems
would attain higher performance in MWL evaluation. Experiments are underway to
increase the number of vehicular features by adding other parameters from inertial
measurement unit (IMU) devices.
One of the objectives of this study was to quantify the MWL of drivers from the
proposed feature set. To test the performance of the proposed feature set,
four different ML regression methods were investigated: LnR, MLP, RF and SVM,
considering the MWL scores extracted by expert-defined methods as true values. For
the regression, the true values of the MWL score fall in the range [0, 1], where 0 represents
no MWL and 1 represents the highest MWL from the individual point of view (Di Flumeri et
al., 2018). For each of the regression models, the average MAE and MSE were
around 0.16 and 0.04, respectively (Table A2.4). Again, these errors were compared with the
results of regression models trained using EEG-based features. In this comparison, the
different feature sets produced approximately similar errors when predicting the MWL scores
of drivers; the comparison of MAE in 10-fold CV is illustrated in Figures A2.6
and A2.7. From the visualizations it was observed that the difference in average error
for the RF regression model was the lowest among the considered models, which might be
an effect of functional differences in terms of the ensemble technique (Breiman, 2001), as
described in Section A2.3.4.
In addition to MWL quantification, the performance of MWL and event
classification using MI-based features was also examined against EEG-based
features. The classifier-wise average performance on MWL and event classification was
tested using a one-sided Wilcoxon signed-rank test (Wilcoxon, 1992). Unlike in MWL
quantification, the average performance of the SVM classifier with the MI-based feature set
was significantly higher in both classification tasks (Table A2.5). According to Saha
et al. (2018), SVM is the most widely used algorithm for classification tasks based on
features extracted from EEG signals. The initial findings of this study align with
this statement. On the other hand, the other three classifiers, LgR, MLP
and RF, performed better in event classification with MI-based features. To assess
the binary classification capacity, AUC-ROC curves were plotted, where RF
outperformed all the other classifiers in terms of AUC values. For simplicity, Figure A2.8
illustrates the AUC-ROC curves only for the RF and MLP classifiers, which achieved the higher AUC
values when tested on the holdout set. In addition, DeLong’s
test (DeLong et al., 1988) for comparing AUC values demonstrated significant
differences similar to those shown by the one-sided Wilcoxon signed-rank test (Wilcoxon, 1992). It
can be observed from Table A2.6 that all the calculated AUC values are within the
95% confidence interval for the true AUC values. Moreover, the values of Z and p are
consistent, i.e., in the case of significant values of p, we accept the alternate hypothesis
that the AUC values for classifiers trained on MI-based features are higher than
the AUC values for classifiers trained on EEG-based features, and the signs of the test
statistic Z express the same relation between the AUC values. However, according
to the performance metrics, in MWL classification RF achieved the highest AUC
value of 0.92 with an accuracy of 82% using MI-based features, whereas the AUC value was 0.96
(Figure A2.8a) with an accuracy of 88% (Table A2.7) using EEG-based features. Again,
the performance on event classification (Car or Pedestrian) was evaluated with the
same ML algorithms considering both feature sets. In the event classification results,
RF with MI-based features, with an AUC value of 0.98, outperformed RF with EEG-based features,
with an AUC value of 0.95 (Figure A2.8b). The accuracy on the test set in classifying
events was found to be 94% for the RF classifier using MI-based features, which is
the best performance achieved in this whole study (Table A2.8).
A2.6 Conclusion
In conclusion, the present study was carried out through a driving experiment in
a real environment, which was aimed at investigating the utilization of vehicular
signals in the evaluation of drivers’ MWL, with a view to reducing the effort of using
EEG signals and eliminating the task of managing redundant EEG signal recording
apparatuses. This paper presents an MI-based feature set construction methodology
based on the combination of EEG and vehicular signals. The feature set was deployed to
evaluate drivers’ MWL in terms of scores and labels. Several ML models were trained
to perform the evaluation tasks. The values of MAE in MWL score prediction showed
that there was approximately no difference between the predicted scores generated
using MI-based features and those generated using EEG features. On the other hand, in the
classification tasks, it was observed that the RF classifiers performed better than the other
classifiers in labeling MWL and events in terms of the performance metrics of the ML models, but
through statistical tests it was observed that SVM performed significantly better
than all the other classifiers. When classifying MWL, the highest accuracy observed
was 88% with EEG-based features and 82% with MI-based features. Furthermore,
MI-based features outperformed EEG-based features in the classification of two specific events
(a pedestrian crossing the road and a car entering the traffic flow),
with an accuracy of 94%. Though the accuracy in MWL classification with the
developed feature set was not equivalent to that with EEG features, the accuracy in event
classification urges a re-evaluation of the proposed fusion methodology for
feature extraction with a higher number of vehicular features in future studies.
M.R.I.; methodology, M.R.I. and G.D.F.; resources, M.R.I., S.B. (Shaibal Barua) and
G.D.F; software, M.R.I.; supervision, M.U.A. and S.B. (Shahina Begum); validation,
M.R.I., S.B. (Shaibal Barua) and G.D.F.; visualization, M.R.I.; writing – original
draft preparation, M.R.I. and G.D.F.; writing – review and editing, M.R.I., S.B.
(Shaibal Barua), M.U.A., S.B. (Shahina Begum) and G.D.F. All authors have read
and agreed to the published version of the manuscript.
Bibliography
Ahmad, R. F., Malik, A. S., Kamel, N., Amin, H., Zafar, R., Qayyum, A., &
Reza, F. (2014). Discriminating the Different Human Brain States with
EEG Signals using Fractal Dimension- A Nonlinear Approach. 2014 IEEE
International Conference on Smart Instrumentation, Measurement and
Applications (ICSIMA), 1–5.
Almahasneh, H., Kamel, N., Walter, N., & Malik, A. S. (2015). EEG-based Brain
Functional Connectivity during Distracted Driving. 2015 IEEE International
Conference on Signal and Image Processing Applications (ICSIPA), 274–277.
Antonenko, P. D. (2007). The Effect of Leads on Cognitive Load and Learning
in a Conceptually Rich Hypertext Environment (PhD Thesis). Iowa State
University.
Aricò, P., Borghini, G., Di Flumeri, G., Colosimo, A., Pozzi, S., & Babiloni, F. (2016).
A Passive Brain–Computer Interface Application for the Mental Workload
Assessment on Professional Air Traffic Controllers during Realistic Air Traffic
Control Tasks. In D. Coyle (Ed.), Progress in Brain Research (pp. 295–328).
Elsevier.
Aricò, P., Borghini, G., Di Flumeri, G., & Babiloni, F. (2017). Method for
Estimating a Mental State, In particular a Workload, and Related Apparatus
(EP3143933A1).
Aricò, P., Borghini, G., Di Flumeri, G., Colosimo, A., Bonelli, S., Golfetti, A.,
Pozzi, S., Imbert, J.-P., Granger, G., Benhacene, R., & Babiloni, F. (2016).
Adaptive Automation Triggered by EEG-Based Mental Workload Index:
A Passive Brain-Computer Interface Application in Realistic Air Traffic
Control Environment. Frontiers in Human Neuroscience, 10, 539.
Aricò, P., Borghini, G., Di Flumeri, G., Sciaraffa, N., Colosimo, A., &
Babiloni, F. (2017). Passive BCI in Operational Environments: Insights,
Recent Advances, and Future Trends. IEEE Transactions on Biomedical
Engineering, 64 (7), 1431–1436.
Barua, S. (2019). Multivariate Data Analytics to Identify Driver’s Sleepiness,
Cognitive Load, and Stress (PhD Thesis). Mälardalen University.
Barua, S., Ahmed, M. U., Ahlstrom, C., Begum, S., & Funk, P. (2018). Automated
EEG Artifact Handling With Application in Driver Monitoring. IEEE
Journal of Biomedical and Health Informatics, 22 (5), 1350–1361.
Barua, S., Ahmed, M. U., & Begum, S. (2017). Classifying Drivers’ Cognitive Load
Using EEG Signals. In B. Blobel & W. Goossen (Eds.), Proceedings of the
14th International Conference on Wearable Micro and Nano Technologies for
Personalized Health (pHealth) (pp. 99–106). IOS Press.
Begum, S., & Barua, S. (2013). EEG Sensor Based Classification for Assessing
Psychological Stress. In B. Blobel, P. Pharow, & L. Parv (Eds.), Proceedings
of the 10th International Conference on Wearable Micro and Nano
Technologies for Personalized Health (pHealth) (pp. 83–88). IOS Press.
Begum, S., Barua, S., & Ahmed, M. U. (2017). In-Vehicle Stress Monitoring
Based on EEG Signal. International Journal of Engineering Research and
Applications, 07 (07), 55–71.
Borghini, G., Aricò, P., Di Flumeri, G., Cartocci, G., Colosimo, A., Bonelli, S.,
Golfetti, A., Imbert, J. P., Granger, G., Benhacene, R., Pozzi, S., &
Babiloni, F. (2017). EEG-Based Cognitive Control Behaviour Assessment:
An Ecological study with Professional Air Traffic Controllers. Scientific
Reports, 7 (1), 547.
Borghini, G., Aricò, P., Di Flumeri, G., Sciaraffa, N., Colosimo, A., Herrero, M.-T.,
Bezerianos, A., Thakor, N. V., & Babiloni, F. (2017). A New Perspective
for the Training Assessment: Machine Learning-Based Neurometric for
Augmented User’s Evaluation. Frontiers in Neuroscience, 11.
Borghini, G., Astolfi, L., Vecchiato, G., Mattia, D., & Babiloni, F. (2014). Measuring
Neurophysiological Signals in Aircraft Pilots and Car Drivers for the
Assessment of Mental Workload, Fatigue and Drowsiness. Neuroscience &
Biobehavioral Reviews, 44, 58–75.
Breiman, L. (2001). Random Forests. Machine Learning, 45 (1), 5–32.
Brookhuis, K. A., & de Waard, D. (2010). Monitoring Drivers’ Mental Workload
in Driving Simulators using Physiological Measures. Accident Analysis &
Prevention, 42 (3), 898–903.
Charles, R. L., & Nixon, J. (2019). Measuring Mental Workload using Physiological
Measures: A Systematic Review. Applied Ergonomics, 74, 221–232.
Corcoran, A. W., Alday, P. M., Schlesewsky, M., & Bornkessel-Schlesewsky, I. (2018).
Toward a Reliable, Automated Method of Individual Alpha Frequency (IAF)
Quantification. Psychophysiology, 55 (7), e13064.
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.).
John Wiley & Sons, Inc.
da Silva, F. P. (2014). Mental Workload, Task Demand and Driving Performance:
What Relation? Procedia - Social and Behavioral Sciences, 162, 310–319.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the Areas
under Two or More Correlated Receiver Operating Characteristic Curves: A
Nonparametric Approach. Biometrics, 44 (3), 837–845.
Delorme, A., & Makeig, S. (2004). EEGLAB: An Open Source Toolbox for Analysis
of Single-trial EEG Dynamics including Independent Component Analysis.
Journal of Neuroscience Methods, 134 (1), 9–21.
Di Flumeri, G., Arico, P., Borghini, G., Colosimo, A., & Babiloni, F. (2016). A New
Regression-based Method for the Eye Blinks Artifacts Correction in the EEG
Signal, without using any EOG Channel. Annual International Conference of
the IEEE Engineering in Medicine and Biology Society. IEEE Engineering
in Medicine and Biology Society. Annual International Conference, 2016,
3187–3190.
Di Flumeri, G., Borghini, G., Aricò, P., Colosimo, A., Pozzi, S., Bonelli, S., Golfetti,
A., Kong, W., & Babiloni, F. (2015). On the Use of Cognitive Neurometric
Indexes in Aeronautic and Air Traffic Management Environments. In B.
Blankertz, G. Jacucci, L. Gamberini, A. Spagnolli, & J. Freeman (Eds.),
Symbiotic Interaction. Symbiotic 2015. Lecture Notes in Computer Science
(pp. 45–56). Springer International Publishing.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2018). EEG-Based
Mental Workload Neurometric to Evaluate the Impact of Different Traffic
and Road Conditions in Real Driving Settings. Frontiers in Human
Neuroscience, 12, 509.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2019). EEG-Based
Mental Workload Assessment During Real Driving: A Taxonomic Tool for
Neuroergonomics in Highly Automated Environments. Neuroergonomics,
121–126.
Di Flumeri, G., De Crescenzio, F., Berberian, B., Ohneiser, O., Kramer,
J., Aricò, P., Borghini, G., Babiloni, F., Bagassi, S., & Piastra, S.
(2019). Brain–Computer Interface-Based Adaptive Automation to Prevent
Out-Of-The-Loop Phenomenon in Air Traffic Controllers Dealing With
Highly Automated Systems. Frontiers in Human Neuroscience, 13.
Elul, R. (1969). Gaussian Behavior of the Electroencephalogram: Changes during
Performance of Mental Task. Science, 164 (3877), 328–331.
Fastenmeier, W., & Gstalter, H. (2007). Driving Task Analysis as a Tool in Traffic
Safety Research and Practice. Safety Science, 45 (9), 952–979.
Fisher, D. L., Rizzo, M., Caird, J., & Lee, J. D. (2011). Handbook of Driving
Simulation for Engineering, Medicine, and Psychology: An Overview. In
D. L. Fisher, M. Rizzo, J. Caird, & J. D. Lee (Eds.), Handbook of Driving
Simulation for Engineering, Medicine, and Psychology (1st ed., pp. 1–16).
CRC Press.
Freedman, D. (2009). Statistical Models: Theory and Practice. Cambridge University
Press.
Galante, F., Bracco, F., Chiorri, C., Pariota, L., Biggero, L., & Bifulco, G. N. (2018).
Validity of Mental Workload Measures in a Driving Simulation Environment.
Journal of Advanced Transportation, 2018, e5679151.
Geethanjali, P., Mohan, Y. K., & Sen, J. (2012). Time Domain Feature Extraction
and Classification of EEG Data for Brain Computer Interface. 2012
9th International Conference on Fuzzy Systems and Knowledge Discovery
(FSKD), 1136–1139.
Gevins, A., & Smith, M. E. (2003). Neurophysiological Measures of Cognitive
Workload during Human-Computer Interaction. Theoretical Issues in
Ergonomics Science, 4 (1-2), 113–131.
Gevins, A., Smith, M. E., Leong, H., McEvoy, L., Whitfield, S., Du, R., & Rush, G.
(1998). Monitoring Working Memory Load during Computer-Based Tasks
with EEG Pattern Recognition Methods. Human Factors, 40 (1), 79–91.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene Selection for Cancer
Classification using Support Vector Machines. Machine Learning, 46 (1),
389–422.
Guzik, P., & Malik, M. (2016). ECG by Mobile Technologies. Journal of
Electrocardiology, 49 (6), 894–901.
Harms, L. (1991). Variation in Drivers’ Cognitive Load. Effects of Driving through
Village Areas and Rural Junctions. Ergonomics, 34 (2), 151–160.
Hart, S. G., & Staveland, L. E. (1988). Development of NASA-TLX (Task Load
Index): Results of Empirical and Theoretical Research. In P. A. Hancock &
N. Meshkati (Eds.), Advances in Psychology (pp. 139–183). North-Holland.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical
Learning. Springer New York.
Hunter, J. D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science
& Engineering, 9 (3), 90–95.
Kar, S., Bhagat, M., & Routray, A. (2010). EEG Signal Analysis for the Assessment
and Quantification of Driver’s Fatigue. Transportation Research Part F:
Traffic Psychology and Behaviour, 13 (5), 297–306.
Kim, H., Yoon, D., Lee, S.-J., Kim, W., & Park, C. H. (2018). A Study on the
Cognitive Workload Characteristics according to the Ariving Behavior in
the Urban Road. 2018 International Conference on Electronics, Information,
and Communication (ICEIC), 1–4.
Kirk, R. E. (2012). Experimental Design. In I. B. Weiner, J. Schinka, & W. F. Velicer
(Eds.), Handbook of Psychology (2nd ed.). Wiley.
Lei, S., & Roetting, M. (2011). Influence of Task Combination on EEG Spectrum
Modulation for Driver Workload Estimation. Human Factors, 53 (2),
168–179.
Li, X., Zhang, P., Song, D., Yu, G., Hou, Y., & Hu, B. (2015). EEG Based Emotion
Identification Using Unsupervised Deep Feature Learning. Proceedings of the
SIGIR2015 Workshop on Neuro-Physiological Methods in IR Research.
Manawadu, U. E., Kawano, T., Murata, S., Kamezaki, M., Muramatsu, J., & Sugano,
S. (2018). Multiclass Classification of Driver Perceived Workload Using Long
Short-Term Memory based Recurrent Neural Network. 2018 IEEE Intelligent
Vehicles Symposium (IV), 1–6.
McKinney, W. (2010). Data Structures for Statistical Computing in Python.
Proceedings of the 9th Python in Science Conference, 56–61.
118
Paper A2
Moustafa, K., Luz, S., & Longo, L. (2017). Assessment of Mental Workload:
A Comparison of Machine Learning Methods and Subjective Assessment
Techniques. In L. Longo & M. C. Leva (Eds.), Human Mental Workload:
Models and Applications. H-WORKLOAD 2017. Communications in
Computer and Information Science (pp. 30–50). Springer International
Publishing.
Paxion, J., Galy, E., & Berthelon, C. (2014). Mental Workload and Driving. Frontiers
in Psychology, 5.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011).
Scikit-learn: Machine Learning in Python. Journal of Machine Learning
Research, 12 (85), 2825–2830.
R Core Team. (2013). R: The R Project for Statistical Computing (tech. rep.). R
Foundation for Statistical Computing. Vienna, Austria.
Rahman, H., Ahmed, M. U., Barua, S., & Begum, S. (2020). Non-contact-based
Driver’s Cognitive Load Classification using Physiological and Vehicular
Parameters. Biomedical Signal Processing and Control, 55, 101634.
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller,
M. (2011). pROC: An Open-source Package for R and S+ to Analyze and
Compare ROC Curves. BMC Bioinformatics, 12 (1), 77.
Saccá, V., Campolo, M., Mirarchi, D., Gambardella, A., Veltri, P., & Morabito,
F. C. (2018). On the Classification of EEG Signal by Using an SVM
Based Algorithm. In A. Esposito, M. Faudez-Zanuy, F. C. Morabito, & E.
Pasero (Eds.), Multidisciplinary Approaches to Neural Computing. Smart
Innovation, Systems and Technologies (pp. 271–278). Springer International
Publishing.
Saha, A., Minz, V., Bonela, S., Sreeja, S. R., Chowdhury, R., & Samanta, D. (2018).
Classification of EEG Signals for Cognitive Load Estimation Using Deep
Learning Architectures. In U. S. Tiwary (Ed.), Intelligent Human Computer
Interaction. IHCI 2018. Lecture Notes in Computer Science (pp. 59–68).
Springer International Publishing.
Sakai, M. (2013). Kernel Nonnegative Matrix Factorization with Constraint
Increasing the Discriminability of Two Classes for the EEG Feature
Extraction. 2013 International Conference on Signal-Image Technology &
Internet-Based Systems (SITIS), 966–970.
Sam, D., Velanganni, C., & Evangelin, T. E. (2016). A Vehicle Control System using
a Time Synchronized Hybrid VANET to Reduce Road Accidents caused by
Human Error. Vehicular Communications, 6, 17–28.
Sherwani, F., Shanta, S., Ibrahim, B. S. K. K., & Huq, M. S. (2016). Wavelet based
Feature Extraction for Classification of Motor Imagery Signals. 2016 IEEE
EMBS Conference on Biomedical Engineering and Sciences (IECBES),
360–364.
Smith, M. E., McEvoy, L. K., & Gevins, A. (1999). Neurophysiological Indices of
Strategy Development and Skill Acquisition. Cognitive Brain Research, 7 (3),
389–404.
119
XAI for Enhancing Transparency in DSS
Solovey, E. T., Zec, M., Garcia Perez, E. A., Reimer, B., & Mehler, B. (2014).
Classifying Driver Workload using Physiological and Driving Performance
Data: Two Field Studies. Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems (CHI), 4057–4066.
Thomas, P., Morris, A., Talbot, R., & Fagerlind, H. (2013). Identifying the Causes of
Road Crashes in Europe. Annals of Advances in Automotive Medicine, 57,
13–22.
Travis, E. O. (2015). Guide to NumPy (2nd ed.). CreateSpace Independent
Publishing Platform.
van Rossum, G. (1995). Python Tutorial (tech. rep.). Centrum voor Wiskunde en
Informatica. Amsterdam, The Netherlands.
Verwey, W. B. (2000). On-line Driver Workload Estimation. Effects of Road Situation
and Age on Secondary Task Measures. Ergonomics, 43 (2), 187–209.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau,
D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt,
S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J.,
Jones, E., Kern, R., Larson, E., . . . van Mulbregt, P. (2020). SciPy 1.0:
Fundamental algorithms for scientific computing in Python. Nature Methods,
17 (3), 261–272.
Wei, Z., Wu, C., Wang, X., Supratak, A., Wang, P., & Guo, Y. (2018). Using Support
Vector Machine on EEG for Advertisement Impact Assessment. Frontiers in
Neuroscience, 12.
Wen, T., & Zhang, Z. (2018). Deep Convolution Neural Network and
Autoencoders-Based Unsupervised Feature Learning of EEG Signals. IEEE
Access, 6, 25399–25410.
Wickens, C. D., McCarley, J. S., Alexander, A. L., Thomas, L. C., Ambinder, M.,
& Zheng, S. (2008). Attention-Situation Awareness (A-SA) Model of Pilot
Error. In D. C. Foyle & B. L. Hooey (Eds.), Human Performance Modeling
in Aviation (pp. 213–239). CRC Press.
Wilcoxon, F. (1992). Individual Comparisons by Ranking Methods. In S. Kotz &
N. L. Johnson (Eds.), Breakthroughs in Statistics (pp. 196–202). Springer
New York.
World Medical Association. (2001). World Medical Association Declaration of
Helsinki: Ethical Principles for Medical Research Involving Human Subjects.
Bulletin of the World Health Organization, 79 (4), 373.
Yin, Z., & Zhang, J. (2016). Recognition of Cognitive Task Load levels using single
channel EEG and Stacked Denoising Autoencoder. Proceedings of the 35th
Chinese Control Conference (CCC), 3907–3912.
120
Paper B
Abstract
Artificial intelligence (AI) and machine learning (ML) have recently
been radically improved and are now being employed in almost every
application domain to develop automated or semi-automated systems.
Many of these systems are built on highly accurate models that nevertheless lack
explainability and interpretability; to facilitate greater human acceptability of such
systems, explainable artificial intelligence (XAI) has experienced significant growth
over the last couple of years. The literature shows
evidence from numerous studies on the philosophy and methodologies of XAI.
Nonetheless, there is an evident scarcity of secondary studies in connection
with the application domains and tasks, let alone review studies following
prescribed guidelines, that can enable researchers’ understanding of the
current trends in XAI, which could lead to future research for domain- and
application-specific method development. Therefore, this paper presents
a systematic literature review (SLR) on the recent developments of XAI
methods and evaluation metrics concerning different application domains
and tasks. This study considers 137 articles published in recent years and
identified through the prominent bibliographic databases. This systematic
synthesis of research articles resulted in several analytical findings: XAI
methods are mostly developed for safety-critical domains worldwide, deep
learning and ensemble models are being exploited more than other types
of AI/ML models, visual explanations are more acceptable to end-users
and robust evaluation metrics are being developed to assess the quality
† © 2022 by the Authors (CC BY 4.0). Reprinted from Islam, M. R., Ahmed, M. U., Barua,
S., & Begum, S. (2022). A Systematic Review of Explainable Artificial Intelligence in Terms of
Different Application Domains and Tasks. Applied Sciences, 12 (3), 1353.
B.1 Introduction
With the recent developments of artificial intelligence (AI) and machine learning
(ML) algorithms, people from various application domains have shown increasing
interest in taking advantage of these algorithms. As a result, AI and ML are
being used today in many application domains. Different AI/ML algorithms are
being employed to complement humans’ decisions in various tasks from diverse
domains, such as education, construction, health care, news and entertainment,
travel and hospitality, logistics, manufacturing, law enforcement, and finance (Rai,
2020). While these algorithms are meant to help users in their daily tasks, they still
face acceptability issues. Users often remain doubtful about the proposed decisions.
In the worst cases, users oppose an AI/ML model’s decision because its inference
mechanism is mostly opaque, unintuitive, and incomprehensible to humans. For
example, today, deep learning (DL) models demonstrate convincing results with
improved accuracy compared to established algorithms. DL models’ outstanding
performances hide one major drawback, i.e., the underlying inference mechanism
remains unknown to a user. In other words, the DL models function as a black-box
(Guidotti, Monreale, Ruggieri, et al., 2019). In general, almost all the prevailing
expert systems built with AI/ML models do not provide additional information to
support the inference mechanism, which makes systems nontransparent. Thus, it has
become a sine qua non to investigate how the inference mechanism or the decisions
of AI/ML models can be made transparent to humans so that these intelligent
systems can become more acceptable to users from different application domains
(Loyola-Gonzalez, 2019).
Upon realising the need to explain AI/ML model-based intelligent systems, a few
researchers started exploring and proposing methods long ago. The bibliographic
databases contain the earliest published evidence on the association between expert
systems and the term explanation from the mid-eighties (Neches et al., 1985). Over
time, the concept evolved into the immense and growing research domain of explainable
artificial intelligence (XAI). However, researchers did not pay much attention to XAI
until 2017/2018, as reflected in the trend of publications per year with
the keyword explainable artificial intelligence in titles or abstracts across different
bibliographic databases, illustrated in Figure B.1a. The increased attention paid
by researchers towards XAI from all the domains utilising systems developed with
AI/ML models was caused by three major incidents. First of all, the
“Explainable AI (XAI) Program” was funded in early 2017 by the Defense
Advanced Research Projects Agency (DARPA) (Gunning & Aha, 2019). After a
couple of months in mid-2017, the Chinese government released “The Development
Plan for New Generation of Artificial Intelligence” to encourage the high and strong
extensibility of AI (Xu et al., 2019). Last but not least, in mid-2018, the European
Union’s General Data Protection Regulation (GDPR) came into force (Wachter et al., 2018).
Figure B.1: Number of published articles (y-axis) on XAI made available through
four bibliographic databases in recent decades (x-axis). (a) Trend of the number of
publications from 1984 to 2020. (b) Specific number of publications from 2018 to
June 2021. The illustrated data were extracted on 01 July 2021 from four renowned
bibliographic databases. The asterisk (*) with 2021 refers to the partial data on the
number of publications on XAI until June. © 2022 by Islam et al. (CC BY 4.0).
Figure B.2: Percentage of the selected articles on different XAI methods for different
application (a) domains and (b) tasks. © 2022 by Islam et al. (CC BY 4.0).
In summary, the main objectives of this study are:
• To investigate and present the application domains and tasks for which various
XAI methods have been explored and exploited;
• To investigate and present the XAI methods, validation metrics and the type of
explanations that can be generated to increase the acceptability of the expert
systems to general users;
• To sort out the open issues and future research directions in terms of various
domains and application tasks from the methodological perspective of XAI.
The remainder of this article is arranged as follows: relevant concepts of XAI from
a technical point of view are presented in Section B.2, followed by a discussion on
prominent review studies previously conducted on XAI in Section B.3. Section B.4
contains the detailed workflow of this SLR, followed by the outcome of the performed
analyses in Section B.5. Finally, a discussion on the findings of this study and its
limitations and conclusions are presented in Sections B.6 and B.7, respectively.
B.2 Explainable Artificial Intelligence (XAI)
This section concisely presents the theoretical aspects of XAI from a technical point
of view for a better understanding of the contents of this study. Notably,
the philosophy and taxonomy of XAI have been excluded from this manuscript
because they are out of the scope of this study. In general, the term explainability
refers to the interface between a decision maker and humans; this interface must be
simultaneously comprehensible to humans and accurately represent the decision
maker (Guidotti, Monreale, Ruggieri, et al., 2019). Specifically, in XAI, the interface
between the models and the end-users is called explainability, through which an
end-user obtains clarification on the decisions that the AI/ML model provides them
with. Based on the literature, the concepts of XAI within different application
domains are categorised as stage, scope, input and output formats. This section
includes a discussion on the most relevant aspects that seem necessary to make XAI
efficiently and credibly work on different applications. Figure B.3 summarises the
prime concepts behind developing XAI applications, which were adapted from the
recent review studies by Vilone and Longo (2020, 2021a).
Figure B.3: Overview of the different concepts on developing methodologies for XAI,
adapted from the review studies by Vilone and Longo (2020, 2021a). © 2022 by Islam
et al. (CC BY 4.0).
• Ante hoc methods generate the explanation for the decision from the very beginning
of the training on the data, while aiming to achieve optimal performance. Mostly,
explanations are generated using these methods for transparent models, such as
fuzzy models and tree-based models;
• Post hoc methods comprise an external or surrogate model and the base model.
The base model remains unchanged, and the external model mimics the base
model’s behaviour to generate an explanation for the users. Generally, these
methods are associated with models whose inference mechanism remains unknown
to users, e.g., support vector machines and neural networks. Moreover, the
post hoc methods are further divided into two categories: model-agnostic and
model-specific. The model-agnostic methods apply to any AI/ML model, whereas
the model-specific methods are confined to particular models (a minimal code
sketch contrasting the two stages is given after this list).
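For illustration, a minimal sketch is given below. It assumes the scikit-learn Python
library and a toy dataset (both assumptions of this sketch, not details taken from the
reviewed studies): a shallow decision tree serves as an ante hoc, transparent model
whose learned rules are themselves the explanation, while an unchanged random forest
is probed post hoc with a model-agnostic procedure (permutation importance).

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Ante hoc: the shallow tree is interpretable by construction;
# its learned rules are themselves the explanation.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

# Post hoc, model-agnostic: the black-box model is left unchanged and an
# external procedure (permutation importance) probes it to explain its behaviour.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(black_box, X, y, n_repeats=5, random_state=0)
for name, score in sorted(zip(X.columns, result.importances_mean),
                          key=lambda pair: -pair[1])[:5]:
    print(f"{name}: {score:.3f}")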
B.3 Related Works
During the past couple of years, research on developing theories, methodologies
and tools of XAI has been very active, and over time, the popularity of XAI
as a research domain has continued to increase. Before the massive attention of
researchers towards XAI, the earliest review that could be found in the literature
was that by Lacave and Díez (2002). They reviewed the then prevailing explanation
methods precisely for Bayesian networks. In the article, the authors referred to the
level and methods of explanations followed by several techniques that were mostly
probabilistic. Later, Ribeiro et al. (2016b) reviewed the suggested interpretable
models as a solution to the problem of adding explainability to AI/ML models,
such as additive models, decision trees, attention-based networks, and sparse linear
models. Subsequently, they proposed a model-agnostic technique that fits an
interpretable model to the predictions of a black-box model while perturbing the
inputs to observe the reactions of the black-box (Ribeiro et al., 2016a).
With the remarkable implications of GDPR, an enormous number of works have
been published in recent years. The initial works included the notion of explainability
and its use from different points of view. Alonso et al. (2018) accumulated the
bibliometric information on the XAI domain to understand the research trends,
identify the potential research groups and locations, and discover possible research
directions. Goebel et al. (2018) discussed older concepts and linked them to newer
concepts such as deep learning. Black-box models were compared with the white-box
models based on their advantages and disadvantages from a practical point of view
(Loyola-Gonzalez, 2019). Additionally, survey articles were published that advocated
that explainable models replace black-box models for high-stakes decision-making
tasks (Rudin, 2019; Rai, 2020). Surveys were also conducted on the methods of
explainability and addressed the philosophy behind the usage from the perspective
of different domains (Došilović et al., 2018; Mittelstadt et al., 2019; Samek & Müller,
2019) and stakeholders (Preece et al., 2018). Some works included the specific
definitions of technical terms, possible applications, and challenges towards attaining
responsible AI (Xu et al., 2019; Barredo Arrieta et al., 2020; Longo et al., 2020).
Adadi and Berrada (2018) and Guidotti, Monreale, Ruggieri, et al. (2019) separately
studied the available methods of explainability and clustered them in the form of
explanations, e.g., textual, visual, and numeric. However, the literature contains a
good number of review studies on specific forms or methods of explaining AI/ML
models. For example, Robnik-Šikonja and Bohanec (2018) conducted a literature
review on the perturbation-based explanations for prediction models, Q. Zhang et al.
(2018) surveyed the techniques of providing visual explanations for deep learning
models, and Dağlarli (2020) reviewed the XAI approaches for deep meta-learning
models.
Above all, several review studies were conducted by Vilone and Longo (2020,
2021a, 2021b) to gather and present the recent developments in XAI. These studies
presented extensive clustering of the XAI methods and evaluation metrics, which
makes the studies more robust than the other review studies from the literature.
However, none of these studies presented insights on the application domains and
tasks that are facilitated by the developments of XAI. In parallel, researchers from
specific domains have also surveyed the possibilities and challenges from their own perspectives.
The literature contains most of the works from the medical and health care domains
(Fellous et al., 2019; Holzinger et al., 2019; Mathews, 2019; Jiménez-Luna et al.,
2020; Payrovnaziri et al., 2020; Ahmed et al., 2021; Gulum et al., 2021). In addition,
there are review articles available in the literature from the domains of industry
(Gade et al., 2019), software engineering (Dam et al., 2018), automotive (Chaczko
et al., 2020), etc.
In the studies mentioned above, the authors reviewed and analysed the concepts
and methodologies of XAI, together with the challenges and possible solutions, either
from the perspective of individual domains or without considering the application
domains and tasks. However, to our knowledge, none of the studies surveyed XAI
methods considering different application domains and tasks as a whole. Moreover,
no survey yet exists that follows an SLR guideline to review the methods and
evaluation metrics for XAI while maintaining a rigid objective throughout the study. Hence, in
this article, an established guideline for SLR (Kitchenham & Charters, 2007) was
followed to gather and analyse the available methods of adding explainability to
AI/ML models and the metrics of assessing the performance of the methods as well
as the quality of the generated explanations. In addition, this survey study produced
a general notion on the utilisation of XAI in different application domains based on
the selected articles.
Figure B.4: SLR methodology stages following the guidelines from Kitchenham and
Charters (2007). © 2022 by Islam et al. (CC BY 4.0).
• RQ1: What are the application domains and tasks in which XAI is being
explored and exploited?
– RQ1.1: What are the XAI methods that have been used in the identified
application domains and tasks?
– RQ1.2: What are the different forms of providing explanations?
– RQ1.3: What are the evaluation metrics for XAI methods used in different
application domains and tasks?
The criteria for exclusion from the SLR were articles related to the philosophy of XAI
and articles not published in any peer-reviewed conference proceedings
or journals. Throughout the article selection process, these inclusion and exclusion
criteria were considered.
Table B.1: Inclusion and exclusion criteria for the selection of research articles. ©
2022 by Islam et al. (CC BY 4.0).
To ensure the credibility of the selected articles, a checklist was designed. The list
contained 10 questions that were adapted from the guidelines for conducting an SLR
by Kitchenham and Charters (2007) and García-Holgado et al. (2020). Moreover,
to facilitate the validation, the questions were categorised on the basis of design,
conduct, analysis, and conclusion. The questions are outlined in Table B.2.
Table B.2: Questions for checking the validity of the selected articles. © 2022 by
Islam et al. (CC BY 4.0).
The process for identifying potential research articles included the identification,
screening, eligibility, and sorting of the selected articles. A step-by-step flow diagram
of this identification process is illustrated using the “Preferred Reporting Items
for Systematic Reviews and Meta-Analyses” (PRISMA) diagram by Moher et al.
(2009) in Figure B.5. The process started in June 2021. An initial search was
conducted using Google Scholar (https://scholar.google.com/ accessed on 30 June
2021) with the keyword explainable artificial intelligence to assess the available
sources of the research articles. The search results showed that most of the articles
were extracted from SpringerLink (https://link.springer.com/ accessed on 30 June
2021), Scopus (https://www.scopus.com/ accessed on 30 June 2021), IEEE Xplore
[Flow diagram stages: results of bibliographic databases (1709); articles published in the domain of AI (647); duplicate/preprint articles excluded (376); articles published in conferences/journals (277); notion/review articles excluded (159); thoroughly scanned articles (118); selected articles included for analysis (137).]
Figure B.5: Flow diagram of the research article selection process adapted
from the PRISMA flow chart by Moher et al. (2009). The number of articles
obtained/included/excluded at different stages is presented in parentheses. © 2022
by Islam et al. (CC BY 4.0).
Feature Extraction. All the selected articles on the methodologies and evaluation
of explainability were divided among the authors for thorough scanning to extract
several features. The features were extracted from several viewpoints, namely
metadata, primary task, explainability, explanation, and evaluation. The features
extracted as metadata contained information regarding the dissemination of the
selected study. Features from the viewpoint of the primary task were extracted to
gain a general idea of the variety of AI/ML models that were deliberately used to
perform classification or regression tasks prior to adding explanations to the models.
The last three sets of features were extracted related to the concept of explainability,
the explored or proposed method of making AI/ML models explainable and the
evaluation of the methods and generated explanations, respectively. After extracting
the features, a feature matrix was built to concentrate all the information for further
analysis. The principal features from the feature matrix are concisely presented in
Table B.3.
Table B.3: List of prominent features extracted from the selected articles. © 2022
by Islam et al. (CC BY 4.0).
B.5 Results
The findings from the performed analysis of the selected articles and the questionnaire
survey are presented concerning the viewpoints defined in Table B.3. To facilitate
a clear understanding, the subsections are titled with specific features, e.g., the
results from the analysis of primary tasks are presented in separate sections. Likewise,
the concepts of explainability are illustrated along with the methods to provide
explanations in the corresponding sections.
B.5.1 Metadata
This section presents the results obtained from analysing the metadata extracted from
the selected articles – primarily bibliometric data. Among the 137 selected articles,
83 were published in journals, and the rest were presented in conference proceedings.
As per the inclusion criteria of this SLR, all the articles were peer reviewed prior
to publication. In most of the articles, the authors defined relevant
keywords, which facilitate the indexing of the articles in bibliographic databases.
The author-defined keywords were compared with the keywords extracted from the
abstracts of the articles through a word cloud approach. Figure B.6 illustrates the
word cloud of the author-defined keywords and the prominent words extracted from
the abstracts. The word clouds use varying font sizes: more frequently occurring
words are presented in larger fonts (Helbich et al., 2013), and different colours are
used to differentiate words with the same frequencies.
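As an indication of how such a frequency-scaled word cloud can be produced, a minimal
sketch follows; it assumes the third-party wordcloud and matplotlib Python packages and
an illustrative keyword string, neither of which is taken from the original study.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Illustrative keyword text; in the study, author-defined keywords and abstract
# terms from the selected articles would be concatenated here instead.
keywords = (
    "explainable artificial intelligence deep learning machine learning "
    "explainability interpretability visualisation explainable artificial intelligence"
)

# Font size scales with term frequency; colours only distinguish terms.
cloud = WordCloud(width=800, height=400, background_color="white").generate(keywords)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()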
Figure B.6: Word cloud of the (a) author-defined keywords and (b) keywords
extracted from the abstracts through natural language processing. The font size is
proportional to the number of occurrences of the terms and different colours are used
to discriminate terms with equal font size. Both figures highlight prominent XAI-related
terms; the terms from the keywords are more conceptual, whereas the abstracts
contain specific terms on the methods and tasks. © 2022 by Islam et al. (CC BY 4.0).
Figure B.7 presents the number of publications related to XAI from different
countries of the world. Here, the countries were determined based on the affiliations
of the first authors of the articles. The USA pioneered research on XAI and still
holds the leading position. Several countries in Europe follow closely and have
developed an increasing number of systems considering XAI. Judging by the number
of publications, Asian countries appear comparatively less active in research and
development on XAI.
Figure B.7: Number of publications proposing new methods of XAI from different
countries of the world, and the top 10 countries by publication count (shown in
parentheses), which together account for approximately 72% of the 137 articles selected for this SLR.
The countries were determined from the affiliations of the first authors of the articles.
© 2022 by Islam et al. (CC BY 4.0).
Table B.4: List of references to selected articles published on the methods of XAI
from different application domains for the corresponding tasks. © 2022 by Islam et al.
(CC BY 4.0).
Figure B.8: Chord diagram (Tintarev et al., 2018) presenting the number of selected
articles published on the XAI methods and evaluation metrics from different application
domains for the corresponding tasks. © 2022 by Islam et al. (CC BY 4.0).
Figure B.9 presents the number of articles selected from different application domains,
further clustered in terms of AI/ML model types, stage, scope, and form of explanations.
In the following subsections, evidence of the linkage between the application
domains and the concepts of XAI is presented.
Figure B.9: Number of the selected articles published from different application
domains and clustered on the basis of AI/ML model type, stage, scope, and form of
explanations. The number of articles with each of the properties is given in parentheses.
© 2022 by Islam et al. (CC BY 4.0).
Figure B.10: Venn diagram with the number of articles using different forms of
data to assess the functional validity of the proposed XAI methodologies. The sizes
of the circles are approximately proportional to the number of articles (shown within
parentheses) that were observed in this review study. © 2022 by Islam et al. (CC BY
4.0).
Table B.5: Different models used to solve the primary task of classification or
regression and their study count. © 2022 by Islam et al. (CC BY 4.0).
Neural network models were exploited in most of the studies (63) from the selected articles. The
second-highest number of studies (21) utilised ensemble techniques for performing
the primary supervised or unsupervised tasks. Given the increased interest of
researchers in neural networks and ensemble techniques, it can reasonably be assumed
that these models were chosen to incorporate explainability because of their wide
acceptance across various domains in terms of their performance. In addition to
these well-known algorithms, some other algorithms were also used, such as probabilistic soft
logic (PSL) (Kouki et al., 2020), LSP (Dujmović, 2020), sequential rule mining (SRM)
(Anguita-Ruiz et al., 2020), preference learning (Lamy et al., 2020), Cartesian genetic
programming (CGP) (Senatore et al., 2019), Predomics (Prifti et al., 2020), and
TriRank (He et al., 2015). The acronyms of the model types are further referenced
in Table B.6 to indicate their relation to the core AI/ML models.
Throughout this study, it was evident that most of the research works were
domain-agnostic. For specific domains, healthcare, industry, and transportation were
revealed to be more exploited than other domains. In these domains, as stated
above, diverse forms of neural networks had been invoked to perform different tasks
(see Figure B.9), followed by other types of models, as listed in Table B.5. The
numbers associated with different model types in Figure B.9 and Table B.5
differ because the figure presents the number of articles whereas the table lists the
number of model variations. It was observed that, in some articles, the authors
presented their work using different models of similar types.
[Figure B.11 legend, partially recovered: Textual (14), Visualisation (52), Mixed (35).]
Figure B.11: Distribution of the selected articles based on the stage, scope, and
form of explanations. The number of articles with each of the properties is given in
parentheses. © 2022 by Islam et al. (CC BY 4.0).
Several well-known model-agnostic methods were also deployed to provide explainability in the selected articles of this review,
such as Anchors (Ribeiro et al., 2018), Explain Like I’m Five (ELI5) (Serradilla et
al., 2020), Local Interpretable Model-Agnostic Explanations (LIME) (Ribeiro et al.,
2016a), and Model Agnostic Supervised Local Explanations (MAPLE) (Plumb et al.,
2018). LIME was modified and proposed as SurvLIME by Kovalev and Utkin (2020).
Afterwards, the authors incorporated well-known Kolmogorov–Smirnov bounds to
SurvLIME and proposed SurvLIME-KS (Kovalev et al., 2020). Feature importance
was also utilised to generate numeric explanations in several research works
(Štrumbelj & Kononenko, 2014; Rehse et al., 2019; Anysz et al., 2020; Pintelas et al.,
2020). The Shapley Additive Explanations (SHAP) was proposed by Lundberg and
Lee (2017), and it was later used by several authors to generate mixed explanations
containing numbers, texts, and visualisations (D. Wang et al., 2019; Ponn et al.,
2020). In addition, another variant of SHAP, Deep-SHAP, was proposed to explicitly
explain deep learning models. Two very recent studies proposed Cluster-Aided
Space Transformation for Local Explanation (CASTLE) (La Gatta et al., 2021a)
and Pivot-Aided Space Transformation for Local Explanation (PASTLE) (La Gatta
et al., 2021b). The authors claimed that a higher quality of local explanations can
be generated with these methods than with the prevailing methods for unsupervised
and supervised tasks, respectively.
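To make the notion of a local feature-attribution explanation concrete, a minimal
sketch is given below; it assumes the shap and scikit-learn Python packages and a toy
dataset (assumptions of this sketch rather than details of any reviewed study), and it
attributes a single prediction of a tree-ensemble classifier to its input features.

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Post hoc, model-specific explainer for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:1])   # attributions for one instance

# Each value is the additive contribution of one feature to this single prediction.
for name, value in zip(X.columns, shap_values[0]):
    print(f"{name}: {value:+.4f}")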
In terms of application domains, post hoc techniques are more developed for
producing explanations at the local scope. One can see in the illustration of Figure
B.9 that the majority of the post hoc techniques were developed for complex models
such as neural networks and ensemble models. On the other hand, most of the
ante hoc techniques are associated with fuzzy and tree-based models across all the
application domains.
Carletti et al. (2019) used depth-based isolation forest feature importance (DIFFI) to
support the decisions from depth-based isolation forests (IFs) in anomaly detection
for industrial applications, and the FDE measure was developed to add precise
explainability for failure diagnosis in automated industries (ten Zeldam et al., 2018).
Moreover, several model-agnostic tools generate numeric explanations, e.g., Anchors
(Ribeiro et al., 2018), ELI5, LIME (Serradilla et al., 2020), SHAP (Ponn et al., 2020),
and LORE (D. Wang et al., 2019). Further, Table B.6 contains additional examples
of numeric explanations, and the methods are clustered on the basis of stage and the
scope of explanations. However, the numeric explanations demand high expertise in
the corresponding domains, as they are tied directly to the features. This may explain
the low number of studies on numeric explanations, as shown in Figure B.9.
Table B.6: Methods for explainability, stage (Ah: ante hoc, Ph: post hoc) and scope
(L: local, G: global) of explainability, forms of explanations (N: numeric, R: rules,
T: textual, V: visual) and the type of models used for performing the primary tasks
(refer to Table B.5 for the elaborations of the model types). © 2022 by Islam et al. (CC BY 4.0).
Figure B.13: UpSet plot presenting the distribution of different methods of evaluating
the explainable systems. The vertical bars in the bottom-left represent the number
of studies conducting each of the methods. The single and connected black circles
represent the combination of the evaluation methods and the horizontal bars illustrate
their number of studies. © 2022 by Islam et al. (CC BY 4.0).
Most of the selected articles were devoted to exploring new methodologies of XAI; in
this study, only nine of the selected articles were found to be fully devoted to the
evaluation of, and metrics for, XAI. However, all the articles proposing new methods to add
explainability considered one of the three techniques to assess their explainable
model or the explanations generated by the models. These techniques were (i) user
studies; (ii) synthetic experiments; and (iii) real experiments. The number of studies
adopting each of the techniques is illustrated in Figure B.13. It was observed that
most of the studies invoked user studies and synthetic experiments as standalone
methods for evaluating the proposed explainable systems. Very few studies only
used real experiments to evaluate their proposed systems. However, several studies
conducted a combination of user studies and real and synthetic experiments in the
evaluation process, as illustrated in the UpSet plot in Figure B.13. User studies were
mostly performed to evaluate the quality of the generated explanation in the form
of case studies and questionnaire surveys. Generally, these cases are formulated by
the researchers combining a real or synthetic scenario that is associated with some
prediction/classification output and its explanation in any of the forms presented in
Section B.5.3.4. The surveys were observed to be conducted among the respective
domain experts. They had to answer questions on the understandability and quality
of the explanations from the presented case studies. To facilitate the user studies,
Holzinger et al. (2020) proposed the System Causability Scale (SCS) to measure the
quality of explanations. In simpler terms, the SCS resembles the widely known Likert
scale (Albaum, 1997). In earlier work, Chander and Srinivasan (2018) introduced the
notion of the cognitive value of an explanation and related its function in generating
significant explanations within a given setting. Lage et al. (2018) proposed the
methodology of a user study to measure the human-interpretability of logic-based
explanations. The prime metrics were the response time for understanding, the
accuracy of understanding, and the subjective satisfaction of the users. Ribeiro et al.
Different types of experiments with real and synthetic data were performed to
quantify various metrics for the generated explanations to evaluate the quality of
the explanations. Vilone and Longo (2021b) proposed two types of evaluation
methods for assessing the quality of the explanations: objective and human-centred.
Human-centred methods are mostly performed through user studies as discussed
earlier. The prominent objective measures are briefly stated here. Guidotti,
Monreale, Ruggieri, et al. (2019) used fidelity, l-fidelity, and hit scores and proposed
the use of the Jaccard measure of stability, the number of falsified conditions in
counterfactual rules, the rate of the agreement of black-box and counterfactual
decisions for counterfactual instances, F1-score of agreement of black-box and
counterfactual decisions, etc. In another work, stability was proposed as an
objective function that acts as an inhibitor to include too many terms in the textual
explanations (Hatwell et al., 2020). To evaluate the visual explanations, Bach et al.
(2015) proposed a pixel-flipping method that enables users to discriminate between
two heatmaps. Moreover, sentence evaluation metrics, such as METEOR and CIDEr
were used to evaluate textual explanations associated with visualisations (Hendricks
et al., 2016). Samek et al. (2017) proposed the Area over the MoRF (Most Relevant
First) Curve (AOPC) to measure the impact on classification performance when
generating a visual explanation. In the proposition, the authors illustrated that a
large AOPC value indicates a very informative heatmap. AOPC can assess the amount
of information present in a visual explanation, but it cannot assess how understandable
the explanation is to the users.
In another study, Rio-Torto et al. (2020) proposed the Percentage of Meaningful
Pixels Outside the Mask (POMPOM) as another measurable criterion of explanation
quality. POMPOM is defined as the ratio between the number of meaningful pixels
outside the region of interest and the total number of pixels in the image. The
authors have also conducted a comparative study with AOPC and POMPOM. They
concluded that POMPOM generates superior results for the supervised approach
whereas AOPC has the upper hand for the unsupervised approach. Significantly,
Sokol and Flach (2020) provided a comprehensive and representative taxonomy and
associated descriptors in the form of a fact sheet with five dimensions that can help
researchers develop and evaluate new explainability approaches.
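Since POMPOM is defined arithmetically above, a minimal sketch of its computation is
given below; it assumes NumPy and boolean masks for the explanation’s meaningful pixels
and for the region of interest, a representation chosen here for illustration rather than
taken from the original work.

import numpy as np

def pompom(meaningful_mask, roi_mask):
    # Meaningful pixels of the explanation lying outside the region of interest,
    # divided by the total number of pixels in the image.
    outside_roi = meaningful_mask & ~roi_mask
    return outside_roi.sum() / meaningful_mask.size

# Toy 4x4 example: three pixels marked meaningful, two of them outside the ROI.
explanation = np.zeros((4, 4), dtype=bool)
explanation[0, 0] = explanation[1, 1] = explanation[3, 3] = True
roi = np.zeros((4, 4), dtype=bool)
roi[1:3, 1:3] = True
print(pompom(explanation, roi))  # 2 / 16 = 0.125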
The associations among the evaluation methods and different application domains
and applications are illustrated in Figure B.14. It can be easily observed that
synthetic experiments and user studies were mostly used to evaluate proposed
explainable systems from the domains of healthcare and industry. Moreover, a
good number of domain-specific studies also utilised the aforementioned evaluation
methods. In terms of specific tasks, user studies were mostly conducted for evaluating
recommender systems. Very few studies have conducted real experiments, which
were found to be from healthcare and industry domains for decision support, image
processing, and predictive maintenance.
B.6 Discussion
The continuously growing interest in the research domain of XAI worldwide resulted
in the publication of a large number of research articles containing diverse knowledge
of explainability from different perspectives. In the published articles, it is often
noticed that similar terms are used interchangeably (Barredo Arrieta et al., 2020),
which is one of the major hurdles for a new researcher to initiate work on developing
a new methodology of XAI. In addition, an “Explainable AI (XAI) Program” by
DARPA (Gunning & Aha, 2019), the Chinese Government’s “The Development Plan
for New Generation of Artificial Intelligence” (Xu et al., 2019) and the GDPR by the
EU (Wachter et al., 2018) escalated the number of research studies during the past
couple of years, as demonstrated in Figure B.1. The literature shows several review
and survey studies on XAI philosophy, taxonomy, methodology, evaluation, etc.
Nevertheless, to our knowledge, no study has been performed that has wholly focused
on the XAI methodologies from the perspective of different application domains and
tasks, let alone following some prescribed technique of conducting literature reviews.
In contrast, this SLR followed a proper guideline (Kitchenham & Charters, 2007)
that precisely defines the methodology of surveying the recent developments in XAI
techniques and evaluation criteria. One of the major advantages of an SLR is that the
methodology contains a workflow for reviewing literature by defining and addressing
specific RQs to restrict the subject matter of a study to the scope of the designated
topic. Here, the RQs presented in Section B.4.1.2 were purposefully designed to
review the development and evaluation of XAI methodologies and were addressed
with the presented outcomes of the study listed in Section B.5.
This study started with the task of scanning more than a thousand peer-reviewed
articles from different bibliographic databases. Following the process described in
Section B.4.2.1, 137 articles were thoroughly analysed to summarise the recent
developments. Among the selected articles, 19 were added through the snowballing
search, prescribed by Wohlin (2014). Here, the cited articles in the pre-selected
articles were checked to identify more articles that met this study’s inclusion criteria.
While conducting the snowballing search, some articles were found that met the
inclusion criteria in content but were published prior to the defined period of 2018–2020
(Table B.1); they were apparently very significant, as they were cited in many of the
pre-selected articles. Considering the
impact of those articles in developing XAI methodologies, they were included in
the study despite not completely meeting the inclusion criteria. Moreover, during
the screening of articles, some of the articles were unintentionally overlooked due
to the use of the specific keyword searched (explainable artificial intelligence) in the
bibliographic databases. For example, this could be the article in which Spinner et al.
(2020) presented a visual analytics framework for interactive and explainable machine
learning. For some unforeseen reason, the index terms of the article did not contain
the aforementioned search keyword, but the abstract and keywords of the articles
contained the term “Explainable AI”. The interchangeable use of several closely
related terms (e.g., interpretability, transparency, and explainability) in metadata
impedes the proper acquisition of knowledge on XAI. As a result, a few potentially
significant articles were overlooked during this review study. The absence of acquired
knowledge from the neglected articles can be considered a limitation of this SLR.
The selected articles were analysed from five different viewpoints, i.e., metadata,
primary task, explainability, the form of explanation, and the evaluation of methods
and explanations. The prominent features from the respective viewpoints are
summarised in Table B.3. The features and possible alternatives were set in such
a way that the result of the analysis can substantially address the RQs. Section
B.5 presents the outcomes of the analysis by identifying insights into the domains
and applications in which XAI is developing, the prevailing methods of generating
and evaluating explanations, etc. This information is thus readily available for
prospective researchers from miscellaneous domains to instigate research projects
on the methodological development of XAI. In addition, a questionnaire survey was
designed and administered to the authors of the selected articles with several aims:
to curate the extracted feature values from the articles, to assess the credibility of
the definition of the features, etc. The questionnaire was distributed to the authors
through email, and the response rate was approximately 50%. The responses were
apparently similar to the information extracted from the articles, except in a few
cases. For example, from the article, it was found that the input data for the method
developed by Dujmović (2020) were numeric. In contrast, from the author’s response,
the input data were mentioned as LSP, and this information was incorporated in the
analysis. This instance of curating, clarifying, and cross-checking the information
extracted from the articles advocates the need for a questionnaire survey. This
review study took advantage of the questionnaire survey to assess the credibility
of the reviewers’ feature extraction as well as to clarify the information.
During the exploration of the contents of the sorted-out articles, the first step was
to analyse the metadata. To determine the relevancy of the articles, keywords that
were explicitly defined by the authors and keywords extracted from the abstracts
were investigated in the form of word clouds following the methodology developed
by Helbich et al. (2013). It was observed that the significant terms were explainable
artificial intelligence, deep learning, machine learning, explainability, visualisation
etc. These terms were considered significant due to their larger font size in
the word cloud, which resulted from repeated occurrences of the terms in the
supplied texts. In addition, a higher number of occurrences of terms, such as deep
learning or visualisation, aligns with the higher number of studies with concepts
presented in Tables B.5 and B.6, indicating tunnel vision in XAI development. More
attention towards less investigated models, such as SVM and neuro-fuzzy models, and
towards visualisation techniques would add more value and novelty to XAI. Moreover,
the prominent terms are strongly related to the primary concept of this study,
which increases the confidence in the selected articles that they are related. In
addition, the terms from the author-defined keywords were more conceptual than
the terms from the abstracts of the articles. On the other hand, the abstracts
contained more specific terms based on the application tasks and AI/ML models.
From the metadata, the countries of the authors’ affiliations were evaluated, and
it was found that the USA leads by a significant margin in terms of the number
of publications. However, the collective publications from the countries belonging
to the EU exceed the number of publications from the USA. This high number
of publications indicated the immense impact of imposing various regulations and
expressing interest through different programs from different governments. Although
there was a development plan on XAI from the Government of China, the number of
screened articles authored by researchers affiliated with institutions in China was
comparatively low. Overall, it can be stated that the number of research studies
on XAI escalated in the regions where the government authorities put forward some
programs or regulations. Concerning the recent regulatory developments, it is safe to
assume that government funding agencies have increasingly patronised this specific
field, which has resulted in a higher number of research publications, as shown in
Figure B.7.
In the subsequent sections, significant aspects of developing XAI methods are
discussed, including addressing the RQs (defined in Section B.4.1.2) with respect to
the defined features and outcomes of the performed analyses.
With the growing variety of data forms, more concentration is required to explain models
and decisions that can be derived from other forms of data, such as graphs and texts.
However, from the findings of this study, it is apparent that some specific forms of
data are already being exploited by researchers of the respective subjects to a limited
extent; for example, graph structures are considered as input to XAI methodologies
developed with fuzzy and neuro-fuzzy models. The uses of different input data types
are illustrated in Figure B.10 within the structure of a Venn diagram as many of the
articles used multiple types of input data for their proposed models, and the Venn
diagram has the capability of presenting combined relations in terms of frequencies.
While investigating the models that were designed or applied to solve primary
tasks, it was observed that most of the studies were performed concerning neural
networks. Specifically, out of 122 articles on XAI methods, 60 articles presented
work with various neural networks. The reason behind this overwhelming interest
of researchers towards making neural networks explainable is undoubtedly the
performance of these types of models in various tasks from diverse domains. A good
number of studies utilised ensemble methods, fuzzy models and tree-based models.
Other significant types of models were found to be SVM, CBR and Bayesian models
(Table B.5).
system, finance and academia, in contrast with the domains of healthcare and
industry. Further exploitation of the methods can be performed for the less
developed domains in terms of XAI;
• One of the promising research areas in the domain of networking is the
Internet of Things (IoT). The literature indicates that several applications
such as anomaly detection (Forestiero, 2021) and building information systems
(Forestiero et al., 2008; Forestiero & Papuzzo, 2021) for IoT have been
facilitated by agent-based algorithms. These applications can be further
associated with XAI methods to make them more acceptable to end-users;
• The impact of the dataset (particularly the effect of dataset imbalance, feature
dimensionality, different types of bias problems in data acquisition and dataset,
etc.) on developing an explainable model can be assessed through studies;
• It was observed that most of the works were performed for neural networks, and
explanations were generated at the local scope through post hoc methods.
Similar cases were also observed for other models, such as SVM and ensemble
models, since their inference mechanism remains unclear to users. Although
several studies have shown approaches to produce explanations at a global
scope by mimicking the models’ behaviour, they lack performance accuracy.
More investigations can be carried out to produce an explanation in a global
scope without compromising the models’ performance for the base task;
• The major challenge of evaluating an explanation is to develop a method
that can deal with the different levels of expertise and understanding of
users. Generally, these two characteristics of users vary from person to person.
Substantial research is needed to establish a proper methodology for evaluating
the explanations based on the intended users’ expertise and capacity;
• User studies were invoked to validate explanations based on natural language,
in short, textual explanations. Automated evaluation metrics for textual
explanations are not yet prominent in the research works;
• Evaluating the quality of heatmaps as a form of visualisation remains largely
unexplored beyond visual inspection. In addition to heatmaps,
evaluation metrics for other visualisation techniques, e.g., saliency maps, are
yet to be defined.
B.7 Conclusion
methodologies. However, articles published after the mentioned period were not
analysed during this study due to time constraints. Several articles were also excluded
because of specific search keywords used in the bibliographic databases. More
comprehensive primary and secondary analyses on the methodological development of
XAI are required across different application domains. We believe such studies could
expedite the human acceptability of intelligent systems. Accommodating the varying
levels of expertise will also help understand different user groups’ needs. These studies
would explicitly explore underlying characteristics of the transparent models (fuzzy,
CBR, etc.) deployed for respective tasks, carefully analyse the dataset’s impact, and
consider well-established metrics for evaluating all forms of explanations.
Bibliography
Adadi, A., & Berrada, M. (2018). Peeking Inside the Black-Box: A Survey on
Explainable Artificial Intelligence (XAI). IEEE Access, 6, 52138–52160.
Aghamohammadi, M., Madan, M., Hong, J. K., & Watson, I. (2019). Predicting Heart
Attack Through Explainable Artificial Intelligence. In J. M. F. Rodrigues,
P. J. S. Cardoso, J. Monteiro, R. Lam, V. V. Krzhizhanovskaya, M. H. Lees,
J. J. Dongarra, & P. M. Sloot (Eds.), Computational Science – ICCS 2019
(pp. 633–645). Springer International Publishing.
Ahmed, M. U., Barua, S., & Begum, S. (2021). Artificial Intelligence, Machine
Learning and Reasoning in Health Informatics—Case Studies. In M. A. R.
Ahad & M. U. Ahmed (Eds.), Signal Processing Techniques for
Computational Health Informatics (pp. 261–291). Springer International
Publishing.
Albaum, G. (1997). The Likert Scale Revisited. International Journal of Market
Research, 39 (2), 1–21.
Alonso, J. M., Castiello, C., & Mencar, C. (2018). A Bibliometric Analysis
of the Explainable Artificial Intelligence Research Field. In J. Medina,
M. Ojeda-Aciego, J. L. Verdegay, D. A. Pelta, I. P. Cabrera, B.
Bouchon-Meunier, & R. R. Yager (Eds.), Information Processing and
Management of Uncertainty in Knowledge-Based Systems. Theory and
Foundations (pp. 3–15). Springer International Publishing.
Alonso, J. M., Ducange, P., Pecori, R., & Vilas, R. (2020). Building Explanations for
Fuzzy Decision Trees with the ExpliClas Software. 2020 IEEE International
Conference on Fuzzy Systems (FUZZ-IEEE), 1–8.
Alonso, J. M., Toja-Alamancos, J., & Bugarín, A. (2020). Experimental Study on
Generating Multi-modal Explanations of Black-box Classifiers in terms of
Gray-box Classifiers. 2020 IEEE International Conference on Fuzzy Systems
(FUZZ-IEEE), 1–8.
Amparore, E., Perotti, A., & Bajardi, P. (2021). To Trust or Not to Trust an
Explanation: Using LEAF to Evaluate Local Linear XAI Methods. PeerJ
Computer Science, 7, e479.
Angelov, P., & Soares, E. (2020). Towards Explainable Deep Neural Networks
(xDNN). Neural Networks, 130, 185–194.
Anguita-Ruiz, A., Segura-Delgado, A., Alcalá, R., Aguilera, C. M., & Alcalá-Fdez,
J. (2020). eXplainable Artificial Intelligence (XAI) for the Identification
of Biologically Relevant Gene Expression Patterns in Longitudinal Human
Studies, Insights from Obesity Research. PLOS Computational Biology,
16 (4), e1007792.
Anysz, H., Brzozowski, Ł., Kretowicz, W., & Narloch, P. (2020). Feature Importance
of Stabilised Rammed Earth Components Affecting the Compressive
Strength Calculated with Explainable Artificial Intelligence Tools. Materials,
13 (10), 2317.
Apicella, A., Isgrò, F., Prevete, R., & Tamburrini, G. (2020). Middle-Level Features
for the Explanation of Classification Systems by Sparse Dictionary Methods.
International Journal of Neural Systems, 30 (08), 2050040.
Assaf, R., & Schumann, A. (2019). Explainable Deep Neural Networks for
Multivariate Time Series Predictions. Proceedings of the Twenty-Eighth
International Joint Conference on Artificial Intelligence (IJCAI), 6488–6490.
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., & Samek, W.
(2015). On Pixel-Wise Explanations for Non-Linear Classifier Decisions by
Layer-Wise Relevance Propagation. PLOS ONE, 10 (7), e0130140.
Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik,
S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R.,
Chatila, R., & Herrera, F. (2020). Explainable Artificial Intelligence (XAI):
Concepts, Taxonomies, Opportunities and Challenges toward Responsible
AI. Information Fusion, 58, 82–115.
Csiszár, O., Csiszár, G., & Dombi, J. (2020). Interpretable Neural Networks
based on Continuous-valued Logic and Multicriteria Decision Operators.
Knowledge-Based Systems, 199, 105972.
Dağlarli, E. (2020). Explainable Artificial Intelligence (xAI) Approaches and Deep
Meta-Learning Models. In Advances and Applications in Deep Learning.
IntechOpen.
D’Alterio, P., Garibaldi, J. M., & John, R. I. (2020). Constrained Interval
Type-2 Fuzzy Classification Systems for Explainable AI (XAI). 2020 IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE), 1–8.
Dam, H. K., Tran, T., & Ghose, A. (2018). Explainable software analytics.
Proceedings of the 40th International Conference on Software Engineering:
New Ideas and Emerging Results (ICSE-NIER), 53–56.
Da’u, A., & Salim, N. (2020). Recommendation System based on Deep Learning
Methods: A Systematic Review and New Directions. Artificial Intelligence
Review, 53 (4), 2709–2748.
De, T., Giri, P., Mevawala, A., Nemani, R., & Deo, A. (2020). Explainable AI: A
Hybrid Approach to Generate Human-Interpretable Explanation for Deep
Learning Prediction. Procedia Computer Science, 168, 40–48.
de Sousa, I. P., Maria Bernardes Rebuzzi Vellasco, M., & Costa da Silva, E. (2019).
Local Interpretable Model-Agnostic Explanations for Classification of Lymph
Node Metastases. Sensors, 19 (13), 2969.
Díaz-Rodríguez, N., & Pisoni, G. (2020). Accessible Cultural Heritage through
Explainable Artificial Intelligence. Adjunct Publication of the 28th ACM
Conference on User Modeling, Adaptation and Personalization (UMAP),
317–324.
Dindorf, C., Teufl, W., Taetz, B., Bleser, G., & Fröhlich, M. (2020). Interpretability
of Input Representations for Gait Classification in Patients after Total Hip
Arthroplasty. Sensors, 20 (16), 4385.
Došilović, F. K., Brčić, M., & Hlupić, N. (2018). Explainable Artificial Intelligence:
A Survey. 2018 41st International Convention on Information and
Communication Technology, Electronics and Microelectronics (MIPRO),
0210–0215.
Dujmović, J. (2020). Interpretability and Explainability of LSP Evaluation Criteria.
2020 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 1–8.
Dutta, V., & Zielińska, T. (2020). An Adversarial Explainable Artificial Intelligence
(XAI) Based Approach for Action Forecasting. Journal of Automation,
Mobile Robotics and Intelligent Systems, 3–10.
Eisenstadt, V., Espinoza-Stapelfeld, C., Mikyas, A., & Althoff, K.-D. (2018).
Explainable Distributed Case-Based Support Systems: Patterns for
Enhancement and Validation of Design Recommendations. In M. T. Cox, P.
Funk, & S. Begum (Eds.), Case-Based Reasoning Research and Development
(pp. 78–94). Springer International Publishing.
Fellous, J.-M., Sapiro, G., Rossi, A., Mayberg, H., & Ferrante, M. (2019). Explainable
Artificial Intelligence for Neuroscience: Behavioral Neurostimulation.
Frontiers in Neuroscience, 13.
Féraud, R., & Clérot, F. (2002). A Methodology to Explain Neural Network
Classification. Neural Networks, 15 (2), 237–246.
Fernández, R. R., Martín de Diego, I., Aceña, V., Fernández-Isabel, A., & Moguerza,
J. M. (2020). Random Forest Explainability using Counterfactual Sets.
Information Fusion, 63, 196–207.
Ferreyra, E., Hagras, H., Kern, M., & Owusu, G. (2019). Depicting Decision-Making:
A Type-2 Fuzzy Logic Based Explainable Artificial Intelligence System for
Goal-Driven Simulation in the Workforce Allocation Domain. 2019 IEEE
International Conference on Fuzzy Systems (FUZZ-IEEE), 1–6.
Forestiero, A. (2021). Metaheuristic Algorithm for Anomaly Detection in Internet of
Things leveraging on a Neural-driven Multiagent System. Knowledge-Based
Systems, 228, 107241.
Forestiero, A., Mastroianni, C., & Spezzano, G. (2008). Reorganization and Discovery
of Grid Information with Epidemic Tuning. Future Generation Computer
Systems, 24 (8), 788–797.
Forestiero, A., & Papuzzo, G. (2021). Agents-Based Algorithm for a Distributed
Information System in Internet of Things. IEEE Internet of Things Journal,
8 (22), 16548–16558.
Gade, K., Geyik, S. C., Kenthapadi, K., Mithal, V., & Taly, A. (2019). Explainable AI
in Industry. Proceedings of the 25th ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining (KDD), 3203–3204.
Galhotra, S., Pradhan, R., & Salimi, B. (2021). Explaining Black-Box Algorithms
Using Probabilistic Contrastive Counterfactuals. Proceedings of the 2021
International Conference on Management of Data (SIGMOD), 577–590.
García-Holgado, A., Marcos-Pablos, S., & García-Peñalvo, F. (2020). Guidelines for
performing Systematic Research Projects Reviews. International Journal of
Interactive Multimedia and Artificial Intelligence, 6 (Regular Issue), 136–145.
García-Magariño, I., Muttukrishnan, R., & Lloret, J. (2019). Human-Centric AI for
Trustworthy IoT Systems With Explainable Multilayer Perceptrons. IEEE
Access, 7, 125562–125574.
Genc-Nayebi, N., & Abran, A. (2017). A Systematic Literature Review: Opinion
Mining Studies from Mobile App Store User Reviews. Journal of Systems
and Software, 125, 207–219.
Goebel, R., Chander, A., Holzinger, K., Lecue, F., Akata, Z., Stumpf, S., Kieseberg,
P., & Holzinger, A. (2018). Explainable AI: The New 42? In A. Holzinger,
P. Kieseberg, A. M. Tjoa, & E. Weippl (Eds.), Machine Learning and
Knowledge Extraction (pp. 295–303). Springer International Publishing.
Graziani, M., Andrearczyk, V., Marchand-Maillet, S., & Müller, H. (2020). Concept
Attribution: Explaining CNN Decisions to Physicians. Computers in Biology
and Medicine, 123, 103865.
Guidotti, R., Monreale, A., Giannotti, F., Pedreschi, D., Ruggieri, S., & Turini, F.
(2019). Factual and Counterfactual Explanations for Black Box Decision
Making. IEEE Intelligent Systems, 34 (6), 14–23.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi,
D. (2019). A Survey of Methods for Explaining Black Box Models. ACM
Computing Surveys, 51 (5), 1–42.
Gulum, M. A., Trombley, C. M., & Kantardzic, M. (2021). A Review of Explainable
Deep Learning Cancer Detection Models in Medical Imaging. Applied
Sciences, 11 (10), 4573.
Lamy, J.-B., Sekar, B., Guezennec, G., Bouaud, J., & Séroussi, B. (2019). Explainable
Artificial Intelligence for Breast Cancer: A Visual Case-based Reasoning
Approach. Artificial Intelligence in Medicine, 94, 42–53.
Laugel, T., Lesot, M.-J., Marsala, C., Renard, X., & Detyniecki, M. (2018).
Comparison-Based Inverse Classification for Interpretability in Machine
Learning. In J. Medina, M. Ojeda-Aciego, J. L. Verdegay, D. A. Pelta, I. P.
Cabrera, B. Bouchon-Meunier, & R. R. Yager (Eds.), Information Processing
and Management of Uncertainty in Knowledge-Based Systems. Theory and
Foundations (pp. 100–111). Springer International Publishing.
Lauritsen, S. M., Kristensen, M., Olsen, M. V., Larsen, M. S., Lauritsen, K. M.,
Jørgensen, M. J., Lange, J., & Thiesson, B. (2020). Explainable Artificial
Intelligence Model to Predict Acute Critical Illness from Electronic Health
Records. Nature Communications, 11 (1), 3852.
Le, T., Wang, S., & Lee, D. (2020). GRACE: Generating Concise and Informative
Contrastive Sample to Explain Neural Network Model’s Prediction.
Proceedings of the 26th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (KDD), 238–248.
Letham, B., Rudin, C., McCormick, T. H., & Madigan, D. (2015). Interpretable
Classifiers using Rules and Bayesian Analysis: Building a Better Stroke
Prediction Model. The Annals of Applied Statistics, 9 (3), 1350–1371.
Li, Y., Wang, H., Dang, L. M., Nguyen, T. N., Han, D., Lee, A., Jang, I., & Moon,
H. (2020). A Deep Learning-Based Hybrid Framework for Object Detection
and Recognition in Autonomous Driving. IEEE Access, 8, 194228–194239.
Lin, Z., Lyu, S., Cao, H., Xu, F., Wei, Y., Samet, H., & Li, Y. (2020). HealthWalks:
Sensing Fine-grained Individual Health Condition via Mobility Data.
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous
Technologies, 4 (4), 138:1–138:26.
Lindsay, L., Coleman, S., Kerr, D., Taylor, B., & Moorhead, A. (2020). Explainable
Artificial Intelligence for Falls Prediction. In M. Singh, P. K. Gupta, V.
Tyagi, J. Flusser, T. Ören, & G. Valentino (Eds.), Advances in Computing
and Data Sciences (pp. 76–84). Springer.
Longo, L., Goebel, R., Lecue, F., Kieseberg, P., & Holzinger, A. (2020). Explainable
Artificial Intelligence: Concepts, Applications, Research Challenges and
Visions. In A. Holzinger, P. Kieseberg, A. M. Tjoa, & E. Weippl
(Eds.), Machine Learning and Knowledge Extraction (pp. 1–16). Springer
International Publishing.
Lorente, M. P. S., Lopez, E. M., Florez, L. A., Espino, A. L., Martínez, J. A. I.,
& de Miguel, A. S. (2021). Explaining Deep Learning-Based Driver Models.
Applied Sciences, 11 (8), 3321.
Loyola-González, O. (2019). Black-Box vs. White-Box: Understanding Their
Advantages and Weaknesses From a Practical Point of View. IEEE Access,
7, 154096–154113.
Loyola-González, O. (2019). Understanding the Criminal Behavior in Mexico
City through an Explainable Artificial Intelligence Model. In L.
Martínez-Villaseñor, I. Batyrshin, & A. Marín-Hernández (Eds.), Advances
in Soft Computing (pp. 136–149). Springer International Publishing.
Poyiadzi, R., Sokol, K., Santos-Rodriguez, R., De Bie, T., & Flach, P. (2020). FACE:
Feasible and Actionable Counterfactual Explanations. Proceedings of the
AAAI/ACM Conference on AI, Ethics, and Society, 344–350.
Preece, A., Harborne, D., Braines, D., Tomsett, R., & Chakraborty, S. (2018).
Stakeholders in Explainable AI. Proceedings of AAAI FSS-18: Artificial
Intelligence in Government and Public Sector.
Prifti, E., Chevaleyre, Y., Hanczar, B., Belda, E., Danchin, A., Clément, K., &
Zucker, J.-D. (2020). Interpretable and Accurate Prediction Models for
Metagenomics Data. GigaScience, 9 (3), giaa010.
Rai, A. (2020). Explainable AI: From Black Box to Glass Box. Journal of the Academy
of Marketing Science, 48 (1), 137–141.
Ramos-Soto, A., & Pereira-Fariña, M. (2018). Reinterpreting Interpretability for
Fuzzy Linguistic Descriptions of Data. In J. Medina, M. Ojeda-Aciego,
J. L. Verdegay, D. A. Pelta, I. P. Cabrera, B. Bouchon-Meunier, & R. R.
Yager (Eds.), Information Processing and Management of Uncertainty in
Knowledge-Based Systems. Theory and Foundations (pp. 40–51). Springer
International Publishing.
Rehse, J.-R., Mehdiyev, N., & Fettke, P. (2019). Towards Explainable Process
Predictions for Industry 4.0 in the DFKI-Smart-Lego-Factory. KI -
Künstliche Intelligenz, 33 (2), 181–187.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016a). “Why Should I Trust You?”:
Explaining the Predictions of Any Classifier. Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), 1135–1144.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016b). Model-Agnostic Interpretability
of Machine Learning. 2016 ICML Workshop on Human Interpretability in
Machine Learning (WHI).
Ribeiro, M. T., Singh, S., & Guestrin, C. (2018). Anchors: High-Precision
Model-Agnostic Explanations. Proceedings of the AAAI Conference on
Artificial Intelligence, 32.
Rio-Torto, I., Fernandes, K., & Teixeira, L. F. (2020). Understanding the Decisions
of CNNs: An In-model Approach. Pattern Recognition Letters, 133, 373–380.
Riquelme, F., De Goyeneche, A., Zhang, Y., Niebles, J. C., & Soto, A. (2020).
Explaining VQA Predictions using Visual Grounding and a Knowledge Base.
Image and Vision Computing, 101, 103968.
Robnik-Šikonja, M., & Bohanec, M. (2018). Perturbation-Based Explanations of
Prediction Models. In J. Zhou & F. Chen (Eds.), Human and Machine
Learning: Visible, Explainable, Trustworthy and Transparent (pp. 159–175).
Springer International Publishing.
Rubio-Manzano, C., Segura-Navarrete, A., Martinez-Araneda, C., & Vidal-Castro,
C. (2021). Explainable Hopfield Neural Networks Using an Automatic
Video-Generation System. Applied Sciences, 11 (13), 5771.
Rudin, C. (2019). Stop Explaining Black Box Machine Learning Models for High
Stakes Decisions and Use Interpretable Models Instead. Nature Machine
Intelligence, 1 (5), 206–215.
Rutkowski, T., Łapa, K., & Nielek, R. (2019). On Explainable Fuzzy Recommenders
and their Performance Evaluation. International Journal of Applied
Mathematics and Computer Science, 29 (3), 595–610.
Sabol, P., Sinčák, P., Magyar, J., & Hartono, P. (2019). Semantically Explainable
Fuzzy Classifier. International Journal of Pattern Recognition and Artificial
Intelligence, 33 (12), 2051006.
Samek, W., Binder, A., Montavon, G., Lapuschkin, S., & Müller, K.-R. (2017).
Evaluating the Visualization of What a Deep Neural Network Has Learned.
IEEE Transactions on Neural Networks and Learning Systems, 28 (11),
2660–2673.
Samek, W., & Müller, K.-R. (2019). Towards Explainable Artificial Intelligence. In
W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, & K.-R. Müller (Eds.),
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning
(pp. 5–22). Springer International Publishing.
Sarathy, N., Alsawwaf, M., & Chaczko, Z. (2020). Investigation of an Innovative
Approach for Identifying Human Face-Profile Using Explainable Artificial
Intelligence. 2020 IEEE 18th International Symposium on Intelligent
Systems and Informatics (SISY), 155–160.
Sarp, S., Kuzlu, M., Cali, U., Elma, O., & Guler, O. (2021). An Interpretable Solar
Photovoltaic Power Generation Forecasting Approach Using An Explainable
Artificial Intelligence Tool. 2021 IEEE Power & Energy Society Innovative
Smart Grid Technologies Conference (ISGT), 1–5.
Schönhof, R., Werner, A., Elstner, J., Zopcsak, B., Awad, R., & Huber, M. (2021).
Feature Visualization within an Automated Design Assessment Leveraging
Explainable Artificial Intelligence Methods. Procedia CIRP, 100, 331–336.
Schorr, C., Goodarzi, P., Chen, F., & Dahmen, T. (2021). Neuroscope: An
Explainable AI Toolbox for Semantic Segmentation and Image Classification
of Convolutional Neural Nets. Applied Sciences, 11 (5), 2199.
Segura, V., Brandão, B., Fucs, A., & Vital Brazil, E. (2019). Towards Explainable AI
Using Similarity: An Analogues Visualization System. In A. Marcus & W.
Wang (Eds.), Design, User Experience, and Usability. User Experience in
Advanced Technological Environments (pp. 389–399). Springer International
Publishing.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2020).
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based
Localization. International Journal of Computer Vision, 128 (2), 336–359.
Senatore, R., Della Cioppa, A., & Marcelli, A. (2019). Automatic Diagnosis of
Neurodegenerative Diseases: An Evolutionary Approach for Facing the
Interpretability Problem. Information, 10 (1), 30.
Serradilla, O., Zugasti, E., Cernuda, C., Aranburu, A., de Okariz, J. R., &
Zurutuza, U. (2020). Interpreting Remaining Useful Life Estimations
Combining Explainable Artificial Intelligence and Domain Knowledge in
Industrial Machinery. 2020 IEEE International Conference on Fuzzy Systems
(FUZZ-IEEE), 1–8.
Shalaeva, V., Alkhoury, S., Marinescu, J., Amblard, C., & Bisson, G. (2018).
Multi-Operator Decision Trees for Explainable Time-Series Classification.
In J. Medina, M. Ojeda-Aciego, J. L. Verdegay, D. A. Pelta, I. P. Cabrera,
B. Bouchon-Meunier, & R. R. Yager (Eds.), Information Processing and
Management of Uncertainty in Knowledge-Based Systems. Theory and
Foundations (pp. 86–99). Springer International Publishing.
Soares, E., Angelov, P., & Gu, X. (2020). Autonomous Learning Multiple-Model
Zero-order Classifier for Heart Sound Classification. Applied Soft Computing,
94, 106449.
Sokol, K., & Flach, P. (2020). Explainability Fact Sheets: A Framework for Systematic
Assessment of Explainable Approaches. Proceedings of the 2020 Conference
on Fairness, Accountability, and Transparency (FAT*), 56–67.
Spinner, T., Schlegel, U., Schäfer, H., & El-Assady, M. (2020). explAIner: A
Visual Analytics Framework for Interactive and Explainable Machine
Learning. IEEE Transactions on Visualization and Computer Graphics,
26 (1), 1064–1074.
Štrumbelj, E., & Kononenko, I. (2014). Explaining Prediction Models and Individual
Predictions with Feature Contributions. Knowledge and Information
Systems, 41 (3), 647–665.
Sun, K. H., Huh, H., Tama, B. A., Lee, S. Y., Jung, J. H., & Lee, S. (2020).
Vision-Based Fault Diagnostics Using Explainable Deep Learning With Class
Activation Maps. IEEE Access, 8, 129169–129179.
Tabik, S., Gómez-Ríos, A., Martín-Rodríguez, J. L., Sevillano-García, I., Rey-Area,
M., Charte, D., Guirado, E., Suárez, J. L., Luengo, J., Valero-González,
M. A., García-Villanova, P., Olmedo-Sánchez, E., & Herrera, F. (2020).
COVIDGR Dataset and COVID-SDNet Methodology for Predicting
COVID-19 Based on Chest X-Ray Images. IEEE Journal of Biomedical and
Health Informatics, 24 (12), 3595–3605.
Tan, R., Khan, N., & Guan, L. (2020). Locality Guided Neural Networks for
Explainable Artificial Intelligence. Proceedings of the 2020 International
Joint Conference on Neural Networks (IJCNN).
ten Zeldam, S., Jong, A. D., Loendersloot, R., & Tinga, T. (2018). Automated Failure
Diagnosis in Aviation Maintenance Using eXplainable Artificial Intelligence
(XAI). Proceedings of the 4th European Conference of the PHM Society
(PHME).
Tintarev, N., Rostami, S., & Smyth, B. (2018). Knowing the Unknown: Visualising
Consumption Blind-spots in Recommender Systems. Proceedings of the 33rd
Annual ACM Symposium on Applied Computing (SAC), 1396–1399.
van der Waa, J., Nieuwburg, E., Cremers, A., & Neerincx, M. (2021). Evaluating XAI:
A Comparison of Rule-based and Example-based Explanations. Artificial
Intelligence, 291, 103404.
van der Waa, J., Schoonderwoerd, T., van Diggelen, J., & Neerincx, M.
(2020). Interpretable Confidence Measures for Decision Support Systems.
International Journal of Human-Computer Studies, 144, 102493.
van Lent, M., Fisher, W., & Mancuso, M. (2004). An Explainable Artificial
Intelligence System for Small-unit Tactical Behavior. Proceedings of the
16th Conference on Innovative Applications of Artifical Intelligence (IAAI),
900–907.
Vilone, G., & Longo, L. (2020). Explainable Artificial Intelligence: A Systematic
Review. ArXiv, (arXiv:2006.00093v4 [cs.AI]).
Vilone, G., & Longo, L. (2021a). Classification of Explainable Artificial Intelligence
Methods through Their Output Formats. Machine Learning and Knowledge
Extraction, 3 (3), 615–661.
Zhao, G., Fu, H., Song, R., Sakai, T., Chen, Z., Xie, X., & Qian, X. (2019).
Personalized Reason Generation for Explainable Song Recommendation.
ACM Transactions on Intelligent Systems and Technology, 10 (4), 41:1–41:21.
Zheng, Q., Delingette, H., & Ayache, N. (2019). Explainable Cardiac Pathology
Classification on Cine MRI with Motion Characterization by Semi-supervised
Learning of Apparent Flow. Medical Image Analysis, 56, 80–95.
Zhong, Q., Fan, X., Luo, X., & Toni, F. (2019). An Explainable Multi-attribute
Decision Model based on Argumentation. Expert Systems with Applications,
117, 42–61.
Paper C
Abstract
Numerous studies have exploited the potential of Artificial Intelligence (AI)
and Machine Learning (ML) models to develop intelligent systems in diverse
domains for complex tasks, such as analysing data, extracting features,
prediction, recommendation, etc. However, presently these systems face
acceptability issues from the end-users. The models deployed at the back
of the systems mostly analyse the correlations or dependencies between
the input and output to uncover the important characteristics of the input
features, but they lack explainability and interpretability, which causes the
acceptability issues of intelligent systems and has given rise to the research
domain of eXplainable Artificial Intelligence (XAI). In this study, to overcome these
shortcomings, a hybrid XAI approach is developed to explain an AI/ML
model’s inference mechanism as well as the final outcome. The overall
approach comprises 1) a convolutional encoder that extracts deep features
from the data and computes their relevancy with features extracted using
domain knowledge, 2) a model for classifying data points using the features
from the autoencoder, and 3) a process of explaining the model’s working
procedure and decisions using mutual information to provide global and
local interpretability. To demonstrate and validate the proposed approach,
experimentation was performed using an electroencephalography dataset
from road safety to classify drivers’ in-vehicle mental workload. The outcome
of the experiment was promising, producing a Support Vector Machine
classifier for mental workload with approximately 89% accuracy. Moreover,
the proposed approach can also provide an explanation for the classifier
model’s behaviour and decisions with the combined illustration of Shapley
values and mutual information.
† © 2021 IEEE. Reprinted, with permission, from Islam, M. R., Ahmed, M. U., & Begum,
S. (2021). Local and Global Interpretability using Mutual Information in Explainable Artificial
Intelligence. Proceedings of the 8th International Conference on Soft Computing & Machine
Intelligence (ISCMI), 191–195.
Keywords: Autoencoder · Electroencephalography · Explainability ·
Feature Extraction · Mental Workload · Mutual Information.
C.1 Introduction
Recent developments of Artificial Intelligence (AI) and Machine Learning (ML) have
been embraced in almost every domain in the form of automated and semi-automated
systems. However, with the growing popularity of these systems, the AI/ML
algorithms which act behind the systems still endure acceptability issues due to
the lack of explanations of the algorithms’ inference mechanisms and decisions.
Realising the dire need of explaining or interpreting AI/ML model-based intelligent
systems, the research domain of eXplainable Artificial Intelligence (XAI) emerged.
Currently, XAI research is immensely spreading to develop methods of generating
explanations to enhance the local and global interpretability of AI/ML models.
Global interpretability refers to interpreting any model’s inference mechanism,
whereas local interpretability indicates the understandability of a specific decision
from an AI/ML model (Guidotti et al., 2019). Several tools are already proposed
by researchers to generate explanations and interpretability of AI/ML models, such
as Local Interpretable Model Agnostic Explanations (LIME) (Ribeiro et al., 2016)
and SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017). However,
the understandability of the explanations from these tools is highly dependent on
domain expertise.
Many fields from diverse domains have already been facilitated by XAI research,
such as, image processing (Wu et al., 2020), anomaly detection (Antwarg et al., 2021),
predictive maintenance (Serradilla et al., 2021) etc. On the contrary, safety-critical
domains concerning human life, e.g., road safety, have received less attention from
the XAI researchers. Very little evidence is found in the literature, such as explaining
motorbike riding patterns (Leyli abadi & Boubezoul, 2021), and the depth of XAI
research for drivers is still shallow. However, AI/ML approaches have been well
investigated for in-vehicle road safety features, such as drivers’ drowsiness detection
and intelligent speed assistance, through utilising vehicular signals, neurophysiological
signals, etc. Specifically, neurophysiological signals, e.g., electroencephalography
(EEG) and electrocardiography (ECG), are among the major tools for assessing
a driver’s in-vehicle performance (Borghini et al., 2014). The major challenge of
utilising EEG signals in an AI/ML approach is the feature extraction procedure
that demands high involvement of experts and manual computation. Automatic
approaches have already proved to be efficient in extracting features from EEG,
leveraging the computational strength of convolutional neural network (CNN) based
autoencoders (Islam et al., 2019), but they lack explainability of the extracted features.
Autoencoders of different architectures have been exploited in several studies to
explain diverse tasks, like forecasting energy demand (Kim & Cho, 2021), classifying
time series (Leonardi et al., 2020), detecting anomalies (Antwarg et al., 2021) and
changes in temporal images (Bergamasco et al., 2020), etc. Moreover, autoencoder
has been used to enhance the quality of explanations from different explainability
tools (Shankaranarayana & Runje, 2019). All of these works contribute to explaining
decisions or enhancing explanations, but no evidence was found on explaining the deep
features that can be extracted using an autoencoder.
One of the major challenges of explaining a model and/or a decision is to extract
the underlying relation between the input and output. Recently, the concept of
mutual information has drawn the attention of XAI researchers due to its
straightforward quantification of the relevance between two random variables
(Cover & Thomas, 2006). Upon realising the potential of mutual information and the
need to explain features to induce global interpretability in AI/ML models, this study
proposes a hybrid approach of feature explanation using mutual information in
association with the explanation generated by the popular explainability tool SHAP.
The idea rests on the fact that mutual information is a proper means of representing
domain knowledge, as demonstrated in several recent studies on recommender systems
(Noshad et al., 2021), automated fault diagnosis (Luo et al., 2019), feature extraction
(Islam et al., 2020), etc.
Summarising, to expand the research domain of XAI and contribute to road safety,
this study aims at utilising the EEG signals recorded from car drivers to demonstrate
the proposed concept of explaining autoencoder-extracted features using mutual
information, followed by explaining mental workload classification to achieve local
and global interpretability. To achieve the aim of this study, two major objectives
are set and stated below:
The remaining parts of this article are arranged as follows: Section C.2 contains
a description of the materials and methods. In Section C.3, the obtained results are
presented and discussed thoroughly. Finally, the conclusion of this study and possible
research directions are stated in Section C.4.
To record EEG signals, the digital monitoring BEmicro system (EBNeuro, Italy)
was used with 15 active EEG channels (FPz, AF3, AF4, F3, Fz, F4, P5, P3, Pz,
P4, P6, POz, O1, Oz and O2) placed according to the 10–20 International System.
The sampling frequency was 256 Hz and the channel impedance was kept below
20 kΩ. During the experiments raw EEG signals were recorded and the processing
was applied offline. In particular, each EEG signal was first band-pass filtered
with a fourth-order Butterworth infinite impulse response (IIR) filter (high-pass
cut-off frequency: 1 Hz, low-pass cut-off frequency: 30 Hz). Afterwards, the ARTE
(Automated aRTifacts handling in EEG) algorithm (Barua et al., 2018) was deployed
to remove various artefacts, such as drivers’ movements and environmental noise,
from the recorded EEG signals. Finally, the EEG signals were sliced into epochs
of 2 s length (0.5 Hz frequency resolution) using a sliding window technique
with a stride of 0.125 s, keeping an overlap of 1.875 s between two consecutive epochs.
The windowing technique was performed to obtain a higher number of observations in
comparison with the number of variables and to maintain the stationarity condition of
the EEG signals (Elul, 1969).
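As an illustration of the preprocessing steps just described, the following minimal Python sketch applies a fourth-order Butterworth band-pass filter (1–30 Hz) and the 2 s / 0.125 s sliding-window epoching; the zero-phase filtering function and the array layout are assumptions, and the ARTE artefact-handling step is omitted.

```python
# Minimal sketch (assumptions noted above): band-pass filtering and
# sliding-window epoching of a raw EEG array of shape (n_channels, n_samples).
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 256           # sampling frequency (Hz)
EPOCH_S = 2.0      # epoch length (s)
STRIDE_S = 0.125   # sliding-window stride (s), i.e. 1.875 s overlap

def bandpass(eeg, low=1.0, high=30.0, order=4):
    """Fourth-order Butterworth band-pass filter (1-30 Hz), applied per channel."""
    sos = butter(order, [low, high], btype="bandpass", fs=FS, output="sos")
    return sosfiltfilt(sos, eeg, axis=-1)

def make_epochs(eeg):
    """Slice the filtered signal into 2 s epochs with a 0.125 s stride."""
    win, step = int(EPOCH_S * FS), int(STRIDE_S * FS)
    starts = range(0, eeg.shape[-1] - win + 1, step)
    return np.stack([eeg[:, s:s + win] for s in starts])  # (n_epochs, n_channels, 512)

# Example: epochs = make_epochs(bandpass(raw_eeg))  # raw_eeg: (15, n_samples) at 256 Hz
```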
reconstruct the input signal from the features through minimising the residuals. The
autoencoder trains through the process of encoding and reconstruction of predefined
epochs and batch size. Here, several tweaking of the number of convolutional layers
and associated parameters were performed and the encoder was finalised with three
convolutional layers and three max-pooling layers followed by a flattening layer.
Table C.1 presents the summary of the layers of the encoder with a total of 732
parameters to train. The output shape of the input layer is (512, 16, 1), which
contains one clean EEG signal epoch of 2 s length (at 256 Hz sampling frequency) from
15 channels, and one channel was introduced with zeros to facilitate the design of the
encoder. The decoder was designed in the inverse order of the structure of the encoder
containing four convolutional layers and three upsampling layers facilitating the
depooling mechanism. In each of the convolutional layers, batch normalisation with
ReLU activation function was invoked with zero padding. The developed autoencoder
utilised RMSprop optimisation with a learning rate of 0.002 and binary cross-entropy
as the loss function. Finally, 32 features were extracted from the cleaned EEG epochs
in accordance with the output shape of the flattening layer of the encoder.
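A minimal Keras sketch of an encoder–decoder of the described shape is given below; the filter counts, kernel sizes and pooling windows are assumptions (the exact configuration is the one summarised in Table C.1), chosen here only so that a (512, 16, 1) epoch is flattened into 32 features.

```python
# Minimal sketch of the convolutional autoencoder described above.
# Filter counts, kernel sizes and pooling windows are assumptions (see Table C.1).
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)   # zero padding
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

inp = layers.Input(shape=(512, 16, 1))                      # one 2 s epoch, 15+1 channels
x = inp
for f in (8, 4, 2):                                         # encoder: 3 conv + 3 max-pool
    x = conv_block(x, f)
    x = layers.MaxPooling2D(pool_size=(4, 2))(x)
code = layers.Flatten(name="features")(x)                   # 8 * 2 * 2 = 32 features

x = layers.Reshape((8, 2, 2))(code)                         # decoder mirrors the encoder
for f in (2, 4, 8):                                         # 3 upsampling + 3 conv layers
    x = layers.UpSampling2D(size=(4, 2))(x)
    x = conv_block(x, f)
out = layers.Conv2D(1, (3, 3), padding="same", activation="sigmoid")(x)  # 4th decoder conv

autoencoder = models.Model(inp, out)
encoder = models.Model(inp, code)
autoencoder.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.002),
                    loss="binary_crossentropy")
```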
After the preparation of feature sets, labels were added to the feature vectors
according to the experimental road segment and time of driving based on the
experimental design. Specifically, the feature vectors extracted from driving sessions
on the hard road segment during rush hour were labelled as high mental workload. On
the other hand, low mental workload labels were added to the features extracted
from the data recorded during the normal hour while driving on the easier road segment, as
prescribed by the experts in the experimental protocol (Di Flumeri et al., 2018; Islam
et al., 2020).
(Luo et al., 2019). In fact, a recent study showed the use of mutual information in
developing a combined feature set from correlated features from different measurements
(Islam et al., 2020).
Theoretically, if X and Y are continuous random variables where X, Y ∈ R^d, the
mutual information between X and Y is termed I(X, Y) and formulated as shown
in Equation C.1 (Cover & Thomas, 2006).

I(X, Y) = \int_{y} \int_{x} p(y, x) \log_2 \frac{p(y, x)}{p(y)\, p(x)} \, dx \, dy    (C.1)
In this study, F_s and F_a were considered for the spectral and autoencoder-extracted
features, respectively, depicting X and Y as stated in Equation C.1. Thus, computing
the mutual information I(F_s, F_a) generates the means of explaining the
autoencoder-extracted features by the spectral features as a substitute for domain
knowledge. Afterwards, for a better understanding of the explanation, the mutual
information values are illustrated using a Chord diagram (Tintarev et al., 2018) for
the whole model or a single decision.
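A minimal sketch of computing I(F_s, F_a) in practice is shown below, assuming both feature sets are available as NumPy arrays with one row per epoch; the scikit-learn estimator returns values in nats, so they are divided by ln 2 to match the log2 formulation of Equation C.1.

```python
# Minimal sketch: pairwise mutual information between spectral features (F_s) and
# autoencoder-extracted features (F_a). Variable names are illustrative.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_matrix(F_s, F_a, random_state=0):
    """Return an (n_spectral, n_autoencoder) matrix of pairwise MI in bits."""
    mi = np.zeros((F_s.shape[1], F_a.shape[1]))
    for j in range(F_a.shape[1]):
        mi[:, j] = mutual_info_regression(F_s, F_a[:, j], random_state=random_state)
    return mi / np.log(2)   # convert nats to bits (log2 formulation)

# Example: mi = mi_matrix(spectral_features, encoder.predict(epochs_4d))
# The entries of `mi` can then feed a Chord diagram linking spectral feature
# groups to autoencoder features, for the whole model or a single decision.
```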
The outcome of this study is presented in this section from two different aspects:
mental workload classification, and explanation of the trained classifier model followed
by explanation of a single decision. For each of the aspects, the results are discussed in
corresponding subsections.
Figure C.1: Global explanation of mental workload classifier model with SHAP values
with bar plot (left) and mutual information illustrated with Chord diagrams for six
spectral feature groups (right). © 2021 IEEE.
study. However, the difficulty for end users of understanding the Shapley values
associated with the autoencoder-extracted features was overcome using the spectral
feature groups of EEG signals. Mutual information values, in naive terms the relevance
between the spectral features and autoencoder features, were calculated and then
represented with Chord diagrams to facilitate the domain experts.
C.4 Conclusion
The contribution presented in this article is twofold: (1) proposal and illustration of
a novel approach of using mutual information to explain EEG features extracted by
convolutional autoencoder; this approach, to our knowledge, is the only procedure to
explain the autoencoder extracted features; (2) demonstration of explaining drivers’
mental workload classification at local and global scope, based on autoencoder
extracted EEG features using SHAP and mutual information. In broader terms, the work
demonstrates how to explain EEG signal classification, which can be further adopted in
other domains utilising EEG signals.
The experimental results of this study have been encouraging, but there is
space for improvements and further research. In terms of deep learning techniques,
investigating other architectures, such as Recurrent Neural Network (RNN) as a
combined alternative to the working sequence of autoencoder and RF or SVM
Bibliography
Antwarg, L., Miller, R. M., Shapira, B., & Rokach, L. (2021). Explaining Anomalies
Detected by Autoencoders using Shapley Additive Explanations. Expert
Systems with Applications, 186, 115736.
Barua, S., Ahmed, M. U., Ahlstrom, C., Begum, S., & Funk, P. (2018). Automated
EEG Artifact Handling With Application in Driver Monitoring. IEEE
Journal of Biomedical and Health Informatics, 22 (5), 1350–1361.
Bergamasco, L., Saha, S., Bovolo, F., & Bruzzone, L. (2020). An Explainable
Convolutional Autoencoder Model for Unsupervised Change Detection. The
International Archives of the Photogrammetry, Remote Sensing and Spatial
Information Sciences, XLIII-B2-2020, 1513–1519.
Borghini, G., Astolfi, L., Vecchiato, G., Mattia, D., & Babiloni, F. (2014). Measuring
Neurophysiological Signals in Aircraft Pilots and Car Drivers for the
Assessment of Mental Workload, Fatigue and Drowsiness. Neuroscience &
Biobehavioral Reviews, 44, 58–75.
Corcoran, A. W., Alday, P. M., Schlesewsky, M., & Bornkessel-Schlesewsky, I. (2018).
Toward a Reliable, Automated Method of Individual Alpha Frequency (IAF)
Quantification. Psychophysiology, 55 (7), e13064.
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.).
John Wiley & Sons, Inc.
Di Flumeri, G., Borghini, G., Aricò, P., Sciaraffa, N., Lanzi, P., Pozzi, S., Vignali, V.,
Lantieri, C., Bichicchi, A., Simone, A., & Babiloni, F. (2018). EEG-Based
Mental Workload Neurometric to Evaluate the Impact of Different Traffic
and Road Conditions in Real Driving Settings. Frontiers in Human
Neuroscience, 12, 509.
Elul, R. (1969). Gaussian Behavior of the Electroencephalogram: Changes during
Performance of Mental Task. Science, 164 (3877), 328–331.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi,
D. (2019). A Survey of Methods for Explaining Black Box Models. ACM
Computing Surveys, 51 (5), 1–42.
Kim, J.-Y., & Cho, S.-B. (2021). Explainable Prediction of Electric Energy Demand
using a Deep Autoencoder with Interpretable Latent Space. Expert Systems
with Applications, 186, 115842.
Leonardi, G., Montani, S., & Striani, M. (2020). Deep Feature Extraction for
Representing and Classifying Time Series Cases: Towards an Interpretable
Approach in Haemodialysis. Proceedings of the Thirty-Third International
Florida Artificial Intelligence Research Society Conference (FLAIRS),
417–420.
Leyli abadi, M., & Boubezoul, A. (2021). Deep Neural Networks for Classification of
Riding Patterns: With a Focus on Explainability. Proceedings of the European
Symposium on Artificial Neural Networks, Computational Intelligence and
Machine Learning (ESANN), 481–486.
Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model
Predictions. Proceedings of the 31st International Conference on Neural
Information Processing Systems (NeurIPS), 4768–4777.
Luo, X., Li, X., Wang, Z., & Liang, J. (2019). Discriminant Autoencoder for Feature
Extraction in Fault Diagnosis. Chemometrics and Intelligent Laboratory
Systems, 192, 103814.
Noshad, Z., Bouyer, A., & Noshad, M. (2021). Mutual Information-based
Recommender System using Autoencoder. Applied Soft Computing, 109,
107547.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”:
Explaining the Predictions of Any Classifier. Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), 1135–1144.
Serradilla, O., Zugasti, E., Ramirez de Okariz, J., Rodriguez, J., & Zurutuza, U.
(2021). Adaptable and Explainable Predictive Maintenance: Semi-Supervised
Deep Learning for Anomaly Detection and Diagnosis in Press Machine Data.
Applied Sciences, 11 (16), 7376.
Shankaranarayana, S. M., & Runje, D. (2019). ALIME: Autoencoder Based
Approach for Local Interpretability. In H. Yin, D. Camacho, P. Tino, A. J.
Tallón-Ballesteros, R. Menezes, & R. Allmendinger (Eds.), Intelligent Data
Engineering and Automated Learning – IDEAL 2019 (pp. 454–463). Springer
International Publishing.
Tintarev, N., Rostami, S., & Smyth, B. (2018). Knowing the Unknown: Visualising
Consumption Blind-spots in Recommender Systems. Proceedings of the 33rd
Annual ACM Symposium on Applied Computing (SAC), 1396–1399.
Wu, S.-L., Tung, H.-Y., & Hsu, Y.-L. (2020). Deep Learning for Automatic Quality
Grading of Mangoes: Methods and Insights. 2020 19th IEEE International
Conference on Machine Learning and Applications (ICMLA), 446–453.
Paper D
Abstract
Understanding individual car drivers’ behavioural variations and
heterogeneity is a significant aspect of developing car simulator technologies,
which are widely used in transport safety. This study also characterizes the
heterogeneity in drivers’ behaviour in terms of risk and hurry, using
both real-time on-track and in-simulator driving performance features.
Machine learning (ML) interpretability has become increasingly crucial for
identifying accurate and relevant structural relationships between spatial
events and the factors that explain drivers’ behaviour when it is classified
and when the explanations for it are evaluated. However, the high predictive
power of ML algorithms ignores the characteristics of non-stationary domain
relationships in spatiotemporal data (e.g., dependence, heterogeneity),
which can lead to incorrect interpretations and poor management decisions.
This study addresses this critical issue of ‘interpretability’ in ML-based
modelling of structural relationships between the events and corresponding
features of the car drivers’ behavioural variations. In this work, an
exploratory experiment is described that contains simulator and real
driving concurrently, with the goal of enhancing the simulator technologies.
Here, initially, with heterogeneous data, several analytic techniques for
simulator bias in drivers’ behaviour have been explored. Afterwards, five
different ML classifier models were developed to classify risk and hurry in
drivers’ behaviour in real and simulator driving. Furthermore, two different
† © 2023 by SCITEPRESS (CC BY-NC-ND 4.0). Reprinted, with permission, from Islam,
M. R., Ahmed, M. U., & Begum, S. (2023). Interpretable Machine Learning for Modelling and
Explaining Car Drivers’ Behaviour: An Exploratory Analysis on Heterogeneous Data. Proceedings
of the 15th International Conference on Agents and Artificial Intelligence (ICAART), 392–404.
D.1 Introduction
Artificial Intelligence (AI) and Machine Learning (ML) models are the basis of
intelligent systems and continuously gaining popularity across diverse domains. The
prime reason behind the models’ growing popularity is the outstanding and accurate
computation of features and the prediction based on the features. Among the AI/ML
facilitated domains, the transportation domain is notably using different models
within the framework of driving simulators. Driving simulators are increasingly
adopted in different countries for diverse objectives, e.g., driver training, road safety,
etc. (Sætren et al., 2019).
In conjunction with the increased demands on explanations for the decisions
of AI/ML models in other domains, the need for explanation is also rising for
the automated actions in the simulators. However, different fields from other
domains are already facilitated with eXplainable AI (XAI) research, e.g., anomaly
detection (Antwarg et al., 2021), predictive maintenance (Serradilla et al., 2021),
image processing (Wu et al., 2020), etc. Conversely, road-safety-related simulator
development and enhancement have been less exploited in XAI research. Though
a few studies are available in the literature that explain the riding patterns of
motorbikes (Leyli abadi & Boubezoul, 2021), drivers’ fatigue prediction (Zhou et al.,
2022), etc., research studies on drivers’ behaviour are scarce in terms of XAI. In
addition, research on the evaluation of explanations for the predictions or decisions
of an AI/ML model is also in a nascent state.
Realising the need for research to enhance the simulation technologies and the
complementary requirement for the development of the explanation models, this
research study was conducted. The main objective of the work presented in this
paper can be outlined as follows:
• Explore the variation of drivers’ behaviour in the simulator and track driving
to enhance the simulator technologies.
• Develop classifiers for drivers’ behaviour in terms of risk and hurry while
driving.
• Explain the decisions of drivers’ behaviour classifiers and evaluate the
explanations.
The remaining sections of this paper are organised as follows: Section D.2
introduces the materials and methodologies used in this study. The results and
corresponding discussions on the findings are presented in Section D.3. Finally,
Section D.4 contains the concluding remarks and directions for future research works.
Figure D.1: The experimental route for simulation and track tests. A detailed
description is presented in Section D.2.1. © 2023 by SCITEPRESS (CC BY-NC-ND
4.0).
Figure D.2: The car simulator developed with DriverSeat 650 ST was used for
conducting the simulation tests. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
aim, the experiment was planned with the simulator and track driving tests. In both
the simulation and track tests, participant drivers were required to drive along the
identical route for seven laps with different variables. This design further facilitated
the analysis of varying behaviour while driving on track and simulation. The route of
the experiment is illustrated in Figure D.1. For the track test, the route was prepared
with proper road markings, signals etc. in an old airport in Kraków, Poland. In
simulation tests, a modified variant of DriverSeat 650 ST (Figure D.2) simulation
cockpit was used. As annotated in Figure D.1, each participant started the lap
from point A, drove straight up to the roundabout at point B, took the third exit
of the roundabout, drove up to point C to take a right turn, drove straight up to
point D, then took a U-turn and came back to point C for a left turn, and then
drove through points B (roundabout), E (right turn), C (left turn) and finished at
point F after a left curve. For the simulation test, a similar route was designed
virtually where the participants drove following the same protocol. In both tests, a
participant drove through the route for seven laps with different scenarios containing
varied environmental and driver variables as outlined in Table D.1. The scenarios
associated with the laps were designed with the consultation of psychologists and
domain experts.
Table D.1: Associated scenarios for the laps of the experimental simulator and track
driving with varying driving conditions. © 2023 by SCITEPRESS (CC BY-NC-ND
4.0).
(Environmental variables: Events, Traffic. Driver variables: Habituation, Hurry, Frustration, Surprise.
The events – Roundabout, Left Turn, Intersection with no Traffic Lights – are the same in all laps.)

Lap   Traffic   Habituation   Hurry   Frustration   Surprise   Scenario
1     No        Low           No      No            No         Drive along the route.
2     No        Low           No      No            No         Drive along the route.
3     Yes       High          No      No            No         Drive along the route.
4     Yes       High          No      No            No         Drive along the route.
5     No        High          Yes     No            No         Drive along the route and finish as quickly as possible.
6     Yes       High          Yes     Yes           No         Drive along the route and finish as quickly as possible.
7     No        High          No      No            Yes        Drive along the route.
levels of risk. For risky situations, prime events were short-listed by experts, including
roundabouts, left turns, extensive braking/acceleration, etc. As per the experts’
opinion, the events were defined based on the road infrastructure. To label the
acquired data, all the GPS coordinates were plotted and overlaid on the experimental
track to identify the specific GPS coordinates where an event could occur. Figure
D.3 illustrates the event extraction from GPS coordinates using overlaid scatter
plot. Considering the GPS coordinates within the red rectangles in Figure D.3 and
consulting with domain experts and psychologists the data points were complemented
with corresponding events. Figure D.4 illustrates the recorded GPS coordinates of a
single lap categorised on the basis of road infrastructure as events in different colours.
The extracted events are further discussed in Section D.3.1.
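A minimal sketch of this rectangle-based event labelling is given below; the bounding-box coordinates, column names and event names are illustrative assumptions, as the expert-defined areas are only shown graphically in Figure D.3.

```python
# Minimal sketch of the GPS-based event labelling described above, assuming the
# recorded samples are in a pandas DataFrame with 'longitude' and 'latitude'
# columns. Box values and event names are illustrative only.
import pandas as pd

EVENT_BOXES = {                      # (lon_min, lon_max, lat_min, lat_max) per event
    "roundabout": (480, 520, 760, 800),
    "left_turn": (490, 510, 740, 760),
    "signal_pedestrian_crossing": (540, 560, 715, 735),
}

def label_events(df: pd.DataFrame) -> pd.DataFrame:
    """Tag each GPS sample with the event whose rectangle contains it (else 'none')."""
    df = df.copy()
    df["event"] = "none"
    for name, (x0, x1, y0, y1) in EVENT_BOXES.items():
        inside = df["longitude"].between(x0, x1) & df["latitude"].between(y0, y1)
        df.loc[inside, "event"] = name
    return df
```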
Figure D.3: Event extraction using GPS coordinates. Red rectangles mark the
significant areas of events, e.g., roundabout, left turn, signal with pedestrian crossing
etc. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
Figure D.4: GPS coordinates of a single lap driving colour-coded with respect to
different road structures. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
pedal position and steering wheel angle. The average and standard deviation of
these measures were calculated within the start and end time of the events
annotated by the experts. These features were gathered in the feature list, including
the maximum value for speed only, resulting in 7 features. From the IMU, the parameters
for angular and linear acceleration were considered and 9 features were calculated.
All the features extracted from the vehicular signals are listed in Table D.2.
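A minimal sketch of the event-wise aggregation is shown below, assuming the vehicular signals are in a pandas DataFrame indexed by timestamp; the signal column names are illustrative, and the full feature list is the one given in Table D.2.

```python
# Minimal sketch of the vehicular feature extraction described above. Column
# names are illustrative; with three signals, mean + std per signal plus the
# maximum of speed yields the 7 features mentioned in the text.
import pandas as pd

SIGNALS = ["speed", "pedal_position", "steering_angle"]

def event_features(df: pd.DataFrame, start, end) -> dict:
    """Mean and standard deviation of each signal within one annotated event
    window, plus the maximum value for speed only."""
    window = df.loc[start:end, SIGNALS]
    feats = {}
    for col in SIGNALS:
        feats[f"{col}_mean"] = window[col].mean()
        feats[f"{col}_std"] = window[col].std()
    feats["speed_max"] = window["speed"].max()
    return feats

# Example: rows = [event_features(vehicle_df, s, e) for s, e in event_windows]
#          vehicular_feature_table = pd.DataFrame(rows)
```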
Table D.2: List of features extracted from vehicular signals. © 2023 by SCITEPRESS
(CC BY-NC-ND 4.0).
From the curated EEG signals, 14 frequency domain features were extracted
from the power spectral density values. At first, the Individual Alpha Frequency
(IAF) (Corcoran et al., 2018) values were estimated as the peak of the general alpha
rhythm frequency (8–12 Hz). Eventually, the average frequencies of the theta band
[IAF−6, IAF−2], alpha band [IAF−2, IAF+2] and beta band [IAF+2, IAF+18]
over all the aforementioned EEG channels were calculated. Next, the channels were
partitioned on the basis of frontal and parietal locations on the scalp. For alpha and
beta bands, frontal and parietal parts were again divided into two segments; upper
and lower. For each of the segments, the average values of the frequency bands
were considered as a feature, thus, obtaining a total of fourteen biometric features.
Table D.3 presents the list of the extracted biometric features that have been further
deployed in classification tasks.
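A minimal sketch of the IAF-anchored band features is given below, reading the averaged band values as band-averaged power spectral density; the channel grouping passed in is illustrative, and the exact fourteen-feature scheme is the one listed in Table D.3.

```python
# Minimal sketch of the frequency-domain EEG features described above, assuming
# each epoch is a (channels x samples) array at 256 Hz and the participant's IAF
# is known. The frontal/parietal upper/lower channel grouping is illustrative.
import numpy as np
from scipy.signal import welch

FS = 256

def band_power(epoch, band):
    """Band-averaged power spectral density per channel within a band (Hz)."""
    freqs, psd = welch(epoch, fs=FS, nperseg=FS * 2)        # 0.5 Hz resolution
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return psd[:, mask].mean(axis=1)                         # one value per channel

def iaf_band_features(epoch, iaf, channel_groups):
    """Theta/alpha/beta averages anchored to the individual alpha frequency (IAF)."""
    bands = {"theta": (iaf - 6, iaf - 2),
             "alpha": (iaf - 2, iaf + 2),
             "beta":  (iaf + 2, iaf + 18)}
    feats = {}
    for name, band in bands.items():
        per_channel = band_power(epoch, band)
        feats[f"{name}_all"] = per_channel.mean()            # average over all channels
        if name in ("alpha", "beta"):                        # segment-wise averages
            for seg, idx in channel_groups.items():
                feats[f"{name}_{seg}"] = per_channel[idx].mean()
    return feats
```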
Table D.3: List of biometric features considering different frequency bands of EEG
signal. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
Table D.4: Summary of the datasets from the simulator and track experiments for risk
and hurry classification. The values represent the number of instances for corresponding
labels of the classification tasks before applying SMOTE. © 2023 by SCITEPRESS
(CC BY-NC-ND 4.0).
                                Experiment
Classification   Label    Simulation   Track    Total
Risk             Yes             330     215      545
Risk             No              696     530     1226
Hurry            Yes             201      19      220
Hurry            No              825     726     1551
Total Instance                  1026     745     1771
Table D.5: Parameters used in tuning different AI/ML models for classifying risk and
hurry in driving behaviour with 5-fold cross validation. The parameters used for final
training are highlighted in bold font. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
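A minimal sketch of the 5-fold cross-validated tuning summarised in Table D.5 is shown below for the GBDT classifier only; the parameter grid and scoring choice are hypothetical, and the actual grids and selected values are those reported in the table.

```python
# Minimal sketch of 5-fold cross-validated parameter tuning for the GBDT model.
# The grid below is hypothetical; see Table D.5 for the parameters actually used.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
# search.fit(X_train, y_train)   # X_train/y_train: event-level feature vectors and labels
# gbdt = search.best_estimator_  # model later used for explanation generation
```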
as f that estimates the importance ω of each feature to the prediction. That is, for
a given classifier model c and a data point x_i, f(c, x_i) = ω ∈ R^m. Here, each ω_j
refers to the relative importance of feature j for the prediction c(x_i). Among the
feature attribution methods, SHapley Additive exPlanations (SHAP) (Lundberg &
Lee, 2017) and Local Interpretable Model-Agnostic Explanations (LIME) (Ribeiro et
al., 2016) are exploited in this work as being popular choices in present research
works (Islam et al., 2022). Both explanation models were built for the GBDT model
and the test set D_test to generate local and global explanations. TreeExplainer was
invoked for SHAP to complement the characteristics of GBDT, and LIME was trained
with default settings from the corresponding library.
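A minimal sketch of building the two explanation models for the trained GBDT classifier is given below; the variable and feature names are assumptions carried over from the training step, and LIME is used with the library defaults as stated above.

```python
# Minimal sketch of the SHAP and LIME explanation models described above.
# gbdt, X_train and X_test are assumed to come from the earlier training step.
import shap
from lime.lime_tabular import LimeTabularExplainer

# SHAP: TreeExplainer matches the tree-ensemble structure of GBDT.
tree_explainer = shap.TreeExplainer(gbdt)
shap_values = tree_explainer.shap_values(X_test)          # local attributions per instance
# shap.summary_plot(shap_values, X_test)                  # aggregated (global) view

# LIME: local surrogate explanations for single test instances.
lime_explainer = LimeTabularExplainer(X_train.values,
                                      feature_names=list(X_train.columns),
                                      class_names=["No", "Yes"],
                                      mode="classification")
lime_exp = lime_explainer.explain_instance(X_test.values[0], gbdt.predict_proba)
# lime_exp.as_list() gives (feature, weight) pairs for this single prediction.
```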
D.2.5 Evaluation
The evaluation of the presented work has been performed in two parts: evaluating
the performance of the classification models in classifying risk and hurry in drivers’
behaviour, and evaluating the feature attribution using SHAP and LIME to explain
the classification. The metrics used for both evaluations are briefly described in the
following subsections.
matrix, True Positive (TP) and False Negative (FN) are the numbers of correct and
wrong predictions respectively for the positive class, i.e., Yes (1). On the other hand,
False Positive (FP) and True Negative (TN) are the numbers of wrong and correct
predictions respectively for the negative class, i.e., No (0).
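For reference, the reported metrics follow directly from these counts; the short sketch below reproduces, for example, the combined-dataset GBDT row of Table D.6.

```python
# Minimal sketch: precision, recall, F1 and accuracy from the confusion-matrix
# counts defined above, with the positive class being the presence of risk/hurry.
def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = 100.0 * (tp + tn) / (tp + fn + fp + tn)   # reported as a percentage
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Example with the combined-dataset GBDT counts for risk classification (Table D.6):
# classification_metrics(229, 8, 30, 199)
#   -> precision ~0.884, recall ~0.966, F1 ~0.923, accuracy ~91.85
```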
Figure D.5: Confusion Matrix for both Risk and Hurry Classification. © 2023 by
SCITEPRESS (CC BY-NC-ND 4.0).
Figure D.6: Average driving velocity in different laps. The two-sided Wilcoxon
signed-rank test demonstrates a significant difference in the simulator and track driving
with t = 0.0, p = 0.0156. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
From both the analysis of driving velocity and accelerator pedal position, it was
evident that drivers tend to drive at a higher velocity and press the accelerator
pedal more in simulation tests than in track tests. This is plausibly an effect of
simulator bias. In naive terms, drivers do not experience the motion of the
vehicle or perceive the environment properly, e.g., the vibration of the vehicle,
Figure D.7: Average accelerator pedal position across all the laps and the two-sided
Wilcoxon signed-rank test demonstrate a significant difference in the simulator and
track driving with t = 0.0, p = 0.0156. © 2023 by SCITEPRESS (CC BY-NC-ND
4.0).
the effect of road structures, etc. The differences in the driving behaviour have
been properly addressed with the corresponding experts, and it is a work in progress
to reduce the simulation biases in future studies. Moreover, while deploying ML
algorithms to classify drivers’ behaviour, these characteristics from non-stationary
spatiotemporal data might lead to incorrect interpretations. To correctly assess the
effects or contribution of the heterogeneous features, two different methods of XAI
were evaluated and presented in Section D.3.3.
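A minimal sketch of the paired comparison reported in Figures D.6 and D.7 is given below; the per-lap values are illustrative, but with seven lap pairs whose differences all share one sign, the exact two-sided Wilcoxon p-value is 2/2^7 ≈ 0.0156, consistent with the reported result.

```python
# Minimal sketch of the two-sided Wilcoxon signed-rank test behind Figures D.6
# and D.7, assuming per-lap averages of velocity (or pedal position) for the
# simulator and track conditions. The numeric values below are illustrative only.
from scipy.stats import wilcoxon

simulator_laps = [54.2, 55.1, 53.8, 56.0, 58.3, 57.5, 55.9]   # 7 laps, simulator
track_laps     = [47.1, 48.0, 46.5, 49.2, 51.0, 50.3, 48.8]   # 7 laps, track

stat, p_value = wilcoxon(simulator_laps, track_laps, alternative="two-sided")
print(stat, p_value)   # statistic 0.0 and p ~= 0.0156 when all differences share a sign
```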
Figure D.8: GPS coordinates with varying driving velocity for a random participant
in laps 1 – 6. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
The driving velocity in each lap was also analysed based on different road
structures using scatter plots and heatmaps as illustrated in Figure D.8. In this
203
204
Table D.6: Performance measures of risky behaviour classification with the AI/ML models trained on the holdout test set of different
datasets. The best values for each metric and each dataset are highlighted in bold font. (Positive Class – Risk, Negative Class – No Risk ).
© 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
             Simulation Dataset                  Track Dataset                      Combined Dataset
Metrics      GBDT   LR     MLP    RF     SVM     GBDT   LR     MLP    RF     SVM     GBDT   LR     MLP    RF     SVM
TP           105    82     86     23     100     106    88     103    56     105     229    186    226    233    228
FN           15     38     34     97     20      0      18     3      50     1       8      51     11     4      9
FP           16     26     42     0      14      3      24     5      0      4       30     62     45     26     23
TN           112    102    86     128    114     109    88     107    112    108     199    167    184    203    206
Precision    0.868  0.759  0.672  1.0    0.877   0.972  0.786  0.954  1.0    0.963   0.884  0.75   0.834  0.900  0.908
Recall       0.875  0.683  0.717  0.192  0.833   1.0    0.830  0.972  0.528  0.991   0.966  0.785  0.954  0.983  0.962
F1 score     0.871  0.719  0.694  0.322  0.855   0.986  0.807  0.963  0.691  0.977   0.923  0.767  0.89   0.940  0.934
Accuracy     87.50  74.19  69.36  60.89  86.29   98.62  80.73  96.33  77.06  97.71   91.85  75.75  87.98  93.56  93.13
Table D.7: Performance measures of hurry classification with the AI/ML models trained on the holdout test set of different datasets. The
best values for each metric and each dataset are highlighted in bold font. (Positive Class – Hurry, Negative Class – No Hurry). © 2023 by
SCITEPRESS (CC BY-NC-ND 4.0).
             Simulation Dataset                  Track Dataset                      Combined Dataset
Metrics      GBDT   LR     MLP    RF     SVM     GBDT   LR     MLP    RF     SVM     GBDT   LR     MLP    RF     SVM
TP           92     90     61     110    84      70     66     56     81     68      145    130    137    143    149
FN           18     20     49     0      26      11     15     25     0      13      25     40     33     27     21
FP           8      22     25     90     10      13     25     18     59     9       24     75     41     31     33
TN           91     77     74     9      89      65     53     60     19     69      174    123    157    167    165
Precision    0.920  0.804  0.709  0.550  0.894   0.843  0.725  0.757  0.579  0.883   0.858  0.634  0.770  0.822  0.819
Recall       0.836  0.818  0.555  1.0    0.764   0.864  0.815  0.691  1.0    0.840   0.853  0.765  0.806  0.841  0.876
F1 score     0.876  0.811  0.622  0.710  0.824   0.854  0.767  0.723  0.733  0.861   0.855  0.693  0.787  0.831  0.847
Accuracy     87.56  79.90  64.59  56.94  82.78   84.91  74.84  72.96  62.89  86.16   86.69  68.75  79.89  84.23  85.33
In this analysis, the seventh lap was excluded because of the presence of a surprise
event, which reduced the data available from driving the full lap. The pattern of driving velocity in laps 1
– 3 (Figure D.8a – D.8c) was found to be identical. The variation increased in laps 4
– 6 (Figure D.8d – D.8f) when several variables were added to the lap scenarios. The
illustrated driving patterns were cross-checked with psychologists’ assessments of the
participants and their conclusive drivers’ rules of behaviour. For example, on a left
turn, the behaviour of drivers can be stated as – “if the road is one carriageway, then
you have to gradually move on the left and look for cars coming from the opposite
direction before turning left”. In all the sub-figures of Figure D.8, it can be observed
that, at the left turn near longitude 500 and latitude 750, the driver slowed down to
examine oncoming vehicles and moved towards the left before the turn, as the road was
single carriageway by design. Another major observation can be found in lap 6 at
the lower middle of the circuit near longitude 550 and latitude 725 (Figure D.8f).
There was a signal with a pedestrian crossing and the driving velocity was close to
zero, which indicates that the stop signal was lit or a pedestrian was crossing and
the driver responded to the signal. Thus, drivers’ behaviours at different events in
terms of road infrastructures were analysed and the observations were put forward
to respective experts for enhancing the quality of the agents in future simulators.
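The lap-wise inspection summarised above could be reproduced, in spirit, with scatter plots
of the recorded coordinates coloured by driving velocity. The sketch below is a rough
illustration that assumes a pandas DataFrame with hypothetical columns longitude, latitude,
velocity and lap; the actual data schema and plotting details of Figure D.8 may differ.

# Minimal sketch (assumed schema): one scatter panel per lap, points coloured by velocity.
import pandas as pd
import matplotlib.pyplot as plt

def plot_velocity_by_lap(df: pd.DataFrame, laps=range(1, 7)) -> None:
    fig, axes = plt.subplots(2, 3, figsize=(12, 7), sharex=True, sharey=True)
    for ax, lap in zip(axes.ravel(), laps):
        lap_df = df[df["lap"] == lap]
        sc = ax.scatter(lap_df["longitude"], lap_df["latitude"],
                        c=lap_df["velocity"], cmap="viridis", s=4)
        ax.set_title(f"Lap {lap}")
    fig.colorbar(sc, ax=axes, label="velocity")
    fig.supxlabel("longitude")
    fig.supylabel("latitude")
    plt.show()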
D.3.2 Classification
The classification of drivers’ behaviour was performed in two parts: risk and hurry. It
is arguable that hurried driving can induce risk. On the contrary, hurriedness is
often observed among drivers who drive safely. Driving safely refers to specific
behaviours, an example of which is stated in Section D.3.1. Based on the drivers’ rules of
behaviour proposed by the experts, classifying risk and hurry are considered separate
tasks. The performance of the trained models on the holdout datasets for risk and
hurry classification is presented in Tables D.6 and D.7, respectively. In both tasks,
GBDT apparently excelled over the other models. However, for all the datasets in both
tasks, the simpler of the investigated models produced better performance. The
use of precision and recall was justified by the nature of the classification tasks, which
concentrate the measures on classifying the positive class. In this work, the
positive class was set to be the presence of risk and hurry in drivers’ behaviour,
since detecting their presence is more important than classifying their absence. One
notable behaviour was observed for RF: it performed poorly when used on the simulation and
track datasets separately, but on the combined dataset it produced the best result for risk
classification. In the case of hurry classification, this behaviour was quite different.
Due to this fluctuation in performance across datasets and tasks, RF
was not utilized further to develop the explanation models.
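To make the metrics reported in Tables D.6 and D.7 concrete, the hedged sketch below
evaluates a classifier on a holdout split with scikit-learn, treating the presence of risk (or
hurry) as the positive class. Synthetic data and scikit-learn’s GradientBoostingClassifier
stand in for the driving features and the GBDT model used in this paper.

# Minimal sketch (synthetic data): holdout evaluation with TP/FN/FP/TN,
# precision, recall, F1 score and accuracy; positive class = 1 (risk/hurry present).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"Precision = {precision_score(y_test, y_pred):.3f}")
print(f"Recall    = {recall_score(y_test, y_pred):.3f}")
print(f"F1 score  = {f1_score(y_test, y_pred):.3f}")
print(f"Accuracy  = {100 * accuracy_score(y_test, y_pred):.2f}%")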
Table D.8 presents the best classifier for both risk and hurry classification across
the three datasets. It is observed that, overall, GBDT performed better in every
combination, which led to its use in the explanation generation. Moreover, to
capture all the characteristics of the data in the explanation model, only the
combined dataset was used further.
D.3.3 Explanation
Considering the prediction performance of GBDT across datasets and classification
tasks, the explanation models SHAP and LIME were built to explain individual
predictions, i.e., local explanations. While explaining a single instance of prediction
from c, both models mimic the inference mechanism of c to predict the instance within
their own frameworks. The prediction performance of the explanation models was measured
with the local accuracy described in Section D.2.5.2, and the values are presented in
Table D.9. It was observed that, for both classification tasks, SHAP achieved higher
accuracy than LIME. Moreover, LIME performed very poorly in local predictions
for risk classification. However, both explanation models performed comparatively
poorly for hurry classification.
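As a rough, assumption-laden sketch of how such local explanations can be produced for a
tree-based classifier, the snippet below obtains per-instance attributions with the shap and
lime packages. Synthetic data, generic feature names and the model choice are illustrative
only, and the local-accuracy computation of Section D.2.5.2 is not reproduced here.

# Minimal sketch (synthetic data): SHAP and LIME attributions for one test instance
# of a fitted gradient-boosting classifier.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
instance = X_test[0]

# SHAP: tree-based Shapley value attributions for the chosen instance.
shap_explainer = shap.TreeExplainer(model)
shap_values = shap_explainer.shap_values(instance.reshape(1, -1))
print("SHAP attributions:", np.round(np.ravel(shap_values), 3))

# LIME: weights of a local surrogate model fitted around the chosen instance.
lime_explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                      class_names=["No Risk", "Risk"],
                                      mode="classification")
lime_exp = lime_explainer.explain_instance(instance, model.predict_proba,
                                           num_features=len(feature_names))
print("LIME attributions:", lime_exp.as_list())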
Table D.9: Pairwise comparison of performance metrics for SHAP and LIME on
the combined X_test (holdout test set) for risk and hurry. For all the metrics, higher values
are better and highlighted in bold font. All the values for ρ are statistically significant
since p < 0.05. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).

                 Risk                  Hurry
Metrics      SHAP      LIME        SHAP      LIME
Accuracy     92.59%    52.98%      84.32%    70.06%
nDCG_all     0.9561    0.8758      0.9588    0.9183
nDCG_ind     0.8717    0.8589      0.8671    0.8524
ρ            0.7664    0.5310      0.7059    0.4772
p            7.91e−7   2.53e−3     1.31e−5   7.67e−3
roundabout exits with pedestrian crossing, manoeuvring after a left turn, etc. In the
other classification task for hurry, the standard deviation of the accelerator pedal
position corresponds to frequent pressing of the pedal with varying intensity,
which is plausibly an indication of hurry. Here, the concerned events are similar to
the events mentioned for risk.
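The hurry-related feature just described could be derived as in the following minimal sketch,
which computes the standard deviation of the accelerator pedal position per driver and lap
from a DataFrame with hypothetical column names; the windowing actually used in the study
is not specified here, so a per-lap grouping is assumed.

# Minimal sketch (assumed schema): spread of the accelerator pedal signal per driver
# and lap; a large spread reflects frequent pressing with varying intensity.
import pandas as pd

def pedal_std_feature(df: pd.DataFrame) -> pd.DataFrame:
    return (df.groupby(["driver", "lap"])["accel_pedal"]
              .std()
              .rename("accel_pedal_std")
              .reset_index())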
Figure D.9: Feature importance values are extracted from GBDT, SHAP & LIME,
normalized and illustrated with horizontal bar charts for the corresponding classification
tasks. The order of the features based on the importance values is presented in tables
on either side of the charts. Features with the same order across methods are highlighted
in the order tables. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
Several similar ranks of the features based on their contributions from both SHAP
and LIME motivated a comparison using nDCG scores, which measure the similarity of
retrieved information. In this work, the retrieved information is the order of the features
according to their importance values or contributions to the prediction. The nDCG
scores were computed for all the instances together and also computed for individual
predictions and then averaged. The rank of the features based on the normalized feature
importance from the base model GBDT was used as the reference while calculating
the nDCG scores, to assess how similar the explanations are to the classifier model. As with local
accuracy, SHAP produced better results than LIME in terms of nDCG score. To
investigate further, ρ was computed under the null hypothesis that ‘the ranks of the features
in the different methods are different’. However, based on the test results, the hypothesis was
rejected, as all the measurements came out to be statistically significant with p values
lower than 0.05. All the values of the nDCG score and ρ are reported in Table D.9.
Another noteworthy aspect was observed from the metrics evaluating the explanation
models: SHAP produced better results for risk classification, whereas the performance
of LIME was better for hurry classification. The performance of SHAP complements
the performance summary of the classification models presented in Table D.8, where
risk classification had better performance than hurry classification. It is also plausible
that, if the local accuracy of an explanation model is better, the rankings of the
attributed features are also more relevant, which is evident in the corresponding nDCG
scores and ρ values.
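A hedged sketch of this rank comparison is given below: normalized feature importances
from the classifier serve as the reference relevance, the attribution magnitudes from an
explanation method act as the predicted scores for scikit-learn’s ndcg_score, and SciPy’s
spearmanr yields ρ together with its p value. The importance vectors are placeholders, and
the per-instance aggregation used in this paper may differ from this single-vector example.

# Minimal sketch (placeholder values): nDCG and Spearman's rho between the
# classifier's feature importances and an explanation method's attributions.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import ndcg_score

gbdt_importance = np.array([0.35, 0.25, 0.20, 0.15, 0.05])  # reference (hypothetical)
shap_importance = np.array([0.30, 0.28, 0.22, 0.10, 0.10])  # e.g. mean |SHAP| (hypothetical)

# nDCG: how well the explanation's ordering retrieves the features that the
# reference model itself deems important.
ndcg = ndcg_score(gbdt_importance.reshape(1, -1), shap_importance.reshape(1, -1))

# Spearman's rank correlation between the two importance vectors, with its p value.
rho, p_value = spearmanr(gbdt_importance, shap_importance)

print(f"nDCG = {ndcg:.4f}, rho = {rho:.4f}, p = {p_value:.3e}")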
Figure D.10: Low fidelity prototype of proposed drivers’ behaviour monitoring system
for simulated driving. © 2023 by SCITEPRESS (CC BY-NC-ND 4.0).
on the prediction and convey specific instructions to modify the drivers’ behaviour
to make their driving safer.
The work presented in this paper can be summarised in three aspects: i) comparative
analysis of car drivers’ behaviour in the simulator and track driving for different
traffic situations, ii) development of classifier models to detect risk or hurry in
drivers’ behaviour, and iii) explaining the risk and hurry classification with feature
attribution techniques, together with a proposed system for drivers’ behaviour monitoring in
simulated driving. The first outcome is found to be a novel analysis that includes
experimentation with simulation and track driving. The second and third outcomes
can be concurrently utilised in enhancing the simulator techniques to train road users
for a safer traffic environment through the functional development of the proposed
drivers’ behaviour monitoring system.
The outcome of this study is encouraging with respect to explanation methods, which
require further research. The lack of prescribed evaluation metrics in the literature
led to the use of metrics borrowed from different concepts. However, the results showed
promising possibilities to enhance and modify them in future work on the evaluation of
explanation methods. Another possible research direction would be to improve the
feature attribution methods to produce more insightful explanations.
Bibliography
Antwarg, L., Miller, R. M., Shapira, B., & Rokach, L. (2021). Explaining Anomalies
Detected by Autoencoders using Shapley Additive Explanations. Expert
Systems with Applications, 186, 115736.
Barua, S., Ahmed, M. U., Ahlstrom, C., Begum, S., & Funk, P. (2018). Automated
EEG Artifact Handling With Application in Driver Monitoring. IEEE
Journal of Biomedical and Health Informatics, 22 (5), 1350–1361.
Busa-Fekete, R., Szarvas, G., Élteto, T., & Kégl, B. (2012). An Apple-to-apple
Comparison of Learning-to-rank Algorithms in terms of Normalized
Discounted Cumulative Gain. Proceedings of the Workshop on Preference
Learning: Problems and Applications in AI co-located with the 20th European
Conference on Artificial Intelligence (ECAI), 242.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial
Intelligence Research, 16, 321–357.
Corcoran, A. W., Alday, P. M., Schlesewsky, M., & Bornkessel-Schlesewsky, I. (2018).
Toward a Reliable, Automated Method of Individual Alpha Frequency (IAF)
Quantification. Psychophysiology, 55 (7), e13064.
Letzgus, S., Wagner, P., Lederer, J., Samek, W., Muller, K.-R., & Montavon, G.
(2022). Toward Explainable Artificial Intelligence for Regression Models: A
Methodological Perspective. IEEE Signal Processing Magazine, 39 (4), 40–58.
Leyli abadi, M., & Boubezoul, A. (2021). Deep Neural Networks for Classification of
Riding Patterns: With a Focus on Explainability. Proceedings of the European
Symposium on Artificial Neural Networks, Computational Intelligence and
Machine Learning (ESANN), 481–486.
Liu, Y., Khandagale, S., White, C., & Neiswanger, W. (2021).
Synthetic Benchmarks for Scientific Research in Explainable Machine
Learning. In J. Vanschoren & S. Yeung (Eds.), Proceedings of the Neural
Information Processing Systems - Track on Datasets and Benchmarks
(NeurIPS Datasets and Benchmarks).
Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model
Predictions. Proceedings of the 31st International Conference on Neural
Information Processing Systems (NeurIPS), 4768–4777.
Oostenveld, R., Fries, P., Maris, E., & Schoffelen, J.-M. (2010). FieldTrip: Open
Source Software for Advanced Analysis of MEG, EEG, and Invasive
Electrophysiological Data. Computational Intelligence and Neuroscience,
2011, e156869.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011).
Scikit-learn: Machine Learning in Python. Journal of Machine Learning
Research, 12 (85), 2825–2830.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”:
Explaining the Predictions of Any Classifier. Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), 1135–1144.
Sætren, G. B., Lindheim, C., Skogstad, M. R., Andreas Pedersen, P., Robertsen,
R., Lødemel, S., & Haukeberg, P. J. (2019). Simulator versus Traditional
Training: A Comparative Study of Night Driving Training. Proceedings of the
Human Factors and Ergonomics Society Annual Meeting, 63 (1), 1669–1673.
Serradilla, O., Zugasti, E., Ramirez de Okariz, J., Rodriguez, J., & Zurutuza, U.
(2021). Adaptable and Explainable Predictive Maintenance: Semi-Supervised
Deep Learning for Anomaly Detection and Diagnosis in Press Machine Data.
Applied Sciences, 11 (16), 7376.
Sokolova, M., & Lapalme, G. (2009). A Systematic Analysis of Performance Measures
for Classification Tasks. Information Processing & Management, 45 (4),
427–437.
Voigt, P., & Von Dem Bussche, A. (2017). The EU General Data Protection
Regulation (GDPR) - A Practical Guide. Springer International Publishing.
Wilcoxon, F. (1992). Individual Comparisons by Ranking Methods. In S. Kotz &
N. L. Johnson (Eds.), Breakthroughs in Statistics (pp. 196–202). Springer
New York.
Wu, S.-L., Tung, H.-Y., & Hsu, Y.-L. (2020). Deep Learning for Automatic Quality
Grading of Mangoes: Methods and Insights. 2020 19th IEEE International
Conference on Machine Learning and Applications (ICMLA), 446–453.