1 Federal Institute of Pará - IFPA, Ananindeua, 67125-000, Brazil
2 Federal University of Pará - UFPA, Belém, 66075-10, Brazil
3 Vale Institute of Technology - ITV, Belém, 66055-090, Brazil
E-mail: jose.ribeiro@ifpa.edu.br, lucas.cardoso@icen.ufpa.br, vitor.cirilo.santos@itv.org, eduardo.costa.carvalho@itv.org, nikolas.carneiro@itv.org, ronnie.alves@itv.org

How Reliable and Stable are Explanations of XAI Methods?

José Ribeiro (1,2,3; ORCID 0000-0002-8836-4188), Lucas Cardoso (2,3; ORCID 0000-0003-3838-3214), Vitor Santos (3; ORCID 0000-0002-7960-3079), Eduardo Carvalho (3; ORCID 0000-0001-9999-4313), Níkolas Carneiro (3; ORCID 0000-0002-5097-0772), Ronnie Alves (2,3; ORCID 0000-0003-4139-0562)
Abstract

Black box models are increasingly used in people’s daily lives. Along with this increase, Explainable Artificial Intelligence (XAI) methods have emerged that aim to generate additional explanations of how a model makes certain predictions. In this sense, methods such as Dalex, Eli5, eXirt, Lofo and Shap emerged as different proposals and methodologies for generating explanations of black box models in an agnostic way. Along with the emergence of these methods, questions arise such as “How Reliable and Stable are XAI Methods?”. To shed light on this main question, this research creates a pipeline that performs experiments using the diabetes dataset and four different machine learning models (LGBM, MLP, DT and KNN), creates different levels of perturbation of the test data, and finally generates explanations from the eXirt method regarding the confidence of the models, as well as feature relevance ranks from all the XAI methods mentioned, in order to measure their stability in the face of perturbations. As a result, it was found that eXirt was able to identify the most reliable models among all those used. It was also found that current XAI methods are sensitive to perturbations, with the exception of one specific method.

Keywords:
Explainable Artificial Intelligence · Reliable · Item Response Theory · Machine Learning.

1 Introduction

Technology is evolving on many fronts, and today Artificial Intelligence is a reality in the everyday life of our society. There are many real-world problems that machine learning algorithms seek to solve, making human life more automated, intelligent and less complicated [14].

Black-box models are driven by algorithms that, despite normally presenting high performance scores when faced with classification or regression problems, are not capable of self-explaining their predictions. Transparent models are algorithms that, despite normally presenting lower performance scores when faced with classification and regression problems, are capable of self-explaining their predictions through visual structures created by themselves [22].

With the growing need for models that present high performance (which implies low transparency [3]), even in sensitive contexts, there is an increasing demand for methods or tools that can provide local explanations (explanations of feature relevance generated around each data instance) and global explanations (which make it possible to understand the logic of the model across all instances) as a means of making predictions more easily interpretable and also more reliable for humans [15].

The terms “feature relevance ranking” and “feature importance ranking” are widely used as synonyms in the computing community, but have different definitions herein, as shown in [3]. Feature rankings are regarded as ordered structures in which each feature of the dataset used by the model appears in a position indicated by a score. The main difference is that, in a relevance ranking, the score is calculated from the model output, whereas in an importance ranking it is calculated using the correct label to be predicted [3].

In this context, methods like eXirt [31], Dalex [7], Eli5 [18], Lofo [29], Shap [21] and Skater [24] emerged to promote the creation of model-agnostic explanations (the method does not depend on the type of model to be explained) and model-specific explanations (the method depends on a specific type of model) [17].

The eXirt method, recently developed and published in [31], is based on Item Response Theory (IRT) and its properties. It generates model explanations through feature relevance ranking and information based on IRT properties, allowing a human individual to gain confidence in the model.

Defending the idea of an XAI method, such as eXirt, that is capable of explaining machine learning models while also generating evidence regarding the reliability of these models is not an easy task. Thus, this article sheds light on how the IRT used by eXirt is capable of indicating whether specific models are reliable or not, and also shows how stable current XAI methods are. It seeks to answer two main hypotheses: “Are the most reliable models, according to the eXirt method, those being less affected by data perturbations?” and “Do existing XAI methods generate stable explanations even after data perturbations?”.

To answer the questions above, this research uses the diabetes dataset [30] to build 4 different types of machine learning models, makes predictions with these models using test data with and without perturbations, creates explanations of the models, and finally analyzes the feature relevance ranks of the models and the parameters that indicate which models are more reliable, the latter analysis being generated exclusively by eXirt.

The main contributions of this article to the computing community focused on machine learning are: evaluating the IRT methodology as a reliable strategy for XAI methods; a benchmark comparison of state-of-the-art XAI methods exploring the stability of model explanations; and providing a visual tool to explore the reliability of XAI methods through IRT’s Item Characteristic Curves (ICC).

2 Related works

This study conducted a literature review, aiming to identify research that proposes existing XAI methods. This allowed for the identification of the main XAI techniques specifically designed to generate global feature relevance rankings, both in a model-agnostic and model-specific manner, applicable to tabular data.

As a result, a total of six XAI methods were found to be properly validated and compatible with one another (at library and code execution dependencies level). These methods include: eXirt [31], Dalex [6], Eli5 [18], Lofo [29], SHAP [21] and Skater [24].

This survey found other tools aimed at model explanation, including: Alibi [2], CIU [13], Lime [26], IBM Explainable AI 360 [5], Anchor [27], Attention [20] and Interpreter ML [23]. However, due to incompatibilities and technical problems, they were not used in this research.

The primary issues and incompatibilities identified were: absence of global rank generation; rank generation dependent on another existing XAI method within the pipeline; incompatibility with pipeline dependencies at the library version level; and outdated method libraries.

Note that the six methods presented herein generate relevance rankings based on the same previously trained machine learning models (with the same training and testing split), manipulate their inputs and/or produce new intermediate copies of the models. Therefore, they are required to be compatible with each other so that a fair comparison of their final rankings of explanations can be made. Table 1 shows a general comparison between the techniques found during the bibliographic research.

Table 1: Overall view of XAI methods

| Name | Base algorithm | Explanation technique | Global explanation (by rank) | Local explanation | Model Specific or Agnostic? | Compatible? |
|---|---|---|---|---|---|---|
| Alibi | Out-of-bag error | Feature Permutation and accuracy and f1 | Yes | Yes | Agnostic | No |
| Anchor | If-Then Rules | Rules | No | Yes | Agnostic | No |
| Attention | Structured Self-attentive embedding | Multiple Vector Representations | No | Yes | Specific | No |
| CIU | Decision Theory | Feature Permutation and Multiple Criteria Decision Making | Yes (deprecated) | No | Agnostic | No |
| Dalex | Leave-one covariate out | Feature Permutation | Yes | Yes | Agnostic | Yes |
| Eli5 | Assigning weights to decisions | Feature Permutation and Mean Decrease Accuracy | Yes | Yes | Specific | Yes |
| eXirt | Item Response Theory | Feature Permutation and Model Ability | Yes | Yes | Specific | Yes |
| IBM Explainable AI 360 | Same as Shap | Same as Shap | Yes | Yes | Specific | No |
| Interpreter ML | Same as Lime and Shap | Same as Lime and Shap | Yes | Yes | Specific and Agnostic | No |
| Lime | Local linear approximation | Perturbation of the Instance | No | Yes | Agnostic | No |
| Lofo | Leave One Feature Out | Feature Permutation | Yes | No | Specific | Yes |
| Shap (Kernel) | Game Theory | Feature Permutation | Yes | Yes | Agnostic | Yes |
| Skater | Information Theory | Feature Relevance | Yes | Yes | Agnostic | Yes |

In previous studies, this research used CIU in its tests. However, when carrying out this study, it was found that its creators had updated their libraries and that the method apparently no longer generates feature relevance rankings as before, which is why it appears as incompatible. Table 1 also shows that most existing XAI methods use the “Feature Permutation” technique to perform the model explanation process. It should be emphasized, however, that eXirt differs from the other methods by being based on IRT.

Note that, although eXirt is listed as a model-specific method in table 1, in the research described here it will be tested with models that go beyond tree ensembles, given that its architecture is generalist, as are the two other model-agnostic methods, as mentioned in [31].

3 Background

3.1 Explainable Artificial Intelligence

Tools like Dalex, Eli5, eXirt, Lofo, SHAP, and Skater can generate various types of explanations. However, for quantitative comparisons, only their ranking generation processes are described. To clarify how each method generates feature relevance explanations, their basic operations are detailed below.

eXirt is the newest XAI method aimed at creating model explanations from feature relevance ranks based on Item Response Theory. This theory is the same one used to evaluate candidates taking tests such as Brazil’s National Secondary Education Examination (ENEM, in Portuguese). As a quick analogy, in the eXirt method the dataset is considered an exam, the model is a candidate taking the exam, the instances of the dataset are the questions, the feature values of the instances are the question statements and the target value is the answer to each question. Based on IRT, eXirt can evaluate candidates or models from 3 different perspectives: difficulty, discrimination and guessing, generating feature relevance rankings [31].

Dalex is a set of XAI tools based on the LOCO (Leave One Covariate Out) approach. It receives the model and data, calculates model performance, performs new training with modified datasets, iteratively inverts each feature, and evaluates model performance based on these inversions to identify important features [7].

Leave One Feature Out (Lofo) is similar to Dalex but removes features iteratively instead of inverting them. It evaluates model performance with all features, removes one feature at a time, retrains the model, and assesses performance on a validation dataset, reporting the mean and standard deviation of each feature’s relevance [29].

Eli5 uses the Mean Decrease Accuracy algorithm to rank feature relevance by measuring performance decline when a feature is removed from the test dataset [18].
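The shuffle-and-rescore idea behind permutation-based relevance can be sketched in plain Python. This is a minimal illustration, not Eli5's implementation: it assumes that shuffling a column stands in for removing the feature, and the `predict` function, the toy data and the name `permutation_relevance` are hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_relevance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    n_features = len(X[0])
    drops = [0.0] * n_features
    for j in range(n_features):
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j replaced by its shuffled copy
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops[j] += baseline - accuracy(predict, X_perm, y)
    return [d / n_repeats for d in drops]

# Toy data: the label depends only on feature 0; feature 1 is pure noise.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
y = [int(row[0] > 0) for row in X]

def predict(row):
    """Hypothetical stand-in for a trained model."""
    return int(row[0] > 0)

scores = permutation_relevance(predict, X, y)
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])  # most relevant first
```

Sorting the features by their mean drop yields the kind of global relevance rank compared in this study; a feature the model ignores receives a drop of zero.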

SHapley Additive exPlanations (SHAP) explains a prediction by calculating the contribution of each feature, based on game theory’s Shapley Value. Features are iteratively included and excluded from models to compute their Shapley Values, generating a relevance ranking [21].
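To make the Shapley Value concrete, the sketch below enumerates every coalition of a three-feature toy game. It is an illustration under stated assumptions, not the SHAP library: the feature names and the payoffs in `value` are invented, and practical explainers approximate this sum rather than enumerating it.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions (feasible only for few players)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def value(coalition):
    """Hypothetical payoff: gain in model score when these features are available."""
    v = 0.0
    if "glucose" in coalition:
        v += 0.10
    if "bmi" in coalition:
        v += 0.04
    if "glucose" in coalition and "age" in coalition:
        v += 0.02  # interaction term, shared between the two features
    return v

features = ["glucose", "bmi", "age"]
phi = shapley_values(features, value)
ranking = sorted(features, key=lambda f: -phi[f])
```

The contributions add up to the payoff of the full feature set, which is the property that lets them be read as a decomposition of the model output.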

Skater calculates feature relevance based on Information Theory, measuring entropy changes in predictions when a feature is perturbed. Although now closed source, it remains popular in the XAI community [24].

3.2 Item Response Theory

Item Response Theory (IRT), part of Psychometrics, provides mathematical models to estimate latent traits, relating the probability of a specific response to the characteristics of the items evaluated. Traditional assessment methods measure performance by the total number of correct answers, but have limitations, such as dealing with random answers and evaluating the difficulty of each test question [1].

Unlike traditional assessments, IRT focuses on test items, evaluating performance based on the ability to get specific items correct, not just the total count of correct answers [16]. IRT seeks to evaluate unobservable latent characteristics of an individual, relating the probability of a correct answer to their latent traits, that is, to the individual’s ability in the area of knowledge evaluated [10].

In summary, IRT consists of mathematical models that represent the probability of an individual getting an item correct, considering item parameters and the respondent’s ability. Different implementations of IRT exist in the literature, such as the “Rasch Dichotomous Model” [19] and the “Birnbaum Three-Dimensional Model” [8] (the latter used here). The two topics below describe the main processes of this theory.

3.2.1 Estimation of Item Parameters.

This process involves the estimation of discrimination, difficulty, and guessing based on the 3PL model, using techniques such as Maximum Likelihood Estimation (MLE). The objective is to find the parameter values that maximize the probability of observing individuals’ actual responses to the items. The parameters are:

  • Discrimination: how well a specific item $i$ differentiates between highly and poorly skilled respondents. The higher its value, the more discriminative the item is. Ideally, a test should feature a gradual and positive discrimination;

  • Difficulty: how hard a specific item $i$ is for respondents to answer correctly. Higher difficulty values represent items that are more difficult to answer;

  • Guessing: the probability that a respondent gets a specific item $i$ right by chance. It can also be understood as the probability that a respondent with low ability will get the item right. It is also the smallest possible chance that an item will be answered correctly, regardless of the estimated ability of the respondent.

3.2.2 Estimation of ability.

This process is represented by the logistic 3PL model, presented in equation 1. It is a model capable of evaluating the respondents of a test from their estimated ability ($\theta_j$), together with the probability of a correct answer $P(U_{ij}=1 \mid \theta_j)$, calculated as a function of the skill of individual $j$ and the parameters of item $i$.

$$P(U_{ij}=1 \mid \theta_j) = c_i + (1-c_i)\,\frac{1}{1+e^{-a_i(\theta_j-b_i)}} \qquad (1)$$

The 3PL is used to model the relationship between individuals’ ability and the likelihood of correctly answering an item on a test. It assumes that the probability of a correct answer depends on three item parameters: item discrimination, item difficulty, and item guessing.
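Equation 1 translates directly into a few lines of Python. The sketch below is only illustrative: the function name `icc_3pl` is an assumption, and the parameter values are the averages this paper reports later for the unperturbed LGBM model (discrimination 1.54, difficulty -2.18, guessing 0.14).

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (equation 1).

    theta: ability of the respondent; a: discrimination;
    b: difficulty; c: guessing.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Average item parameters reported for the unperturbed LGBM model
a, b, c = 1.54, -2.18, 0.14

# Evaluate the curve on a small grid of abilities
curve = [(theta, icc_3pl(theta, a, b, c)) for theta in (-6.0, -4.0, -2.0, 0.0, 2.0)]
for theta, p in curve:
    print(f"theta={theta:+.1f}  P(correct)={p:.3f}")
```

Evaluating the function over a grid of abilities, as in the loop above, is all that is needed to draw an Item Characteristic Curve.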

In equation 1, the discrimination, difficulty and guessing properties of item $i$ are represented by $a_i$, $b_i$ and $c_i$, respectively. $\theta_j$ is the ability of individual $j$, a continuous parameter representing the latent trait being measured. $P(U_{ij}=1 \mid \theta_j)$ represents the probability that individual $j$, with ability $\theta_j$, answers item $i$ correctly.

Thus, once the item parameters are estimated and the hit probability is calculated using equation 1, the Item Characteristic Curve (ICC) can be obtained. The ICC defines the behavior of an item’s hit probability curve according to the parameters describing the item ($a_i$, $b_i$ and $c_i$) and the variation in the respondents’ skill, figure 1.

Figure 1: The Item Characteristic Curve (ICC); the letters $a$, $b$ and $c$ represent the discrimination, difficulty and guessing properties, respectively.

As can be seen in figure 1, the hit probability on the $y$ axis is calculated from the values of the properties $a_i$, $b_i$ and $c_i$ of an item and the variation of the skill $\theta$. Thus, the property $a_i$ (discrimination) is responsible for the slope of the logistic curve; the property $b_i$ (difficulty) positions the curve along the skill axis; and the property $c_i$ (guessing) sets the base of the logistic function relative to the $y$ axis.
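These three roles can be checked directly from equation 1. As a short worked derivation, assuming $a_i > 0$:

$$
\begin{aligned}
\lim_{\theta_j \to -\infty} P(U_{ij}=1 \mid \theta_j) &= c_i, \qquad \lim_{\theta_j \to +\infty} P(U_{ij}=1 \mid \theta_j) = 1,\\
P(U_{ij}=1 \mid \theta_j = b_i) &= c_i + \frac{1-c_i}{2} = \frac{1+c_i}{2},\\
\left.\frac{\partial P}{\partial \theta_j}\right|_{\theta_j = b_i} &= \frac{a_i\,(1-c_i)}{4}.
\end{aligned}
$$

The lower asymptote is therefore the guessing value, the curve reaches the midpoint between $c_i$ and 1 exactly where the skill equals the difficulty, and the slope at that point grows linearly with the discrimination.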

These are the basic foundations of IRT that allow a better interpretation of the explanations generated by the XAI method eXirt. More information about the theory can be found in [31].

3.2.3 Model confidence from IRT.

According to the conceptual bases of IRT [10, 9, 1], a model is understood as reliable if it presents: discrimination with higher values (or perfect discrimination), since high discrimination means that the model has greater probabilities of success even using little skill (note that a negative discrimination means a problem in the analyzed instance, indicating low confidence in the prediction); difficulty with lower values, since the lower the difficulty, the better and more reliable the model is; guessing with lower values, since a model that gets few predictions right at random is more reliable; and skill with higher values, since the higher the model’s skill, the more reliable it is.

4 Methodology

To respond to the hypotheses initially raised, a pipeline was developed containing the analyses shown in the diagram in figure 2.

The pipeline starts in figure 2 (A), where a dataset relating to a real-world problem (diabetes [30]) in a sensitive context (health) was selected and framed as binary classification (to simplify the analyses). This dataset was standardized (using z-score) and divided into training and test sets in the proportion 70%/30%.
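The preparation step described above can be sketched with the standard library alone. This is a minimal sketch under assumptions, not the code from the official repository: the function names, the random seed and the synthetic rows standing in for the 768 diabetes instances are all hypothetical.

```python
import random
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]

def split_train_test(rows, labels, test_fraction=0.30, seed=0):
    """Shuffle the indices once and hold out `test_fraction` of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Synthetic placeholder with the same shape as the dataset (768 rows, 8 inputs)
rng = random.Random(1)
rows = [[rng.gauss(50, 10) for _ in range(8)] for _ in range(768)]
labels = [rng.randint(0, 1) for _ in range(768)]

Z = zscore_columns(rows)
X_train, X_test, y_train, y_test = split_train_test(Z, labels)
```

The sketch follows the order stated in the text (standardize, then split); fitting the scaling on the training portion only would be an alternative design that avoids using test statistics.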

In figure 2 (C), analyses of the main properties of the dataset were carried out, as was done in [25]. The diabetes dataset [30] (Pima Indian diabetes dataset) has 9 numeric features entitled: “Number of times pregnant”, “Plasma glucose concentration at 2 hours in an oral glucose tolerance test”, “Diastolic blood pressure (mm Hg)”, “Triceps skin fold thickness (mm)”, “2-Hour serum insulin (mu U/ml)”, “Body mass index (weight in kg/(height in m)^2)”, “Diabetes pedigree function”, “Age (years)”, “Class variable (0 or 1)”. It has no missing data and has 768 instances (500 of class 0 and 268 of class 1).

Figure 2: Representation of the pipeline of the experiments carried out.

Four different types of models were created in figure 2 (D): Multilayer Perceptron (MLP), Light Gradient Boosting Machine (LGBM), K Nearest Neighbor (KNN) and Decision Tree (DT), all with tuned hyperparameters (using cross-validation with 4 folds, evaluated by Area Under the Curve (AUC)), aiming to improve the adaptability and performance of the models in generalizing the problem represented by the diabetes dataset.

The choice of model types was guided by the diversification of tests, including black box models and transparent models. Thus, LGBM was chosen as a representative of ensemble-based models (black box), the MLP algorithm as a representative of neural networks (black box), the DT algorithm as a representative of tree-based algorithms (transparent) and the KNN algorithm as a representative of algorithms based on instance distance (transparent). Evidently, many other types of models could be used to achieve a greater scope of results, but due to computational cost it was decided to use 4 types.

Taking the test set of the dataset as a basis, figure 2 (B), 4 different versions of it were created, containing different percentage levels of instance perturbation: 0% (original test set), 4%, 6%, and 10%. These percentages were selected from tests with perturbations that varied from 0% to 40%, as the focus is to select different test sets that, through perturbations, would slightly and gradually harm the performance of most models.

Tests were carried out with two types of perturbation: one that applies random noise, controlled by a percentage, to all instances of the data passed to it, as in [22, 28], and another that exchanges a specific percentage of the data instances, as done in [11]. After verifying the drop in performance of the models under perturbation, it was decided to use only the permutation, as it proved better at producing a continuous drop in model performance.
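One possible reading of the permutation-based perturbation is sketched below. It assumes that a chosen fraction of test rows exchange their feature vectors among themselves while the labels stay in place; this interpretation, the function name `permute_instances` and the toy data are assumptions, and the official repository may implement the step differently.

```python
import random

def permute_instances(X, fraction, seed=0):
    """Return a copy of X in which `fraction` of the rows swap places among themselves."""
    rng = random.Random(seed)
    n_perturbed = round(len(X) * fraction)
    chosen = rng.sample(range(len(X)), n_perturbed)  # rows taking part in the exchange
    targets = chosen[:]
    rng.shuffle(targets)                             # new position for each chosen row
    X_new = [row[:] for row in X]
    for src, dst in zip(chosen, targets):
        X_new[dst] = X[src][:]
    return X_new, set(chosen)

# Hypothetical test set of 100 rows; labels are untouched, so swapped rows
# may no longer match the label stored at their position.
X_test = [[float(i), float(i) * 2.0] for i in range(100)]
versions = {pct: permute_instances(X_test, pct / 100)[0] for pct in (0, 4, 6, 10)}
```

Because rows are only exchanged, the marginal distribution of every feature is preserved; only the pairing between instances and labels is disturbed.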

Then, XAI techniques were applied (Dalex, Eli5, eXirt, Lofo, Shap and Skater), figure 2 (E), to create explanations of the models (tested on perturbed and unperturbed data). In the stage of figure 2 (G), all methods generated explanations based on feature relevance ranks (compared using Bump Chart and Spearman Correlation [4]). In addition, to extract information about the confidence of the models, explanations based on the Item Characteristic Curve (ICC) were generated exclusively by eXirt, figure 2 (F).

As described above, only a single dataset was used in the analyses. This does not limit the research results, because it is precisely the focus of the analyses: a single dataset (clean, properly processed and without perturbations) can be considered the most reliable data, especially when compared to alternative versions of it that were perturbed.

It is understood that the results collected through the analyses of this pipeline are sufficient to answer the hypotheses in figure 2 (H) and (I), and can also be generalized to other machine learning contexts. For reproducibility purposes, the official pipeline repository of this article is available at https://github.com/josesousaribeiro/XAI-eXirt-vs-Trust.

5 Results and discussion

The first step to understanding the pipeline results is to observe the performance of the models in tests with and without perturbation. Table 2 shows the results of the 4 models, with the values of the Accuracy, Precision, Recall, F1 and ROC AUC metrics, based on tests with different percentages of perturbation.

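Table 2: Summary of the models’ performances, according to the tests carried out.

| Metric | LGBM 0% | LGBM 4% | LGBM 6% | LGBM 10% | MLP 0% | MLP 4% | MLP 6% | MLP 10% | DT 0% | DT 4% | DT 6% | DT 10% | KNN 0% | KNN 4% | KNN 6% | KNN 10% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.76 | 0.75 | 0.75 | 0.74 | 0.74 | 0.57 | 0.58 | 0.57 | 0.72 | 0.59 | 0.59 | 0.57 | 0.71 | 0.71 | 0.71 | 0.69 |
| Precision | 0.70 | 0.68 | 0.68 | 0.65 | 0.66 | 0.37 | 0.38 | 0.35 | 0.65 | 0.38 | 0.38 | 0.35 | 0.61 | 0.61 | 0.61 | 0.57 |
| Recall | 0.57 | 0.56 | 0.56 | 0.53 | 0.53 | 0.29 | 0.31 | 0.28 | 0.48 | 0.28 | 0.28 | 0.26 | 0.49 | 0.49 | 0.49 | 0.46 |
| F1 | 0.63 | 0.61 | 0.61 | 0.58 | 0.59 | 0.33 | 0.34 | 0.31 | 0.55 | 0.33 | 0.32 | 0.30 | 0.55 | 0.55 | 0.55 | 0.51 |
| ROC AUC | 0.72 | 0.71 | 0.71 | 0.69 | 0.69 | 0.51 | 0.52 | 0.50 | 0.67 | 0.52 | 0.52 | 0.50 | 0.66 | 0.66 | 0.66 | 0.63 |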

The 0% perturbation columns of table 2 show that the model with the best performance was LGBM, followed by MLP, DT and KNN. Such results were already expected since, according to [3], black box models generally perform better than transparent models. With the perturbations added to the test set, performance worsened gradually as the perturbation level increased (4%, 6% and 10%).

The 4% and 6% perturbation levels work as controls in the analyses, since in some cases a larger perturbation makes the model perform minimally better than a smaller one. However, this does not occur with perturbations of 10%. In order to identify how significant the difference in performance between the models is, the Friedman Nemenyi statistical test [12] was carried out, figure 3.

Figure 3: Statistical test summary Friedman Nemenyi.

Figure 3 shows the p-values of the Friedman Nemenyi test for the pairwise comparisons of the metrics shown in table 2. In each cell of the matrix, the closer the value is to 0 (zero), the stronger the evidence that the two compared models perform differently. Adopting a p-value of $0.05$ as the cutoff, cells below this value indicate, with at least 95% statistical confidence, that the models present different performance. Thus, when observing the row “lgbm: original” and the columns “mlp: original”, “dt: original” and “knn: original”, it can be seen that the performances of the models without perturbations do not present a statistically significant difference.

5.1 Are the most reliable models, according to the eXirt method, those being less affected by data perturbations?

Figures 4 and 5 were generated exclusively by eXirt to answer the first hypothesis. The green and red lines represent the instances of the test dataset that were passed to the model. The (thicker) black line represents the average of the ICCs. The difficulty, discrimination and guessing values shown are averages over the curves.

The experiments that generated figures 4 and 5 follow this line of reasoning: when testing models with and without instance perturbations, it must be understood that the most reliable models/tests (in the sense of how much a human can trust them) are the unperturbed ones. Therefore, it is expected that the results of unperturbed models will be more stable and therefore more reliable.

Thus, when observing figure 4, it can be seen that even for models with unperturbed tests (first column on the left of the figure), eXirt was able to indicate the most reliable model, in this case the LGBM model, as it presents the lowest difficulty (-2.18), a discrimination of 1.54 and the lowest guessing value (0.14), compared to the MLP values.

Figure 4: Summary of the global ICC of the LGBM and MLP models.

When observing the item characteristic curves generated after inserting perturbations, figure 4, it can be seen that the difficulty, discrimination and guessing values change slightly. Between the tests with 0% and 10% of instances perturbed, the difficulty goes from -2.18 to -1.86 (LGBM) and from -1.77 to -1.78 (MLP). Discrimination goes from 1.54 to 1.59 (LGBM) and from 1.75 to 1.6 (MLP). Finally, guessing goes from 0.14 to 0.17 (LGBM) and from 0.18 to 0.19 (MLP). It is noted that difficulty and discrimination are inversely related, while guessing varies in the same direction as difficulty, although these relationships do not follow linear proportionality rules.

The unperturbed results in figure 4 show that eXirt was able to indicate the most reliable models: it indicated the model with the best performance (between MLP and LGBM) and also indicated the unperturbed models/tests as the most reliable (lower difficulty, greater discrimination and lower guessing).

Regarding the results of the KNN and DT models, figure 5, there are some impasses in defining the most reliable model from eXirt. Regarding difficulty, the values are -1.82 (KNN) and -0.95 (DT), indicating KNN as more reliable. Regarding discrimination, the values are 1.52 (KNN) and 1.58 (DT), indicating DT as more reliable. Finally, the guessing values are 0.21 (KNN) and 0.14 (DT), indicating DT as more reliable. In this case, depending on the perspective, one can choose either KNN or DT as the most reliable.
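The impasse can be made explicit by applying the reliability criteria of section 3.2.3 one property at a time. The sketch below uses only the averages quoted above; the variable names and the simple per-property comparison are illustrative assumptions, not part of eXirt.

```python
from collections import Counter

# Average ICC properties at 0% perturbation, as quoted in the text for figure 5
icc_summary = {
    "KNN": {"difficulty": -1.82, "discrimination": 1.52, "guessing": 0.21},
    "DT": {"difficulty": -0.95, "discrimination": 1.58, "guessing": 0.14},
}

# Lower difficulty, higher discrimination and lower guessing indicate more reliability
better = {"difficulty": min, "discrimination": max, "guessing": min}

# For each property, pick the model whose value is best under that criterion
preferred = {
    prop: pick(icc_summary, key=lambda model: icc_summary[model][prop])
    for prop, pick in better.items()
}
votes = Counter(preferred.values())

for prop, model in preferred.items():
    print(f"{prop}: {model}")
```

The properties disagree about the winner, which is why the choice between KNN and DT depends on which property is given priority.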

Figure 5: Summary of the global ICC of the KNN and DT models.

When analyzing the inserted perturbations, figure 5, in addition to the gradual worsening of the difficulty, discrimination and guessing values, the large number of red lines in the ICC graphs stands out. These curves represent negative discrimination, which means that the model identified possible problems in predicting certain records in the dataset, problems related to feature values. That is, as seen in the results of the experiments shown in figures 4 and 5, some models are more sensitive than others to problems such as perturbations in the input data. eXirt showed DT as the model that suffered most from these perturbations, which are related to the nature of the data itself (since red lines appear in high quantities even when the test has 0% perturbation).

Note that, based on the 12 tests shown in figures 4 and 5, it can be stated that eXirt was able to identify the least reliable models, those with higher percentages of perturbation, through the values of the item characteristics. Negative discrimination (red lines in the graphs) was decisive in better characterizing the models.

The answer to the hypothesis raised at the beginning of this sub-section is: yes, since eXirt, through the properties of difficulty, discrimination and guessing, was able to distinguish the more reliable models (without perturbations) from the less reliable models (with perturbations). It is also able to identify which machine learning models are most reliable, even if these models are of different types and do not present a statistically significant difference in their performance.

5.2 Do existing XAI methods generate stable explanations even after data perturbations?

To answer this hypothesis, the feature relevance ranks generated by the XAI methods (Dalex, Eli5, eXirt, Lofo, Shap and Skater) were selected, in order to evaluate their behavior when explaining models tested with and without perturbations. The central idea here is to identify the methods that are most stable to perturbations.

Figure 6 gives a general summary of all the feature relevance ranks generated in the tests carried out. Each row corresponds to an XAI method, ordered from most stable (topmost in the figure) to least stable (bottommost in the figure). Each column corresponds to a different model (LGBM, MLP, DT and KNN).

In each sub-figure, the names of the features present in each rank are indicated on the y axis and the percentages of perturbation on the x axis: 0%, 4%, 6% and 10% (from left to right). Each sub-figure also displays the Spearman correlation between the rank generated at each perturbation level and the rank generated without perturbations. The title of each sub-figure presents the sum of these correlations.
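The stability score described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the feature names f1 to f4 are hypothetical, and the Spearman correlation is computed with the closed-form expression that holds for ranks without ties (a library routine such as scipy.stats.spearmanr could be used instead).

```python
def spearman_no_ties(rank_a, rank_b):
    """Spearman correlation between two orderings of the same features.

    Each rank is a list of feature names, from most to least relevant.
    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when there are no ties.
    """
    if sorted(rank_a) != sorted(rank_b):
        raise ValueError("both ranks must contain the same features")
    n = len(rank_a)
    position_b = {feature: i for i, feature in enumerate(rank_b)}
    d_squared = sum((i - position_b[f]) ** 2 for i, f in enumerate(rank_a))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))


def stability_sum(baseline_rank, perturbed_ranks):
    """Sum of the correlations of each perturbed rank against the baseline."""
    return sum(spearman_no_ties(baseline_rank, r) for r in perturbed_ranks)


# Hypothetical ranks: baseline at 0% and three perturbed levels (4%, 6%, 10%).
baseline = ["f1", "f2", "f3", "f4"]
swapped = ["f2", "f1", "f3", "f4"]   # top two features exchanged
reversed_rank = ["f4", "f3", "f2", "f1"]

print(stability_sum(baseline, [baseline, baseline, baseline]))
print(stability_sum(baseline, [swapped, baseline, reversed_rank]))
```

With three perturbed levels compared against the unperturbed rank, the sum ranges from -3 to 3, and it equals 3 only when the rank is unchanged at every level.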

As shown in figure 6, shap was the most stable XAI method, presenting the same feature relevance rank across all perturbation levels in each of the four models, figure 6 (A, B, C and D), and reaching the maximum value of the sum of correlations (sum = 3) in each model. Changes in the ranks generated by shap were observed only for perturbations above 30% (experiments carried out separately). Next are the results of the skater method, which presented the same ranks for the LGBM (sum = 3), MLP (sum = 3) and DT (sum = 3) models, and lower correlations for the KNN (sum = 0.57), figure 6 (E, F, G and H).

Figure 6: Summary of feature relevance ranks generated in all tests.

Next comes eXirt, which presented identical ranks across perturbation levels for the LGBM model (sum = 3) and high correlations in the explanations of the DT (sum = 2.6) and KNN (sum = 2.12) models, figure 6 (I, K and L). However, it is worth highlighting the low value of the sum of the correlations found for the MLP model (sum = 0), figure 6 (J). The dalex, eli5 and lofo methods follow, in that order, as the methods with the lowest stability in the face of perturbations in the model inputs (full chart: https://github.com/josesousaribeiro/XAI-eXirt-vs-Trust/blob/main/output/fig/bump_ranks.pdf).
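As a worked check of the ordering discussed above, the sketch below totals the per-model sums reported in the text for shap, skater and eXirt. Totaling across models is an assumption made here for illustration, since the text does not state the exact ordering criterion, and dalex, eli5 and lofo are left out because their sums are not given in the text.

```python
# Sums of Spearman correlations as reported in the text, per method and model.
reported_sums = {
    "shap": {"LGBM": 3.0, "MLP": 3.0, "DT": 3.0, "KNN": 3.0},
    "skater": {"LGBM": 3.0, "MLP": 3.0, "DT": 3.0, "KNN": 0.57},
    "eXirt": {"LGBM": 3.0, "MLP": 0.0, "DT": 2.6, "KNN": 2.12},
}

# Assumed aggregation: total over the four models (maximum possible is 12).
totals = {method: sum(per_model.values()) for method, per_model in reported_sums.items()}
ordering = sorted(totals, key=totals.get, reverse=True)

for method in ordering:
    print(f"{method}: {totals[method]:.2f}")
```

Under this assumed aggregation the totals are 12.00 for shap, 9.57 for skater and 7.72 for eXirt, which agrees with the order in which the three methods are presented.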

Given these results, the hypothesis raised at the beginning of this sub-section can be answered as follows: Partially yes, since not all XAI methods tested were capable of generating stable explanations in the face of perturbations in the model inputs. This shows that most current XAI methods are sensitive to small changes in the data received during the prediction process, which limits their use in sensitive contexts. It is also evidence that current XAI methods, with the exception of shap, still need to improve their stability in scenarios involving predictions on real data. However, it is noteworthy that eXirt differs from the other methods because it is capable of generating relevant information about latent characteristics of the model, based on IRT.

6 Conclusion and Future Works

This research showed how reliable and stable the explanations of current XAI methods aimed at generating feature relevance ranks are. It highlighted eXirt as an XAI method capable of generating extra information regarding how reliable a model is from the IRT perspective (through the properties of difficulty, discrimination and guessing). This research also showed that current XAI methods, with the exception of shap, are considerably sensitive to changes in model inputs, which means that these methods require greater attention when used in real-world problems.

As future work, the following points stand out: creating an equation capable of transforming the values of the difficulty, discrimination and guessing properties generated by eXirt into a single score that allows for faster interpretation of confidence in the model; testing eXirt with other types of prediction problems, such as regression and multiclass classification; and exploring existing XAI methods through new tests and perturbations, aiming to evaluate their behavior in the face of adversity.

6.0.1 Disclosure of Interests

The authors declare that they have no conflicting interests with the subjects covered in this research.

References

  • [1] Andrade, D.F., Tavares, H.R., da Cunha Valle, R.: Teoria da resposta ao item: conceitos e aplicações. ABE, Sao Paulo (2000)
  • [2] Apley, D.W., Zhu, J.: Visualizing the effects of predictor variables in black box supervised learning models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82(4), 1059–1086 (2020)
  • [3] Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., Herrera, F.: Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion 58, 82–115 (Jun 2020)
  • [4] Artusi, R., Verderio, P., Marubini, E.: Bravais-pearson and spearman correlation coefficients: Meaning, test of hypothesis and confidence interval. The International Journal of Biological Markers 17(2), 148–151 (Apr 2002), publisher: SAGE Publications Ltd STM
  • [5] Arya, V., Bellamy, R.K., Chen, P.Y., Dhurandhar, A., Hind, M., Hoffman, S.C., Houde, S., Liao, Q.V., Luss, R., Mojsilovic, A., et al.: Ai explainability 360: An extensible toolkit for understanding data and machine learning models. J. Mach. Learn. Res. 21(130),  1–6 (2020)
  • [6] Baniecki, H., Kretowicz, W., Piatyszek, P., Wisniewski, J., Biecek, P.: Dalex: responsible machine learning with interactive explainability and fairness in python. The Journal of Machine Learning Research 22(1), 9759–9765 (2021)
  • [7] Biecek, P., Burzykowski, T.: Explanatory model analysis: explore, explain, and examine predictive models. CRC Press (2021)
  • [8] Birnbaum, A.L.: Some latent trait models and their use in inferring an examinee’s ability. Statistical theories of mental test scores (1968)
  • [9] Cardoso, L.F., de S. Ribeiro, J., Santos, V.C.A., Silva, R.L., Mota, M.P., Prudêncio, R.B., Alves, R.C.: Explanation-by-example based on item response theory. In: Intelligent Systems: 11th Brazilian Conference, BRACIS 2022, Campinas, Brazil, November 28–December 1, 2022, Proceedings, Part I. pp. 283–297. Springer (2022)
  • [10] Cardoso, L.F., Santos, V.C., Francês, R.S.K., Prudêncio, R.B., Alves, R.C.: Decoding machine learning benchmarks. In: Brazilian Conference on Intelligent Systems. pp. 412–425. Springer (2020)
  • [11] Chang, C.H., Creager, E., Goldenberg, A., Duvenaud, D.: Explaining image classifiers by counterfactual generation. In: International Conference on Learning Representations (2018)
  • [12] Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The Journal of Machine learning research 7, 1–30 (2006)
  • [13] Främling, K.: Decision theory meets explainable AI. In: International Workshop on Explainable, Transparent Autonomous Agents and Multi-Agent Systems. pp. 57–74. Springer (2020)
  • [14] Ghahramani, Z.: Probabilistic machine learning and artificial intelligence. Nature 521(7553), 452–459 (2015)
  • [15] Gunning, D., Aha, D.: DARPA’s Explainable Artificial Intelligence (XAI) Program. AI Magazine 40(2), 44–58 (Jun 2019). https://doi.org/10.1609/aimag.v40i2.2850, number: 2
  • [16] Hambleton, R.K., Swaminathan, H., Rogers, H.J.: Fundamentals of item response theory, vol. 2. Sage (1991)
  • [17] Khan, A.: Model-specific explainable artificial intelligence techniques: State-of-the-art, advantages and limitations (2022)
  • [18] Korobov, M., Lopuhin, K.: Eli5. https://eli5.readthedocs.io/en/latest/index.html (2021), Accessed January 21, 2021
  • [19] Kreiner, S.: The Rasch Model for Dichotomous Items, chap. 1, pp. 5–26. John Wiley & Sons, Ltd (2012)
  • [20] Lin, Z., Feng, M., dos Santos, C., Yu, M., Xiang, B., Zhou, B., Bengio, Y.: A structured self-attentive sentence embedding. In: International Conference on Learning Representations. International Conference on Learning Representations, ICLR (2017)
  • [21] Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Proceedings of the 31st international conference on neural information processing systems. pp. 4768–4777 (2017)
  • [22] Molnar, C.: Interpretable Machine Learning. Lulu. com (2020)
  • [23] Nori, H., Jenkins, S., Koch, P., Caruana, R.: Interpretml: A unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223 (2019)
  • [24] Oracle: Skater. https://oracle.github.io/Skater/overview.html#skater (2021), Accessed January 21, 2021
  • [25] Ribeiro, J., Silva, R., Cardoso, L., Alves, R.: Does dataset complexity matters for model explainers? In: 2021 IEEE International Conference on Big Data (Big Data). pp. 5257–5265 (2021). https://doi.org/10.1109/BigData52589.2021.9671630
  • [26] Ribeiro, M.T., Singh, S., Guestrin, C.: “why should I trust you?”: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. pp. 1135–1144 (2016)
  • [27] Ribeiro, M.T., Singh, S., Guestrin, C.: Anchors: High-precision model-agnostic explanations. In: AAAI Conference on Artificial Intelligence (AAAI) (2018)
  • [28] Robnik-Šikonja, M., Bohanec, M.: Perturbation-based explanations of prediction models. Human and Machine Learning: Visible, Explainable, Trustworthy and Transparent pp. 159–175 (2018)
  • [29] Roseline, S.A., Geetha, S.: Android malware detection and classification using lofo feature selection and tree-based models. In: Journal of Physics: Conference Series. vol. 1911, p. 012031. IOP Publishing (2021)
  • [30] Sigillito, V.: https://www.openml.org/search?type=data&id=37&sort=runs& status=active (2023), accessed March 1, 2024.
  • [31] de Sousa Ribeiro Filho, J., Cardoso, L.F.F., da Silva, R.L.S., Carneiro, N.J.S., Santos, V.C.A., de Oliveira Alves, R.C.: Explanations based on item response theory (exirt): A model-specific method to explain tree-ensemble model in trust perspective. Expert Systems with Applications 244, 122986 (2024)