Explainable AI: A Review of Machine Learning Interpretability Methods

Recent advances in artificial intelligence (AI) have led to its widespread industrial adoption, with machine learning systems demonstrating superhuman performance in a significant number of tasks. However, this surge in performance has often been achieved through increased model complexity, turning such systems into “black box” approaches and causing uncertainty regarding the way they operate and, ultimately, the way they come to decisions. This ambiguity has made it problematic for machine learning systems to be adopted in sensitive yet critical domains, where their value could be immense, such as healthcare. As a result, scientific interest in the field of Explainable Artificial Intelligence (XAI), a field that is concerned with the development of new methods that explain and interpret machine learning models, has been tremendously reignited in recent years. This study focuses on machine learning interpretability methods; more specifically, a literature review and taxonomy of these methods are presented, as well as links to their programming implementations, in the hope that this survey will serve as a reference point for both theorists and practitioners.

1. Introduction

Artificial intelligence (AI) had for many years mostly been a field focused heavily on theory, without many applications of real-world impact. This has radically changed over the past decade, as a combination of more powerful machines, improved learning algorithms, and easier access to vast amounts of data enabled advances in Machine Learning (ML) and led to its widespread industrial adoption [ 1 ]. Around 2012, Deep Learning methods [ 2 ] started to dominate accuracy benchmarks, achieving superhuman results and further improving in the subsequent years. As a result, today, many real-world problems in different domains, stretching from retail and banking [ 3 , 4 ] to medicine and healthcare [ 5 , 6 , 7 ], are tackled using machine learning models.

However, this improved predictive accuracy has often been achieved through increased model complexity. A prime example is the deep learning paradigm, which is at the heart of most state-of-the-art machine learning systems. It allows machines to automatically discover, learn, and extract the hierarchical data representations that are needed for detection or classification tasks. This hierarchy of increasing complexity, combined with the fact that vast amounts of data are used to train and develop such complex systems, boosts the systems’ predictive power in most cases, but inherently reduces their ability to explain their inner workings and mechanisms. As a consequence, the rationale behind their decisions becomes quite hard to understand and, therefore, their predictions hard to interpret.

There is a clear trade-off between the performance of a machine learning model and its ability to produce explainable and interpretable predictions. On the one hand, there are the so-called black-box models, which include deep learning [ 2 ] and ensembles [ 8 , 9 , 10 ]. On the other hand, there are the so-called white-box or glass-box models, which easily produce explainable results; common examples include linear [ 11 ] and decision-tree-based [ 12 ] models. Although more explainable and interpretable, the latter models are not as powerful, and they fail to achieve state-of-the-art performance when compared to the former. Both their poor performance and their ability to be well-interpreted and easily explained come down to the same reason: their frugal design.

Systems whose decisions cannot be well-interpreted are difficult to trust, especially in sectors such as healthcare or self-driving cars, where moral and fairness issues have naturally arisen. This need for trustworthy, fair, robust, and high-performing models for real-world applications led to the revival of the field of eXplainable Artificial Intelligence (XAI) [ 13 ]—a field focused on the understanding and interpretation of the behaviour of AI systems, which, in the years prior to its revival, had lost the attention of the scientific community, as most research focused on the predictive power of algorithms rather than the understanding behind these predictions. The popularity of the search term “Explainable AI” throughout the years, as measured by Google Trends, is illustrated in Figure 1 . The noticeable spike in recent years, indicating the rejuvenation of the field, is also reflected in the increased research output of the same period.

Figure 1. Google Trends Popularity Index (Max value is 100) of the term “Explainable AI” over the last ten years (2011–2020).

The Contribution of this Survey

As the demand for more explainable machine learning models with interpretable predictions rises, so does the need for methods that can help to achieve these goals. This survey focuses on providing an extensive and in-depth identification, analysis, and comparison of machine learning interpretability methods. Its end goal is to serve as a reference point for both theorists and practitioners, not only by providing a taxonomy of the existing methods, but also by scoping the best use cases for each method and providing links to their programming implementations, the latter being found in Appendix A.

2. Fundamental Concepts and Background

2.1. Explainability and Interpretability

The terms interpretability and explainability are usually used interchangeably by researchers; however, while the two terms are very closely related, some works identify their differences and distinguish between the two concepts. There is no concrete mathematical definition for interpretability or explainability, nor have they been measured by some metric; however, a number of attempts have been made [ 14 , 15 , 16 ] to clarify not only these two terms, but also related concepts such as comprehensibility. One of the most popular definitions of interpretability is that of Doshi-Velez and Kim, who, in their work [ 15 ], define it as “the ability to explain or to present in understandable terms to a human”. Another popular definition came from Miller in his work [ 18 ], where he defines interpretability as “the degree to which a human can understand the cause of a decision”. Although intuitive, these definitions lack mathematical formality and rigorousness [ 17 ].

Based on the above, interpretability is mostly connected with the intuition behind the outputs of a model [ 17 ]: the more interpretable a machine learning system is, the easier it is to identify cause-and-effect relationships within the system’s inputs and outputs. For example, in image recognition tasks, part of the reason that led a system to decide that a specific object is part of an image (output) could be certain dominant patterns in the image (input). Explainability, on the other hand, is associated with the internal logic and mechanics inside a machine learning system. The more explainable a model, the deeper the understanding that humans achieve of the internal procedures that take place while the model is training or making decisions. An interpretable model does not necessarily translate to one whose internal logic or underlying processes humans are able to understand. Therefore, regarding machine learning systems, interpretability does not axiomatically entail explainability, or vice versa. As a result, Gilpin et al. [ 16 ] argued that interpretability alone is insufficient and that the presence of explainability is also of fundamental importance. Mostly aligned with the work of Doshi-Velez and Kim [ 15 ], this study considers interpretability to be a broader term than explainability.

2.2. Evaluation of Machine Learning Interpretability

Doshi-Velez and Kim [ 15 ] proposed the following classification of evaluation methods for interpretability: application-grounded, human-grounded, and functionally-grounded, subsequently discussing the potential trade-offs among them. Application-grounded evaluation concerns itself with how the results of the interpretation process affect the human, domain-expert end-user in terms of a specific and well-defined task or application. Concrete examples under this type of evaluation include whether an interpretability method results in better identification of errors or less discrimination. Human-grounded evaluation is similar to application-grounded evaluation; however, there are two main differences: firstly, the tester in this case does not have to be a domain expert, but can be any human end-user; secondly, the end goal is not to evaluate a produced interpretation with respect to its fitness for a specific application, but rather to test the quality of the produced interpretation in a more general setting and measure how well the general notions are captured. An example of measuring how well an interpretation captures the abstract notion of an input would be for humans to be presented with different interpretations of the input and to select the one that they believe best encapsulates its essence. Functionally-grounded evaluation does not require any experiments that involve humans, but instead uses formal, well-defined mathematical definitions of interpretability to evaluate the quality of an interpretability method. This type of evaluation usually follows the other two: once a class of models has already passed some interpretability criteria via human-grounded or application-grounded experiments, mathematical definitions can be used to further rank the quality of the interpretability models. Functionally-grounded evaluation is also appropriate when experiments that involve humans cannot be applied for some reason (e.g., ethical considerations) or when the proposed method has not reached a mature enough stage to be evaluated by human users. That said, determining the right measurement criteria and metric for each case is challenging and remains an open problem.

2.3. Related Work

The concepts of interpretability and explainability are hard to rigorously define; however, multiple attempts have been made towards that goal, the most emblematic works being [ 14 , 15 ].

The work of Gilpin et al. [ 16 ] constitutes another attempt to define the key concepts around interpretability in machine learning. The authors, while focusing mostly on deep learning, also proposed a taxonomy by which the interpretability methods for neural networks can be classified into three different categories. The first one encompasses methods that emulate the processing of the data in order to create insights into the connections between the inputs and outputs of the model. The second category contains approaches that try to explain the representation of data inside a network, while the last category consists of transparent networks that explain themselves. Lastly, the authors recognise the promising nature of the progress achieved in the field of explaining deep neural networks, but also highlight the lack of combinatorial approaches that would attempt to merge different explanation techniques, claiming that such methods would result in better explanations.

Adadi and Berrada [ 17 ] conducted an extensive literature review, collecting and analysing 381 different scientific papers between 2004 and 2018. They arranged all of the scientific work in the field of explainable AI along four main axes and stressed the need for more formalism to be introduced in the field of XAI and for more interaction between humans and machines. After highlighting the trend of the community to explore explainability only in terms of modelling, they proposed embracing explainability in other aspects of machine learning. Finally, they suggested a potential research direction that would be towards the composition of existing explainability methods.

Another survey that attempted to categorise the existing explainability methods is that of Guidotti et al. [ 19 ]. Firstly, the authors identified four categories of methods based on the type of problem that they were created to tackle: one for explaining black-box models, one for inspecting them, one for explaining their outcomes, and, finally, one for creating transparent box models. Subsequently, they proposed a taxonomy that takes into account the type of underlying explanation model (explanator), the type of data used as input, the problem the method addresses, as well as the black-box model that was “opened”. As with the works previously discussed, the lack of formality and the need for metrics to evaluate the performance of interpretability methods were highlighted once again, while the incapacity of most black-box explainability methods to interpret models that make decisions based on unknown or latent features was also raised. Lastly, the lack of interpretability techniques in the field of recommender systems is identified and an approach according to which models could be learned directly from explanations is proposed.

Upon identifying the lack of formality and of ways to measure the performance of interpretability methods, Murdoch et al. [ 20 ] published a survey in 2019, in which they created an interpretability framework in the hope that it would help to bridge the aforementioned gap in the field. The Predictive, Descriptive, Relevant (PDR) framework introduced three types of metrics for rating interpretability methods: predictive accuracy, descriptive accuracy, and relevancy. To conclude, they dealt with transparent models and post-hoc interpretation, as they believed that post-hoc interpretability could be used to elevate the predictive accuracy of a model and that transparent models could increase their use cases by increasing predictive accuracy—making clear that, in some cases, the combination of the two methods is ideal.

A more recent study, carried out by Arrieta et al. [ 21 ], introduced a different type of arrangement that initially distinguishes between transparent and post-hoc methods and subsequently creates sub-categories. An alternative taxonomy was developed specifically for deep learning interpretability methods, due to their high volume. Under this taxonomy, four categories were proposed: one for providing explanations of deep network processing, one relating to the explanation of deep network representations, one concerned with explanation-producing systems, and one encompassing hybrids of transparent and black-box methods. Finally, the authors dived into the concept of Responsible Artificial Intelligence, a methodology that introduces a series of criteria for implementing AI in organizations.

3. Different Scopes of Machine Learning Interpretability: A Taxonomy of Methods

Different viewpoints exist when it comes to looking at the emerging landscape of interpretability methods, such as the type of data these methods deal with or whether they refer to global or local properties. The classification of machine learning interpretability techniques should not be one-sided; there exist different points of view which distinguish, and could further divide, these methods. Hence, in order for a practitioner to identify the ideal method for the specific criteria of each problem encountered, all aspects of each method should be taken into consideration.

An especially important separation of interpretability methods is based on the type of algorithms to which they can be applied. If their application is restricted to a specific family of algorithms, then these methods are called model-specific. In contrast, methods that can be applied to any possible algorithm are called model-agnostic. Additionally, one crucial aspect of dividing the interpretability methods is the scale of interpretation. If the method provides an explanation only for a specific instance, then it is local; if the method explains the whole model, then it is global. Lastly, one more factor that should be taken into consideration is the type of data on which these methods can be applied. The most common types of data are tabular and images, but there are also some methods for text data. Figure 2 presents a summarized mind-map, which visualizes the different aspects by which an interpretability method can be classified. These aspects should always be taken into consideration by practitioners, in order for the ideal method with respect to their needs to be identified.

Figure 2. Taxonomy mind-map of Machine Learning Interpretability Techniques.

This taxonomy focuses on the purpose that these methods were created to serve and the ways through which they accomplish this purpose. As a result, according to the presented taxonomy, four major categories for interpretability methods are identified: methods for explaining complex black-box models, methods for creating white-box models, methods that promote fairness and restrict the existence of discrimination, and, lastly, methods for analysing the sensitivity of model predictions.

3.1. Interpretability Methods to Explain Black-Box Models

This first category encompasses methods that are concerned with pre-trained black-box machine learning models. More specifically, such methods do not try to create interpretable models but, instead, try to interpret already trained, often complex models, such as deep neural networks. That is also why they are sometimes referred to as post-hoc interpretability methods in the related scientific literature.

Under this taxonomy, this category, due to the volume of scientific work around deep learning related interpretability methodologies, is split into two sub-categories: one specifically for deep learning methods and one concerning all other black-box models. For each of these sub-categories, a summary of the included methods is presented in Table 1 and Table 2, respectively.

Table 1. Interpretability Methods to Explain Deep Learning Models.

Table 2. Interpretability Methods to Explain any Black-Box Model.

3.1.1. Interpretability Methods to Explain Deep Learning Models

The widespread adoption of deep learning methods, combined with the fact that it is in their very nature to produce black-box machine learning systems, has led to a considerable amount of experiments and scientific work around them and, consequently, around tools regarding their interpretability. A substantial portion of the attention regarding Python tools is focused on deep learning for images and, more specifically, on the concept of saliency in images, as initially proposed in [ 22 ]. Saliency refers to unique features of an image, such as pixels or resolution, in the context of visual processing. These unique features depict the visually alluring locations in an image, and a saliency map is a topographical representation of them.

Gradients: first proposed in [ 23 ], the gradients explanation technique, as its name suggests, is a gradient-based attribution method, according to which each gradient quantifies how much a change in each input dimension would change the predictions in a small neighbourhood around the input. Consequently, the method computes an image-specific class saliency map corresponding to the gradient of an output neuron with respect to the input, highlighting the areas of the given image that are discriminative with respect to the given class. An improvement over the initial method was proposed in [ 24 ], where the well-known Krizhevsky network [ 25 ] was utilised to outperform state-of-the-art saliency models by a large margin, increasing the amount of explained information by 67% compared to the state of the art. Furthermore, in [ 26 ], a task-specific pre-training scheme was designed in order to make multi-context modelling suited for saliency detection.
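To make the idea concrete, the following is a minimal sketch of a vanilla gradient saliency map in PyTorch, assuming a pre-trained torchvision classifier and an already preprocessed input tensor; the network choice, the random stand-in input, and the target class index are illustrative assumptions rather than part of the original method description.

    import torch
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1").eval()   # any differentiable image classifier would do

    def gradient_saliency(model, x, target_class):
        x = x.detach().clone().requires_grad_(True)   # track gradients with respect to the input
        scores = model(x)                             # forward pass: raw class scores
        scores[0, target_class].backward()            # d(target class score) / d(input pixels)
        # Saliency map = gradient magnitude, reduced over the colour channels
        return x.grad.abs().max(dim=1)[0]             # shape (1, H, W)

    x = torch.randn(1, 3, 224, 224)                   # stand-in for a preprocessed image
    saliency_map = gradient_saliency(model, x, target_class=243)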

Integrated Gradients [ 27 ] is a gradient-based attribution method that attempts to explain predictions made by a deep neural network by attributing them to the network’s input features. It is essentially a variation on calculating the gradient of the prediction output with respect to the features of the input, as implemented by the simpler Gradients method. Under this variation, a much desired property, known as completeness or Efficiency [ 28 ] or Summation to Delta [ 29 ], is satisfied: the attributions sum up to the target output minus the target output evaluated at the baseline. Moreover, two fundamental axioms that attribution methods ought to satisfy are identified: sensitivity and implementation invariance. Upon highlighting that most known attribution methods do not satisfy these axioms, the authors propose the integrated gradients method as a simple way to obtain strong interpretability results. Another work, closely related to the integrated gradients method, was proposed in [ 30 ], where attributions are used to help identify weaknesses of three question-answering models better than conventional methods, while also providing workflow superiority.
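As a rough illustration of the path formulation, the sketch below approximates integrated gradients with a simple Riemann sum over interpolated inputs; the all-zero baseline, the number of steps, and the use of a torchvision network are assumptions made for the example.

    import torch
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1").eval()

    def integrated_gradients(model, x, target_class, baseline=None, steps=50):
        if baseline is None:
            baseline = torch.zeros_like(x)                 # a common (but not mandatory) choice of baseline
        # Interpolate between the baseline and the input along a straight path
        alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1, 1)
        path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
        model(path)[:, target_class].sum().backward()      # one backward pass covers all interpolation steps
        avg_grads = path.grad.mean(dim=0, keepdim=True)    # average gradient along the path
        # Completeness: the attributions sum (approximately) to F(x) - F(baseline)
        return (x - baseline) * avg_grads

    attributions = integrated_gradients(model, torch.randn(1, 3, 224, 224), target_class=243)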

DeepLIFT [ 29 ] is a popular algorithm designed to be applied on top of deep neural network predictions. The method, as described in [ 29 ], is an improvement over its first form, also known as the “Gradient * Input” method, where it was observed that saliency maps obtained using the gradient method can be greatly enhanced by multiplying the gradient with the input signal—an operation that is essentially a first-order Taylor approximation of how the output would change if the input were set to zero. The method’s superiority was demonstrated by showing considerable benefits over gradient-based methods when applied to models trained on natural images and genomics data. By observing the activation of each neuron, DeepLIFT assigns contribution scores, calculated by comparing the difference of the output from some reference output to the differences of the inputs from their reference inputs. By optionally giving separate consideration to positive and negative contributions, DeepLIFT can also reveal dependencies that are missed by other approaches, such as the Integrated Gradients approach [ 27 ].

Guided BackPropagation [ 31 ], which is also known as guided saliency, is a variant of the deconvolution approach [ 32 ] for visualizing features learned by CNNs, which can also be applied to a broad range of network structures. Under this approach, the use of max-pooling in convolutional neural networks for small images is questioned and the replacement of max-pooling layers by a convolutional layer with increased stride is proposed, resulting in no loss of accuracy on several image recognition benchmarks.

Deconvolution, as proposed in [ 32 ], is a technique for visualizing Convolutional Neural Networks (CNNs or ConvNets) by utilising De-convolutional Networks (DeconvNets or DCNNs), as initially proposed in [ 33 ]. DeconvNets use the same components, such as filtering and pooling, but in reverse fashion: instead of mapping pixels to features, they apply the opposite. Originally, in [ 33 ], DeconvNets were proposed as a way of performing unsupervised learning; however, in [ 32 ] they are not used in any learning capacity, but rather as a tool to provide insight into the function of the intermediate feature layers and pieces of information of an already trained CNN. More specifically, a novel way of mapping feature activity in intermediate layers back to the input feature space (pixels, in the case of images) was proposed, showing what input pattern originally caused a given activation in the feature maps. This is done by attaching a DeconvNet to each of the CNN’s layers, providing a continuous path back to the image pixels.

Class Activation Maps, or CAMs, first introduced in [ 34 ], constitute another deep learning interpretability method used for CNNs. More specifically, CAMs indicate the discriminative regions of an image used by a CNN to identify the category of the image. A feature vector is created by computing and concatenating the averages of the activations of the convolutional feature maps located just before the final output layer. Subsequently, a weighted sum of this vector is fed to the final softmax loss layer. Using this simple architecture, the importance of the image regions pertaining to their classification can, therefore, be identified by projecting the weights of the output layer back onto the convolutional feature maps. CAM has two distinct drawbacks: firstly, in order to be applied, it requires that neural networks have a very specific structure in their final layers and, for all other networks, the structure needs to be changed and the network re-trained under the new architecture. Secondly, the method, being constrained to only visualising the final convolutional layers of a CNN, is only useful when it comes to interpreting the very last stages of the network’s image classification and is unable to provide any insight into the previous stages.

Grad-CAM [ 35 ] is a strict generalization of CAM that can produce visual explanations for any CNN, regardless of its architecture, thus overcoming one of the limitations of CAM. As a gradient-based method, Grad-CAM uses the class-specific gradient information flowing into the final convolutional layer of a CNN in order to produce a coarse localization map of the regions of the image that are important for classification, making CNN-based models more transparent. The authors of Grad-CAM also demonstrated how the technique can be combined with existing pixel-space visualizations to create a high-resolution, class-discriminative visualization, Guided Grad-CAM. By generating visual explanations in order to better understand the image classification of popular networks while using both Grad-CAM and Guided Grad-CAM, it was shown that the proposed techniques outperform pixel-space gradient visualizations (Guided Backpropagation and Deconvolution) when evaluated in terms of localisation (the ability to localise objects in images using holistic image class labels only) and faithfulness (the ability to accurately explain the function learned by a model). While an improvement over CAM, Grad-CAM has its own limitations, the most notable including its inability to localize multiple occurrences of an object in an image, due to its partial derivative assumptions, its inability to accurately determine class-region coverage in an image, and the possible loss of signal due to the continual upsampling and downsampling processes.
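The following is a minimal Grad-CAM sketch for a torchvision ResNet, where the last convolutional block is model.layer4; for other architectures the hooked layer would need to change, and all variable names are illustrative.

    import torch
    import torch.nn.functional as F
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1").eval()
    activations, gradients = {}, {}

    # Capture the feature maps and their gradients at the last convolutional block
    model.layer4.register_forward_hook(lambda m, i, o: activations.update(value=o.detach()))
    model.layer4.register_full_backward_hook(lambda m, gi, go: gradients.update(value=go[0].detach()))

    def grad_cam(x, target_class):
        model.zero_grad()
        model(x)[0, target_class].backward()                           # class-specific gradients reach layer4
        weights = gradients["value"].mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
        cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))  # keep positive evidence only
        # Upsample the coarse localisation map to the input resolution for visualisation
        return F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)

    heatmap = grad_cam(torch.randn(1, 3, 224, 224), target_class=243)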

Grad-CAM++ [ 36 ] is an extension of the Grad-CAM method that provides better visual explanations of CNN model predictions. More specifically, object localization is extended to multiple object instances in a single image while using a weighted combination of the positive partial derivatives of the last convolutional layer feature maps with respect to a specific class score as weights to generate a visual explanation for the corresponding class label. This is especially helpful in multi-label classification problems, while the different weight assigned to each pixel makes it possible to capture the importance of each pixel separately in the gradient feature map.

Layer-wise Relevance Propagation (LRP) [ 37 ] is a “decomposition of nonlinear classifiers” technique that brings interpretability to highly complex deep neural networks by propagating their predictions backwards. The proposed propagation procedure satisfies a conservation property, whereby the magnitude of any output remains intact as it is backpropagated through the lower-level layers of the network: starting from the output neurons and going all the way back to the input-layer neurons, each neuron redistributes to the lower layer the same amount of information it received from the higher layer. The method can be applied to various data types, such as images, text, and more, as well as to various neural network architectures.

By pointing out and exploiting the fact that the gradient of the loss function with respect to the input can be interpreted as a sensitivity map, Smilkov et al. [ 38 ] created SmoothGrad, a method that can be applied in order to reduce noise and visually sharpen such sensitivity maps. SmoothGrad can be combined with other sensitivity map algorithms, such as Integrated Gradients [ 27 ] and Guided BackPropagation [ 31 ], in order to produce enhanced sensitivity maps. More specifically, two smoothing approaches were explored and experimented with: the first one, which had an excellent smoothing impact, calculates the average of the maps made from many small perturbations of a given instance, while the second perturbs the data with random noise and then performs the training step. The experiments showed that these two techniques can have an additive effect, and that combining them provides superior results to applying them separately. Upon performing a series of experiments, the authors concluded that the estimated smoothed gradient leads to sharper visualisations and more coherent sensitivity maps when compared to the non-smoothed gradient.
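A minimal SmoothGrad sketch following the first of the two approaches described above: vanilla gradient maps are averaged over several noisy copies of the input. The noise level, sample count, and torchvision model are illustrative assumptions.

    import torch
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1").eval()

    def smooth_grad(model, x, target_class, n_samples=25, noise_level=0.15):
        sigma = noise_level * (x.max() - x.min())          # noise scale relative to the input range
        total = torch.zeros_like(x)
        for _ in range(n_samples):
            noisy = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)
            model(noisy)[0, target_class].backward()       # vanilla gradient for the noisy copy
            total += noisy.grad
        # Averaging the noisy gradients suppresses local fluctuations in the sensitivity map
        return (total / n_samples).abs().max(dim=1)[0]

    smoothed_map = smooth_grad(model, torch.randn(1, 3, 224, 224), target_class=243)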

In order to interpret the predictions of deep neural networks for images, the RISE algorithm [ 39 ] creates a saliency map for any black-box model, indicating how important each pixel of the image is with respect to the network’s prediction. The method follows a simple yet powerful approach: each input image is multiplied element-wise with random masks and the resulting images are subsequently fed to the model for classification. The model produces a probability-like score for the masked images with respect to each of the available classes, and a saliency map for the original image is created as a linear combination of the masks. The coefficients of this linear combination are calculated using the scores produced by the model for the corresponding masked inputs with respect to the target class.
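The sketch below captures the core of this idea: score randomly masked copies of the image and accumulate the masks weighted by those scores. A faithful implementation samples smooth, low-resolution masks and upsamples them; the plain binary masks, mask count, and keep probability used here are simplifying assumptions.

    import torch
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1").eval()

    def rise_saliency(model, x, target_class, n_masks=500, p_keep=0.5):
        _, _, h, w = x.shape
        saliency = torch.zeros(h, w)
        with torch.no_grad():
            for _ in range(n_masks):
                mask = (torch.rand(1, 1, h, w) < p_keep).float()          # random binary mask
                score = model(x * mask).softmax(dim=1)[0, target_class]   # probability of the target class
                saliency += score * mask[0, 0]                            # weight the mask by that score
        return saliency / (n_masks * p_keep)                              # normalise by the expected coverage

    saliency_map = rise_saliency(model, torch.randn(1, 3, 224, 224), target_class=243)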

In [ 42 ], the idea of Concept Activation Vectors (CAVs) was introduced, providing a human-friendly interpretation of a neural network’s internal state: an intuition of how sensitive a prediction is to a user-defined concept and how important the concept is to the classification itself. One of the issues with saliency maps is that concepts in an image, such as the “human” concept or the “animal” concept, cannot be expressed as pixels and are not part of the input features either, and therefore cannot be captured by saliency maps. To address this, CAVs try to provide a translation between the input vector space and the high-level concept space; a CAV corresponding to a concept is essentially a vector in the direction of the values (the result of activation functions in a network’s neurons) of that concept’s set of examples. By utilising CAVs, the TCAV method provides a quantitative measure of the importance of a concept if and only if the network has learned about it. Furthermore, TCAV can reveal any concept learnt, even if it was not explicitly tagged within the training set or not part of the input feature set.

Yosinski et al. [ 40 ] proposed applying regularisation as an additional processing step in the saliency map creation process. More specifically, by introducing four primary regularization techniques, they enforced stronger prior distributions in order to promote bias towards more recognisable and interpretable visualisations. They showed that the best results were obtained when the different regularisers were combined, while each of these regularisation methods can also individually enhance interpretability.

In [ 43 ], an interpretability technique for neural networks operating in the natural language processing (NLP) domain was proposed. Under this approach, smaller, tailored pieces of the original input text are extracted and then used as input in order to try and produce the same output prediction as the original full-text input. These small pieces, called rationales, provide the necessary explanation and justification for the output in terms of the input. The architecture consists of two components, a generator and an encoder, which are trained to function well as a whole. The generator produces candidate rationales, and the encoder uses them to produce predicted probability scores. The generator and the encoder are trained jointly, and, through the minimization of the cost function, it is decided which candidates will be characterised as rationales. Essentially, the two components work together in order to find subsets of text that are highly associated with the predicted score.

Deep Taylor decomposition [ 41 ] is a method that decomposes a neural network’s output, for a given input instance, into contributions of this instance by backpropagating the explanations from the output layer to the input. Its usefulness was demonstrated within the computer vision paradigm, in order to measure the importance of single pixels in image classification tasks; however, the method can also be applied to different types of data, both as a visualization tool and as a tool for more complex analysis. The proposed approach has strong links to relevance propagation; the theoretical connections between the Taylor decomposition of a function and rule-based relevance propagation techniques are thoroughly discussed, demonstrating a close relationship between the two approaches for a particular class of neural networks. Deep Taylor decomposition produces heatmaps, enabling the user to deeply understand the impact of each single input pixel when classifying a previously unseen image. It does not require hyperparameter tuning, is robust under different architectures and datasets, and works both with custom deep network models and with existing pre-trained ones.

Kindermans et al. [ 44 ] showed that, while the Deconvolution [ 32 ], Guided BackPropagation [ 31 ], and LRP [ 37 ] methods help in interpreting deep neural networks, they do not produce the theoretically correct interpretation, even in the simplest neural network setting: a linear model developed using data produced by a linear generative model. Using this simplified setup, the authors showed that the direction of the network’s gradient does not necessarily provide an estimate for the signal in the data, but instead corresponds to the relationship between the signal and the noise; the array of parameters learnt by the network points in the noise-cancelling direction rather than the direction of the signal. In order to address this issue, after introducing a quality criterion for neuron-wise signal estimators in order to evaluate existing methods and ultimately obtain estimators that optimize towards this criterion, the authors propose two interpretation methods that are theoretically sound for linear models, PatternNet and PatternAttribution. The former is used to estimate the correct direction, improving upon the DeConvNet [ 32 ] and Guided BackPropagation [ 31 ] visualizations, while the latter is used to identify how much the different signal dimensions contribute to the output through the network layers. As both methods treat neurons independently, the produced interpretation is a superposition of the interpretations of the individual neurons.

In Figure 3 , a comparison of several interpretability methods for explaining deep learning models on ImageNet sample images, while using the innvestigate package, is presented.

Figure 3. Comparison of Interpretability Methods to Explain Deep Learning Models on ImageNet sample images, using the innvestigate package.

3.1.2. Interpretability Methods to Explain any Black-Box Model

This section focuses on interpretability techniques that can be applied to any black-box model. First introduced in [ 45 ], the local interpretable model-agnostic explanations (LIME) method is one of the most popular interpretability methods for black-box models. Following a simple yet powerful approach, LIME can generate interpretations for single prediction scores produced by any classifier. For any given instance and its corresponding prediction, simulated randomly-sampled data around the neighbourhood of the input instance, for which the prediction was produced, are generated. Subsequently, using the model in question, new predictions are made for the generated instances and weighted by their proximity to the input instance. Lastly, a simple, interpretable model, such as a decision tree, is trained on this newly-created dataset of perturbed instances. By interpreting this local model, the initial black-box model is consequently interpreted. Although LIME is powerful and straightforward, it has its drawbacks. In 2020, the first theoretical analysis of LIME [ 46 ] was published, validating the significance and meaningfulness of LIME, but also proving that poor choices of parameters could lead LIME to miss out on important features. Figure 4 illustrates the application of the LIME method in order to explain the rationale behind the classification of an instance of the Quora Insincere Questions Dataset, and a from-scratch sketch of the local-surrogate idea follows the figure.

Figure 4. Local interpretable model-agnostic explanations (LIME) is used to explain the rationale behind the classification of an instance of the Quora Insincere Questions Dataset.
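As signposted above, the following from-scratch sketch illustrates the local-surrogate idea behind LIME for tabular data: perturb the instance, weight the perturbed points by proximity, and read the explanation off the coefficients of a weighted linear surrogate. The synthetic dataset, kernel, and Ridge surrogate are illustrative assumptions; the lime package builds interpretable input representations and feature selection on top of this idea and also supports text and image data.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import Ridge

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    black_box = RandomForestClassifier(random_state=0).fit(X, y)   # stand-in black-box model

    def lime_explain(black_box, instance, n_samples=1000, kernel_width=0.75, scale=1.0):
        rng = np.random.default_rng(0)
        # 1. Sample points in a neighbourhood of the instance
        perturbed = instance + rng.normal(0.0, scale, size=(n_samples, instance.shape[0]))
        # 2. Query the black box on the perturbed points (probability of class 1)
        targets = black_box.predict_proba(perturbed)[:, 1]
        # 3. Weight the samples by an exponential kernel on their distance to the instance
        distances = np.linalg.norm(perturbed - instance, axis=1)
        weights = np.exp(-(distances ** 2) / kernel_width ** 2)
        # 4. Fit an interpretable surrogate; its coefficients form the local explanation
        surrogate = Ridge(alpha=1.0).fit(perturbed, targets, sample_weight=weights)
        return surrogate.coef_

    local_weights = lime_explain(black_box, X[0])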

Zafar and Khan [ 47 ] argued that the random perturbation and feature selection methods that LIME utilises result in unstable generated interpretations, because, for the same prediction, different interpretations can be generated, which can be problematic for deployment. In order to address this uncertainty, a deterministic version of LIME, DLIME, is proposed. In this version, random perturbation is replaced with hierarchical clustering to group the data, and k-nearest neighbours (KNN) is used to select the cluster to which the instance in question is believed to belong. Using three medical datasets, they demonstrate the superiority of DLIME over LIME in terms of the Jaccard similarity among multiple generated explanations.

SHAP: SHapley Additive exPlanations (SHAP) [ 48 ] is a game-theory-inspired method that attempts to enhance interpretability by computing importance values for each feature of individual predictions. Firstly, the authors define the class of additive feature attribution methods, which unifies six current methods, including LIME [ 45 ], DeepLIFT [ 29 ], and Layer-Wise Relevance Propagation [ 49 ], all of which use the same explanation model. Subsequently, they propose SHAP values as a unified measure of feature importance that maintains three desirable properties: local accuracy, missingness, and consistency. Finally, they present several different methods for SHAP value estimation and provide experiments demonstrating the superiority of these values not only in terms of differentiating among the different output classes, but also in terms of better aligning with human intuition when compared to many other existing methods.
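To illustrate the underlying game-theoretic quantity, the sketch below estimates a single Shapley value by Monte-Carlo sampling of feature orderings, filling in “absent” features from a background dataset; this shows the principle behind SHAP rather than the optimised estimators shipped in the shap library, and the dataset, model, and sample counts are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X, y)
    predict = lambda rows: model.predict_proba(rows)[:, 1]    # scalar-output prediction function

    def sampled_shapley(predict, instance, background, feature, n_samples=200, seed=0):
        rng = np.random.default_rng(seed)
        contributions = []
        for _ in range(n_samples):
            order = rng.permutation(instance.shape[0])        # random feature ordering
            reference = background[rng.integers(len(background))]
            preceding = order[: np.argmax(order == feature)]  # features appearing before the one of interest
            with_f, without_f = reference.copy(), reference.copy()
            with_f[preceding] = instance[preceding]
            without_f[preceding] = instance[preceding]
            with_f[feature] = instance[feature]               # additionally include the feature of interest
            # Marginal contribution of the feature for this ordering and background sample
            contributions.append(predict(with_f[None, :])[0] - predict(without_f[None, :])[0])
        return float(np.mean(contributions))

    phi_0 = sampled_shapley(predict, X[0], background=X, feature=0)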

Anchors: in [ 50 ], another model-agnostic interpretability approach, which works for any black-box model with a high probability guarantee, was proposed. Under this approach, high-precision if-then rules, called anchors, are created and utilised to represent local, sufficient conditions for a prediction. More specifically, given a prediction for an instance, an anchor is defined as a rule that sufficiently decides the prediction locally, meaning that any changes to the other feature values of the instance do not essentially affect the prediction value. The anchors are constructed incrementally, using a bottom-up approach: each anchor is initialized with an empty rule, one that applies to every instance. Subsequently, in an iterative fashion, new candidate rules are generated, and the candidate rule with the highest estimated precision replaces the previous one for that specific anchor. If, at any point, the current candidate rule meets the definition of an anchor, the desired anchor has been identified and the iterative process terminates. The authors note that this approach, attempting to discover the shortest anchor, does not directly compute and optimise towards the highest coverage; however, they highlight that such short anchors are likely to have high coverage. By conducting a user study, the authors demonstrated that anchors not only lead to higher human precision when compared to linear explanations, but also require less effort from the user in both their application and their understanding/interpretation.

Originally proposed in [ 51 ], the contrastive explanations method (CEM) is capable of generating what the authors call contrastive explanations for any black-box model. More specifically, given any input and its corresponding prediction, the method can identify not only which features should be minimally and sufficiently present for that specific prediction to be produced, but also which features should be minimally and necessarily absent. Many interpretation methods focus on the former part and ignore the features that are minimally, but critically, absent when trying to form an interpretation. However, according to the authors, these absent parts play an important role when it comes to forming interpretations, and such interpretations are natural to humans, as demonstrated in domains such as healthcare and criminology. Luss et al. [ 52 ] extended the CEM framework to images with much richer structure. This was achieved by defining monotonic functions that correspond to, and enable, the introduction of more concepts into an image without the deletion of any existing ones.

Wachter et al. [ 53 ] proposed a lightweight model agnostic interpretability method providing counterfactual explanations, called counterfactuals. A counterfactual explanation of a prediction describes the smallest possible change that can be applied to the feature values, so that the output prediction can be changed to a desired predefined output. The goal of this approach was not to shed light on the inner workings of a black-box system or provide insight on its decision-making, but, instead, to identify and reveal which external factors would require changing in order for the desired output to be produced. The authors highlight that counterfactual explanations, being a minimal interpretability form, are not appropriate in all scenarios and pinpoint that, in cases where the goal is the understanding of a black-box system’s functionality or the rationalisation of automated decisions, using counterfactual explanations alone may even be insufficient. Despite the downsides that are described above, counterfactual explanations can serve as an easy first step that balances between the desired properties of transparency, explainability, and accountability, as well as regulatory business interests.
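A gradient-based sketch of this idea is shown below for a differentiable classifier: the input is iteratively adjusted so that the predicted probability of a desired class grows, while an L1 penalty keeps the counterfactual close to the original instance. The toy network, the exact loss form, and the hyperparameters are illustrative assumptions (the original formulation uses a tunable trade-off weight and a MAD-scaled distance).

    import torch
    from torch import nn

    # Toy differentiable classifier over four tabular features (untrained, for illustration only)
    model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2)).eval()

    def counterfactual(model, x, desired_class, lam=0.1, steps=500, lr=0.01):
        x_cf = x.clone().detach().requires_grad_(True)        # start the search from the original instance
        optimizer = torch.optim.Adam([x_cf], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            prob = model(x_cf).softmax(dim=1)[0, desired_class]
            # Prediction term pushes towards the desired class; the L1 term keeps the change small
            loss = (1.0 - prob) ** 2 + lam * torch.norm(x_cf - x, p=1)
            loss.backward()
            optimizer.step()
        return x_cf.detach()                                  # the proposed minimal change to the input

    x = torch.randn(1, 4)
    x_cf = counterfactual(model, x, desired_class=1)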

Van Looveren et al. [ 54 ] underlined some problems with the previous counterfactual approach [ 53 ], most notably that it does not take local, class-specific interpretability into account, and that the counterfactual searching process, growing proportionally to the dimensionality of the feature space, can be computationally expensive. In order to address these issues, they proposed an improved, faster, model-agnostic technique for finding interpretable counterfactual explanations of classifier predictions. This novel method incorporates class prototypes, constructed using either an encoder or class-specific k-d trees, in the cost function to enable the perturbations to converge much faster to an interpretable counterfactual, hence removing the computational bottleneck and making the method more suitable for practical applications. In order to illustrate the effectiveness of their approach and the quality of the produced counterfactuals, the authors introduced two new metrics focusing on local interpretability at the instance level. By conducting experiments on both image data (MNIST dataset) and tabular data (Wisconsin Breast Cancer dataset), they showed that prototypes help to produce counterfactuals of superior quality. Finally, they pointed out that the perturbation of an input variable implies some notion of distance or rank among the different values of the variable; a notion that is not naturally present in categorical variables. Therefore, producing meaningful perturbations and subsequent counterfactuals for categorical features is not as straightforward. To this end, the authors proposed the use of embeddings based on pairwise distances between the different values of a categorical variable, and empirically demonstrated the effectiveness of the proposed embeddings when combined with their method on census data.

Protodash: the work detailed in [ 55 ] regarding prototypes was extended in [ 56 ] by associating non-negative weights with prototypes, corresponding to their contribution, consequently creating a unifying, coherent framework for both prototypes and criticisms/outliers. Moreover, under the proposed framework, any symmetric positive definite kernel can be used, resulting in objective functions with desirable properties. Subsequently, the authors propose ProtoDash, a fast, mathematically sound approximation algorithm for prototype selection that operates under the proposed framework to optimally select prototypes and learn their non-negative weights.

Permutation importance (PIMP) [ 57 ] is a heuristic approach that attempts to correct the feature importance bias through the normalisation of feature importance measures. The method, following the assumption that the random importance of a feature follows some probability distribution, attempts to estimate its parameters. This is achieved by repeatedly permuting the response variable and measuring the resulting distribution of importance for each feature. The derived p-value serves as a proxy for the corrected measure of feature importance. The usefulness of the method was demonstrated using both simulated and real-world data to improve interpretability. As a result, an improved Random Forest (RF) model, called PIMP-RF, was proposed, which is trained on the most important features, as determined by the PIMP algorithm. PIMP can be used to complement and improve any feature-importance ranking algorithm by assigning p-values to each variable according to their permuted importance, thus improving model performance as well as model interpretability.
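A minimal sketch of the PIMP idea follows: repeatedly permute the response, refit the model, and use the resulting null distribution of importances to derive a p-value per feature. The empirical p-value used here is a simplification (the published method fits a parametric distribution to the null importances), and the synthetic data and choice of a Random Forest are illustrative assumptions; repeated refitting can be expensive in practice.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

    def pimp_p_values(X, y, n_permutations=50, seed=0):
        rng = np.random.default_rng(seed)
        observed = RandomForestClassifier(random_state=seed).fit(X, y).feature_importances_
        null_importances = np.zeros((n_permutations, X.shape[1]))
        for i in range(n_permutations):
            y_perm = rng.permutation(y)        # break the relationship between the features and the response
            null_importances[i] = RandomForestClassifier(random_state=seed).fit(X, y_perm).feature_importances_
        # Empirical p-value: how often the permuted importance reaches the observed importance
        return (null_importances >= observed).mean(axis=0)

    p_values = pimp_p_values(X, y)             # small values indicate genuinely informative features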

L2X [ 58 ] is a real-time, instance-wise feature selection method that can also be used for model interpretation. More specifically, given a single training example, it tries to find the subset of its input features that are the most informative in terms of the corresponding prediction for that instance. The subset is decided by a feature selector, through variational approximation, which is solely optimised towards maximising the mutual information between the input features and the respective label. In the same study, a new measure, called post-hoc accuracy, was proposed in order to evaluate the performance of the L2X method in a quantitative way. Experiments using both real and synthetic data sets illustrate the effectiveness of the method not only in terms of post-hoc accuracy, but also in terms of human-judgment evaluation, especially when it comes to nonlinear additive and feature-switching data sets.

Friedman [ 59 ] proposed Partial Dependence Plots (PDPs), a visualisation tool that helps to interpret any black-box predictive model by plotting the impact of specific features, or subsets of features, on the model’s predictions. More specifically, PDPs show how a certain set of features affects the average predicted value by marginalizing the rest of the features (its complement feature set) out. PDPs are usually very simplistic and do not take all the different feature interactions into account. As a result, most of the time they cannot provide a very accurate approximation of the true functional relationships between the dependent and independent variables. However, they can often reveal useful information, thus greatly assisting in interpreting black-box models, especially in cases where most of these interactions are of low order. Although primarily used to identify the partial relationship between a set of given features and the corresponding predicted value, PDPs can also provide visualisations for both single and multi-class problems, as well as for the interactions between features. In Figure 5 , the PDP of a Random Forest model is presented, illustrating the relationship between age (feature) and income percentile (label) using the Census Income dataset (UCI Machine Learning Repository).

Figure 5. PDP of a Random Forest model, illustrating the relationship between age (feature) and income percentile (label) using the Census Income dataset (UCI Machine Learning Repository).

Originally proposed in [ 60 ], ICE (Individual Conditional Expectation) plots are a model-agnostic interpretability method that builds on the concept of PDPs and improves upon it. After highlighting the limitations of PDPs in capturing the complexity of the modelled relationship when substantial interaction effects are present, the authors present a refinement of the original concept. Under this refinement, each plot illustrates the functional relationship between the predicted value and the feature for individual instances. As a result, given a feature, the entire distribution of individual conditional expectation functions becomes available, which enables the identification of heterogeneities and their extent.
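The sketch below shows how both quantities can be computed for any fitted model: sweeping one feature over a grid and predicting for every instance yields the ICE curves, and averaging them yields the partial dependence curve. The synthetic regression data and the Random Forest model are illustrative assumptions.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=400, n_features=4, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X, y)

    def ice_and_pdp(predict, X, feature, grid_points=20):
        grid = np.linspace(X[:, feature].min(), X[:, feature].max(), grid_points)
        ice = np.empty((X.shape[0], grid_points))
        for j, value in enumerate(grid):
            X_mod = X.copy()
            X_mod[:, feature] = value          # force the feature to the grid value for every instance
            ice[:, j] = predict(X_mod)         # one curve per instance: the ICE curves
        pdp = ice.mean(axis=0)                 # averaging the ICE curves marginalises out the other features
        return grid, ice, pdp

    grid, ice_curves, pdp_curve = ice_and_pdp(model.predict, X, feature=0)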

Another method closely related to PDPs is the Accumulated Local Effects (ALE) plot [ 61 ]. ALE plots try to address the most significant shortcoming of PDPs, their assumption of independence among features, by computing the conditional instead of the marginal distribution. More specifically, in order to average over other features, instead of averaging the predictions, ALE plots calculate the average differences in predictions, thus blocking the effect of correlated features.

LIVE [ 62 ] is a method similar to LIME [ 45 ], as they both utilise surrogate models to approximate local properties of black-box models and produce coefficients of these surrogate models that are subsequently used as interpretations. However, LIVE differentiates itself from LIME in terms of local exploration, as well as in terms of the handling of interpretable inputs. LIVE does not create an interpretable input space by transforming the input features, but, instead, makes use of the original feature space; artificial datapoints for local exploration are generated by perturbing the datapoint in question, one feature at a time. Because the perturbed datapoints very closely match the original ones, similarities among them are measured using the identity kernel, while the original features are used as interpretable inputs.

The breakDown method, as proposed in [ 62 ], is similar to SHAP [ 48 ], as both, based on the conditioned responses of a black-box model, attempt to attribute them proportionally to the input features. However, unlike SHAP, in which the contribution of a feature is averaged over all possible conditionings, the breakDown method deals with conditionings in a greedy way, only considering a single series of nested conditionings. This approach, although not as theoretically sound as SHAP, is faster to compute and more natural in terms of interpretation.

ProfWeight: Dhurandhar et al. [ 63 ] proposed transferring knowledge from high-performing pre-trained deep neural networks to a low performing, but interpretable non-complex model to improve its performance. This was achieved by using confidence scores that are indicative of the higher level data representations that were learnt by the intermediate layers of the deep neural network, as weights during the training process of the interpretable, non-complex model.

This concludes the examination of machine interpretability methods that explain the black-box models. A diverse pool of methods, exploiting different kinds of information, have been developed, offering explanations for the different types of models as well as the different types of data, with the majority of the literature focussing on image and text data. That said, there has not been a best-in-class method developed to address every need, as most methods focus on either a specific type of model, or a specific type of data, or their scope is either local or global, but not both. Of the methods presented, SHAP is the most complete method, providing explanations for any model and any type of data, doing so at both a global and local scope. However, SHAP is not without shortcomings: The kernel version of SHAP, KernelSHAP, like most permutation based methods, does not take feature dependence into account, thus often over-weighing unlikely data points and, while TreeSHAP, the tree version of SHAP, solves this problem, its reliance on conditional expected predictions is known to produce non-intuitive feature importance values as features with no impact on predictions can be assigned an importance value that is different to zero.

3.2. Interpretability Methods to Create White-Box Models

This category encompasses methods that create models which are interpretable and easily understandable by humans. The models in this category are often called intrinsic, transparent, or white-box models. Such models include linear, decision-tree, and rule-based models, as well as some other, more complex and sophisticated models that are equally transparent and, therefore, promising for the interpretability field. This work will focus on the more complex models, as linear, decision-tree, and elementary rule-based models have been extensively discussed in many other scientific studies. A summary of the discussed interpretability methods for creating white-box models can be found in Table 3 .

Table 3. Interpretability Methods to Create White-Box Models.

Ustun and Rudin [ 64 ] proposed Supersparse Linear Integer Models (SLIM), a type of predictive system that only allows additions, subtractions, and multiplications of input features to generate predictions, thus being highly interpretable.

In [ 65 ], Microsoft presented two case studies on real medical data, where naturally interpretable generalized additive models with pairwise interactions (GA²Ms), as originally proposed in [ 66 ], achieved state-of-the-art accuracy, showing that GA²Ms are the first step towards deploying interpretable, high-accuracy models in applications like healthcare, where interpretability is of utmost importance. GA²Ms are generalized additive models (GAMs) [ 67 ], but with a few tweaks that set them apart, in terms of predictive power, from traditional GAMs. More specifically, GA²Ms are trained using modern machine learning techniques such as bagging and boosting, while their boosting procedure uses a round-robin approach through features in order to reduce the undesirable effects of co-linearity. Furthermore, any pairwise interaction terms are automatically identified and, therefore, included, which further increases their predictive power. In terms of interpretability, as additive models, GA²Ms are naturally interpretable, being able to calculate the contributions of each feature towards the final prediction in a modular way, thus making it easy for humans to understand the degree of impact of each feature and gain useful insight into the model’s predictions.
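As a brief usage sketch, assuming the open-source InterpretML package (whose Explainable Boosting Machine is an implementation of GA²Ms), a model of this kind can be trained and inspected as follows; the breast-cancer dataset is used only as a placeholder.

    from sklearn.datasets import load_breast_cancer
    from interpret.glassbox import ExplainableBoostingClassifier
    from interpret import show

    X, y = load_breast_cancer(return_X_y=True)
    ebm = ExplainableBoostingClassifier()     # boosted GAM with automatically detected pairwise interactions
    ebm.fit(X, y)

    show(ebm.explain_global())                # per-feature shape functions and overall importances
    show(ebm.explain_local(X[:5], y[:5]))     # per-feature contributions for individual predictions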

Boolean Rule Column Generation [ 68 ] is a technique that utilises Boolean rules, either in their disjunctive normal form (DNF) or in their conjunctive normal form (CNF), in order to create predictive models. In this case, interpretability is achieved through rule simplicity: a small number of Boolean rules, with few clauses and conditions in each clause, can more easily be understood and interpreted by humans. The authors highlighted that most column generation algorithms, although efficient, can lead to computational issues when it comes to learning rules for large datasets, due to the exponential size of the rule-space, which corresponds to all possible conjunctions or disjunctions of the input features. As a solution, they introduced an approximate column-generation algorithm that employs randomization in order to efficiently search the rule-space and learn interpretable DNF or CNF classification rules, while optimally balancing the trade-off between classification accuracy and rule simplicity.

Generalized Linear Rule Models [ 69 ], which are often referred to as rule ensembles, are Generalized Linear Models (GLMs) [ 70 ] that are linear combinations of rule-based features. The benefit of such models is that they are naturally interpretable, while also being relatively complex and flexible, since rules are able to capture nonlinear relationships and dependencies. Under the proposed approach, a GLM is re-fit as rules are created, thus allowing for existing rules to be re-weighted, ultimately producing a weighted combination of rules.

Hind et al. [ 71 ] introduced TED, a framework for producing local explanations that match the mental model of a given domain. The goal of TED is not to dig into the reasoning process of a model but, instead, to mirror the reasoning process of a human expert in a specific domain, effectively creating a domain-specific explanation system.

In summary, not a lot of progress has been made in recent years towards developing white-box models. This is most likely the result of the immense complexity that modern applications require, in combination with the inherent limitations of such models in terms of predictive power, especially in computer vision and natural language processing, where the difference in performance when compared to deep learning models is unbridgeable. Furthermore, because models are increasingly expected to perform well on more than one task and the transfer of knowledge from one domain to another is becoming a common theme, white-box models, currently able to perform well only on a single task, are losing traction within the literature and are dropping further behind in terms of interest.

3.3. Interpretability Methods to Restrict Discrimination and Enhance Fairness in Machine Learning Models

Because machine learning systems are increasingly adopted in real-life applications, any inequities or discrimination promoted by those systems have the potential to directly affect human lives. Machine Learning Fairness is a sub-domain of machine learning interpretability that focuses solely on the social and ethical impact of machine learning algorithms by evaluating them in terms of impartiality and discrimination. The study of fairness in machine learning is becoming broader and more diverse, and it is progressing rapidly. Traditionally, the fairness of a machine learning system has been evaluated by checking the model's predictions and errors across certain demographic segments, for example, groups of a specific ethnicity or gender. In terms of dealing with a lack of fairness, a number of techniques have been developed both to remove bias from training data and from model predictions and to train models that learn to make fair predictions in the first place. In this section, the most widely used machine learning fairness methods are presented, discussed and, finally, summarised in Table 4 .

Table 4. Interpretability Methods to Restrict Discrimination and Enhance Fairness in Machine Learning Models.

Disparate impact testing [ 72 ] is a model agnostic method that is able to assess the fairness of a model, but is not able to provide any insight or detail regarding the causes of any discovered bias. The method conducts a series of simple experiments that highlight any differences in terms of model predictions and errors across different demographic groups. More specifically, it can detect biases regarding ethnicity, gender, disability status, marital status, or any other demographic. While straightforward and efficient when it comes to selecting the most fair model, the method, due to the simplicity of its tests, might fail to pick up on local occurrences of discrimination, especially in complex models.

A way to ensure fairness in machine learning models is to alter the model construction process. In [ 73 ], three different data preprocessing techniques to ensure fairness in classification tasks are analysed. The first technique, called suppression, detects the features that correlate the most, according to some threshold, with any sensitive features, such as gender or age. In order to diminish the impact of the sensitive features on the model's decisions, the sensitive features, along with their most correlated features, are removed before training. This forces the model to learn from, and therefore base its decisions on, other attributes, thus not being biased against certain demographic segments. The second technique, called "massaging the dataset", was originally proposed in [ 74 ]. In order to remove the discrimination from the input data, some instances in the dataset are relabelled. First, using a ranker, the instances most likely to be victims (discriminated ones) or most likely to be profiters (favoured ones) are detected, according to their probability of belonging to the corresponding class without taking the sensitive attributes into account. Subsequently, their labels are changed and a classifier is trained on this modified data that is free of bias. Finally, the idea behind the third preprocessing technique, initially presented in [ 75 ], is to apply different weights to the instances of the dataset based on frequency counts with respect to the sensitive column. The weight of an instance is calculated as the expected probability that its sensitive feature value and class appear together, assuming they are independent, divided by the respective observed probability. Reweighing has a similar effect to the "massaging" approach, but its major advantage is that it does not alter the labels of the dataset. Similarly to disparate impact testing, the described data preprocessing methods might fail to pick up on local occurrences of discrimination, especially in complex models.
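The reweighing scheme described above can be sketched in a few lines of pandas; the column names (sensitive, label) are hypothetical placeholders for whatever sensitive attribute and class label a dataset contains.

```python
import pandas as pd

def reweighing_weights(df: pd.DataFrame, sensitive: str, label: str) -> pd.Series:
    """Weight = P_expected(sensitive value and class, assuming independence)
               / P_observed(sensitive value and class)."""
    n = len(df)
    p_sensitive = df[sensitive].value_counts(normalize=True)   # P(S = s)
    p_label = df[label].value_counts(normalize=True)           # P(Y = y)
    p_joint = df.groupby([sensitive, label]).size() / n        # P(S = s, Y = y)

    expected = df.apply(lambda r: p_sensitive[r[sensitive]] * p_label[r[label]], axis=1)
    observed = df.apply(lambda r: p_joint[(r[sensitive], r[label])], axis=1)
    return expected / observed
```

The resulting weights can then be supplied through the sample_weight argument that most standard classifier implementations accept during training.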

Another data preprocessing technique for removing bias from machine learning models was proposed in [ 76 ]. More specifically, having the following three goals in mind, namely controlling discrimination, limiting distortion in individual instances, and preserving utility, the authors derived a convex optimization formulation for learning a data representation that captures these goals.

Adversarial debiasing [ 77 ] is a framework for mitigating biases concerning demographic segments in machine learning systems by selecting a feature regarding the segment of interest and, subsequently, training a main model and an adversarial model simultaneously. The main model is trained to predict the label, whereas the adversarial model, based on the main model's prediction for each instance, tries to predict the demographic segment of the instance; the objective is to maximise the main model's ability to correctly predict the label, while, at the same time, minimising the adversarial model's ability to predict the demographic segment in question. Adversarial debiasing can be applied to both regression and classification tasks, regardless of the complexity of the chosen model. With regards to the sensitive features of interest, both continuous and discrete values can be handled, and any imposed constraints can be enforced across multiple definitions of fairness.
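The core training loop of such a setup can be sketched as follows in PyTorch; this is a simplified alternating-update version of the general idea rather than the exact formulation of [ 77 ], and the network sizes, loss weight, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Predictor maps 20 features to a label logit; the adversary tries to recover
# the sensitive attribute from the predictor's output (shapes are assumptions).
predictor = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 1))
adversary = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))

opt_pred = torch.optim.Adam(predictor.parameters(), lr=1e-3)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
lam = 1.0  # strength of the fairness penalty

def training_step(x, y, s):
    # x: (batch, 20) features, y: (batch, 1) labels, s: (batch, 1) sensitive attribute.
    # 1) Update the adversary to predict s from the (detached) predictor logits.
    logits = predictor(x).detach()
    adv_loss = bce(adversary(logits), s)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # 2) Update the predictor: fit the label while making s hard to recover.
    logits = predictor(x)
    pred_loss = bce(logits, y) - lam * bce(adversary(logits), s)
    opt_pred.zero_grad(); pred_loss.backward(); opt_pred.step()
```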

Kamiran et al. [ 78 ] pointed out that many of the methods that make classifiers aware of discriminatory biases require data modifications or algorithm tweaks and are not very flexible with respect to handling multiple sensitive features and controlling the performance vs. discrimination trade-off. As a solution to these problems, two new methods that utilise decision theory in order to create discrimination-aware classifiers were proposed, namely Reject Option based Classification (ROC) and Discrimination-Aware Ensemble (DAE), neither of which requires any data preprocessing or classifier adjustments. ROC can be viewed as a cost-based classification method, in which misclassifying an instance of a non-favoured group as negative results in a much higher penalty than wrongly predicting a member of a favoured group as negative. DAE employs an ensemble of classifiers. Ensembles, by nature, can be very useful in reducing bias. According to the authors, this is because the greater the number of classifiers and the more diverse these classifiers are, the higher the probability that some of them will be fair. Under this assumption, the discrimination awareness of such an ensemble can be controlled by adjusting the diversity of its voting classifiers, while the trade-off between accuracy and discrimination in DAEs depends on the disagreements between the voting classifiers and the number of instances that are incorrectly classified.

Liu et al. [ 79 ] highlighted that most work in machine learning fairness had studied the notion of fairness within static environments, and had not been concerned with how decisions change the underlying population over time. They argued that seemingly fair decision rules have the potential to cause harm to disadvantaged groups and presented the case of loan decisions as an example where the introduction of seemingly fair rules can actually decrease the credit score of the affected population over time. After emphasising the importance of temporal modelling and continuous measurement in evaluating what is considered fair, they concluded that, in order for fairness rules to be set, rather than just considering what seems fair at a stationary point, an approach is needed that takes into consideration the long-term effects of such rules on the population in a dynamic fashion.

The problem of algorithmically allocating resources when in shortage was studied in [ 80 ] and, more specifically, the notion of fairness within this procedure in terms of groups and the potential consequences. An efficient learning algorithm is proposed that converges to an optimal fair allocation, even without any prior knowledge of the frequency of instances in each group; only the number of instances that received the resource in a given allocation is known, rather than the total number of instances. This can be translated to the fact that the creditworthiness of individuals not given loans is not known in the case of loan decisions, or to the fact that some crimes committed in areas of low policing presence are not known either. As an application of their framework, the authors considered the predictive policing problem and experimentally evaluated their proposed algorithm on the Philadelphia Crime Incidents dataset. The effectiveness of the proposed method was demonstrated, as, although trained on arrest data produced by its own predictions for the previous days, potentially resulting in feedback loops, the algorithm managed to overcome them.

Feedback loops in the context of predictive policing and the allocation of policing resources were also studied in [ 81 ]. More specifically, the authors first highlighted that feedback loops are a known issue in predictive policing systems, where a common scenario includes police resources being spent repeatedly on the same areas, regardless of the true crime rate. Subsequently, they developed a mathematical model of predictive policing, which revealed the reasons behind the occurrence of feedback loops and showed that a relationship exists between the severity of problems caused by a runaway feedback loop and the variance in crime rates among areas. Finally, upon acknowledging that incidents reported by citizens can alleviate the impact of runaway feedback, the authors demonstrated ways of altering the model inputs through which predictive policing systems that are able to overcome the runaway feedback loop, and are therefore capable of learning the true crime rate, can be produced.

Models of strategic manipulation are a category of models that attempt to capture the dynamics between a classifier and agents in an environment where all of the agents are capable, to the same degree, of manipulating their features in order to deceive the classifier into making a decision in their favour. In real-world social environments, however, an individual's ability to adapt to an algorithm does not merely relate to their personal benefit of getting a favourable decision; instead, it heavily depends on a number of complex social interactions and factors within the environment. In [ 82 ], strategic manipulation models were studied and adapted in an environment of social inequality, in which different social groups have to pay different costs of manipulation. It was proven that, in such a setting, a decision-making model exhibited a behaviour where some members of the advantaged group incorrectly received a favourable decision, while some members of the disadvantaged group incorrectly received a non-favourable one. The results also demonstrated that any tools attempting to evaluate an individual's fitness or eligibility can potentially have harmful social consequences when the individuals' capacities to adaptively respond differ. Finally, the authors concluded that the increasing use of decision-making machine learning tools in our imperfect social world will require the design of suitable models and the development of a sound theoretical background that would explicitly address critical issues, such as social stratification and unequal access, in order for true fairness to be achieved.

Milli et al. [ 83 ] also studied how individuals adjust their behaviour strategically to manipulate decision rules in order to gain favourable treatment from decision-making models. They reiterated that the design of more conservative decision boundaries, in an effort to enhance the robustness of decision-making systems against such forms of distributional shift, is significantly needed in order for fairness to be achieved. However, the authors showed, through experimentation, that although stricter decision boundaries add benefit to the decision maker, this comes at the expense of the individuals being classified. There is, therefore, a trade-off between the accuracy of the decision maker and the impact on the individuals in question. More specifically, a notion of "social burden" was introduced in order to quantify the cost of strategic decision making as the expected cost that a positive individual needs to meet to be correctly classified as positive, and it was proven that any increase in the accuracy of the decision maker necessarily corresponds to an increase in the social burden for the individuals. Lastly, they empirically demonstrated that any extra costs occurring for individuals have the potential to be disproportionately harmful towards the already disadvantaged groups of individuals, highlighting that any strategy towards more accurate decision making must also weigh in social welfare and fairness factors.

Counterfactual fairness, as defined strictly in [ 84 ], attempts to capture the intuition that a decision affecting an individual is fair if it would affect that individual in the same way in both the actual world and a counterfactual world in which the individual belonged to a different demographic group. In the same study, it was argued that it is crucial for causality in fairness to be addressed, and subsequently a framework for modelling fairness using tools from causal inference was proposed. According to the authors, causal measures of fairness should not only consist of quantities free of counterfactuals; it is also essential that counterfactual causal guarantees are pursued. The proposed framework, based on the idea of counterfactual fairness, allows users to produce models that, instead of merely ignoring sensitive attributes that potentially reflect social biases towards individuals, are able to take such features into consideration and compensate for them accordingly.

The fairness of word embeddings, a vectorised representation of text data used in many real-world machine learning applications, was studied in [ 85 ], and it was revealed that word embeddings, even those trained on Google News articles, carry strong gender bias. More specifically, two properties that are very useful for embedding debiasing were shown. Firstly, it was shown that there exists a direction in the embedding space that captures gender stereotypes. Secondly, it was shown that gender-neutral words can be linearly separated from gender-definition words in the embedding space. Subsequently, metrics for quantifying both the direct and indirect gender stereotypes present in word embeddings were created, and the authors proposed an algorithm that utilises the two properties above and tweaks the embedding vectors in order for gender bias to be removed.
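The central "neutralise" step of such debiasing, removing a word vector's projection onto an estimated gender direction, can be sketched as follows; note that the gender direction is crudely estimated here from a single definitional word pair, whereas [ 85 ] derives it from several pairs via PCA, so this is only an illustrative approximation.

```python
import numpy as np

def gender_direction(emb: dict) -> np.ndarray:
    # Crude estimate from one definitional pair; [85] uses PCA over several pairs.
    d = emb["she"] - emb["he"]
    return d / np.linalg.norm(d)

def neutralise(vec: np.ndarray, direction: np.ndarray) -> np.ndarray:
    # Remove the component of the vector that lies along the gender direction.
    projection = np.dot(vec, direction) * direction
    debiased = vec - projection
    return debiased / np.linalg.norm(debiased)

# Usage: emb is a {word: vector} mapping, e.g. loaded from pretrained embeddings.
# emb["doctor"] = neutralise(emb["doctor"], gender_direction(emb))
```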

According to [ 86 ], fairness should be realised not only segment-wise, but also at an individual level. In order to achieve this, fairness was formulated into a data representation problem, where any representations learnt would need to be optimised towards two competing objectives: similar individuals should have similar encodings; however, such encodings should be ignorant of any sensitive information regarding the individual.

In [ 87 ], three approaches for making the naive Bayes classifier discrimination-free were proposed. The first approach is to regulate the conditional probability distribution of the sensitive feature values given that the label is positive, by boosting the probability of the disadvantaged sensitive feature values given the positive label, while, at the same time, decreasing the probability of the favoured sensitive feature values given the positive label. While easy to follow and implement, this approach brings the downside of either reducing or boosting the number of positive labels produced by the model, depending on the difference between the frequency of the favoured sensitive values and the frequency of the discriminated sensitive values in the input data. The second approach involves training a different model for every sensitive attribute value. The case where a sensitive feature has two values, and therefore two models were trained, was illustrated: one model was developed using only the rows that had the favoured sensitive value, while the other model only utilised the rows that had the discriminated sensitive value. The different models are part of a meta-model, in which discrimination is mitigated by adjusting the conditional probability, as described in the first approach. In the third approach, a latent variable is introduced to the modelling procedure, corresponding to the non-biased label, and the model parameters are optimised towards likelihood maximisation using the expectation-maximization (EM) algorithm.

In [ 88 ], a framework for fair classification, which consisted of two parts, was presented. More specifically, the first part involves the development of a task-specific metric in order to evaluate the degree of similarity among individuals with respect to the classification task, whereas the second part consists of an algorithmic procedure that is capable of maximizing the objective function, subject to the fairness constraint, according to which, similar decisions should be made for similar individuals. Furthermore, the framework was adjusted towards the much related goal of guaranteeing statistical parity, while, as previously, ensuring that similar individuals are provided with analogous decisions. Finally, the close relationship between privacy and fairness was discussed and, more specifically, how fairness can be further promoted using tools and approaches developed within the framework of differential privacy.

The difference between the fairness of the decision making process, also known as procedural fairness, and the fairness of the decision outcomes, also known as distributive fairness, was brought up by the authors of [ 89 ], who also emphasised that the majority of the scientific work on machine learning fairness revolved around the latter. For this gap to be bridged, procedural fairness metrics were introduced in order for the impact of input features used in the decision to be taken into consideration and for the moral judgments of humans regarding the use of these features to be quantified.

Building on from [ 90 ], where the concept of meritocratic fairness was introduced, Kearns et al. [ 91 ] performed a more comprehensive analysis of the broader issue of realising superior guarantees in terms of performance, while relaxing the model assumptions. Furthermore, the issue of fairness in infinite linear bandit problems was studied and a scheme for meritocratic fairness for online linear problems was produced, which is significantly more generic and robust than the existing methods. Under this scheme, fairness is satisfied by ensuring optimality in terms of reward: no actions that lead to preferential treatment are taken unless the algorithm is certain that the reward of such an action would be higher. In practice, this is achieved by calculating confidence intervals around the expected rewards for the different individuals and, based on this process, two individuals are said to be linked if their corresponding confidence intervals overlap, and chained if they can reach each other through a chain of intermediate linked individuals.

The fact that the majority of notions or definitions of machine learning fairness merely focus on predefined social segments was criticised in [ 96 ]. More specifically, it was highlighted that such simplistic constraints, while forcing classifiers to achieve fairness at segment-level, can potentially bring discrimination upon sub-segments that consist of certain combinations of the sensitive feature values. As a first step towards addressing this, the authors proposed defining fairness across an exponential or infinite number of sub-segments, which were determined over the space of sensitive feature values. To this end, an algorithm that produces the most fair, in terms of sub-segments, distribution over classifiers was proposed. This is achieved by the algorithm through viewing the sub-segment fairness as a zero-sum game between a Learner and an Auditor, as well as through a series of heuristics.

Following up from other studies demonstrating that the exclusion of sensitive features cannot fully eradicate discrimination from model decisions, Kamishima et al. [ 99 ] presented and analysed three major causes of unfairness in machine learning: prejudice, underestimation, and negative legacy. In order to address the issue of indirect prejudice, a regulariser that was capable of restricting the dependence of any probabilistic discriminative model on sensitive input features was developed. By incorporating the proposed regulariser to logistic regression classifiers, the authors demonstrated its effectiveness in purging prejudice.

In [ 92 ], a framework for quantifying and reducing discrimination in any supervised learning model was proposed. First, an interpretable criterion for identifying discrimination against any specified sensitive feature was defined and a formula for developing classifiers that fulfil that criterion was introduced. Using a case study, the authors demonstrated that, according to the defined criterion, the proposed method produced the Bayes optimal non-discriminating classifier, and justified the use of postprocessing over the alternative of altering the training process by measuring the loss that results from the enforcement of the non-discrimination criterion. Finally, the potential limitations of the proposed method were identified and pinpointed by the authors, as it was shown that not all dependency structures and not all other proposed definitions or intuitive notions of fairness can be captured using the proposed criterion.
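An implementation of this postprocessing idea is available, for instance, as the ThresholdOptimizer in the fairlearn library; the sketch below uses synthetic placeholder data and an equalized-odds constraint, and exact argument names may differ between library versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.postprocessing import ThresholdOptimizer

# Synthetic placeholder data: a binary label and a binary sensitive attribute.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
A = rng.integers(0, 2, size=1000)                  # sensitive attribute
y = (X[:, 0] + 0.5 * A + rng.normal(size=1000) > 0).astype(int)

base_model = LogisticRegression().fit(X, y)

# Postprocess the trained model so that its decisions satisfy equalized odds.
postprocessor = ThresholdOptimizer(
    estimator=base_model, constraints="equalized_odds", prefit=True
)
postprocessor.fit(X, y, sensitive_features=A)
fair_predictions = postprocessor.predict(X, sensitive_features=A)
```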

Pleiss et al. [ 97 ], building on from [ 92 ], studied the problem of producing calibrated probability scores, the end goal of many machine learning applications, while, at the same time, ensuring fair decisions across different demographic segments. They demonstrated, through experimentation on a diverse pool of datasets, that probability calibration is only compatible with cases where fairness is pursued with respect to a single error constraint, and concluded that maintaining both fairness and calibrated probabilities, although desirable, is often nearly impossible to achieve in practice. For the compatible cases, a simple postprocessing technique was proposed that calibrates the output scores, while, at the same time, maintaining fairness by suppressing the information of randomly chosen input features.

Celis et al. [ 98 ] highlighted that, although efforts have been made in recent studies to achieve fairness with respect to some particular metric, some important metrics have been ignored, while some of the proposed algorithms are not supported by a solid theoretical background. To address these concerns, they developed a meta-classifier with strong theoretical guarantees that can handle multiple fairness constraints with respect to multiple non-disjoint sensitive features, thus enabling the adoption and employment of fairness metrics that were previously unavailable.

In [ 94 ], a new metric was introduced for evaluating decision boundary fairness in terms of both disparate treatment and disparate impact at the same time, with respect to one or more sensitive features. Furthermore, utilising this metric, the authors designed a framework comprising two contrasting formulations: the first one optimises for accuracy subject to fairness constraints, while the second one optimises towards fairness subject to accuracy constraints. The proposed formulations were implemented for logistic regression and support vector machines and evaluated on real-world data, showing that they offer fine-grained control over the tradeoff between the degree of fairness and predictive accuracy.

Following up from their previous work [ 94 ], Zafar et al. [ 93 ] introduced a novel notion of unfairness, defined through the rates of misclassification, called disparate mistreatment. Subsequently, they proposed intuitive ways of measuring disparate mistreatment in classifiers that rely on decision boundaries to make decisions. By experimenting on both synthetic and real-world data, they demonstrated how easily the proposed measures can be converted into optimisation constraints, and thus incorporated into the training process, and how well they work in terms of reducing disparate mistreatment while maintaining high accuracy standards. However, they warned of the potential limitations of their method due to the absence of any theoretical guarantees on the global optimality of the solution, as well as due to the approximation methods used, which might prove to be inaccurate when applied to small datasets.

In another work by Zafar et al. [ 100 ], it was pointed out that many of the existing notions of fairness, regarding treatment or impact, are too rigorous and restrictive and, as a result, tend to hinder overall model performance. In order to address this, the authors proposed notions of fairness that are based on the collective preference of the different demographic groups. More specifically, their notion of fairness tries to encapsulate which treatment or outcome the different demographic groups would prefer when given a list of choices to pick from. For these preferences to be taken into consideration, proxies that capture and quantify them were formulated by the authors, and boundary-based classifiers were optimised with respect to these proxies. Through empirical evaluation using a variety of both real-world and synthetic datasets, it was illustrated that classifiers pursuing fairness based on group preferences achieved higher predictive accuracy than those seeking fairness through strictly defined parity.

Agarwal et al. [ 95 ] introduced a systematic framework that incorporates many other previously outlined definitions of fairness, treating them as special cases. The core concept behind the method is to reduce the problem of fair classification to a sequence of cost-sensitive classification sub-problems, subject to the given constraints. In order to demonstrate the effectiveness of the framework, the authors proposed two specific reductions that optimally balance the tradeoff between predictive accuracy and any single-criterion definition of fairness.
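This reductions approach is available, for example, as ExponentiatedGradient in the fairlearn library; the sketch below reuses synthetic placeholder data and a demographic parity constraint purely as an illustration, and the wrapped estimator is an arbitrary choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Synthetic placeholder data, as before.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
A = rng.integers(0, 2, size=1000)
y = (X[:, 0] + 0.5 * A + rng.normal(size=1000) > 0).astype(int)

# Wrap any standard estimator that supports sample weights in the reduction.
mitigator = ExponentiatedGradient(
    estimator=DecisionTreeClassifier(max_depth=4),
    constraints=DemographicParity(),
)
mitigator.fit(X, y, sensitive_features=A)
predictions = mitigator.predict(X)
```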

In Figure 6 and Figure 7 , the use of machine learning interpretability methods to reduce discrimination and promote fairness is presented. More specifically, in Figure 6 parity testing is applied using the aequitas library on the ProPublica COMPAS Recidivism Risk Assessment dataset, whereas in Figure 7 , a comparison of the level of race bias (bias disparity) among different groups in the sample population is shown.

Figure 6. Parity testing, using the aequitas library, on the ProPublica COMPAS Recidivism Risk Assessment dataset, with three metrics: False Positive Rate Disparity, False Discovery Rate Disparity, and True Positive Rate Disparity.

Figure 7. Comparison of the level of race bias (bias disparity) among different groups in the sample population.
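An audit of the kind shown in Figures 6 and 7 can be reproduced roughly as follows with the aequitas library; the dataframe (and its file path) is a hypothetical placeholder assumed to contain the score, label_value, and attribute columns (e.g. race) that aequitas expects, and the exact API may vary across releases.

```python
import pandas as pd
from aequitas.group import Group
from aequitas.bias import Bias
from aequitas.fairness import Fairness

# df holds one row per individual: binary model scores in 'score',
# ground-truth labels in 'label_value', and attributes such as 'race'.
df = pd.read_csv("compas_scores.csv")  # hypothetical path

group = Group()
crosstabs, _ = group.get_crosstabs(df)            # per-group confusion-matrix metrics

bias = Bias()
disparities = bias.get_disparity_predefined_groups(
    crosstabs, original_df=df, ref_groups_dict={"race": "Caucasian"}
)                                                 # metric ratios vs. a reference group

fairness = Fairness()
audit = fairness.get_group_value_fairness(disparities)  # pass/fail parity tests
```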

In conclusion, fairness is a relatively new domain of machine learning interpretability, yet the progress made in the last few years has been tremendous. Various methods have been created in order to protect disadvantaged demographic segments against social bias and to ensure a fair allocation of resources. These methods concern data manipulations prior to model training, algorithmic modifications within the training process, as well as post-hoc adjustments. However, most of these methods, regardless of the step of the process at which they are applied, focus too much on group-level fairness and often ignore individual-level factors, both within the groups and at a global scale, potentially mistreating individuals in favour of groups. Furthermore, only a tiny portion of the scientific literature is concerned with fairness in non-tabular data, such as images and text; therefore, a large gap exists in these unexplored areas, to be filled in the coming years.

3.4. Interpretability Methods to Analyse the Sensitivity of Machine Learning Model Predictions

This category includes interpretability methods that attempt to assess and challenge the machine learning models in order to ensure that their predictions are trustworthy and reliable. These methods apply some form of sensitivity analysis, as models are tested with respect to the stability of their learnt functions and how sensitive their output predictions are with respect to subtle yet intentional changes in the corresponding inputs. Sensitivity analysis can be both a global and local interpretation technique, depending on whether the change to the output is analysed with respect to a single example or across all examples in the dataset. Traditional and adversarial example-based sensitivity methods are presented and discussed in this section, while their corresponding summaries are provided in Table 5 and Table 6 respectively.

Table 5. Traditional Sensitivity Analysis Methods.

Table 6. Adversarial Example-based Sensitivity Analysis.

3.4.1. Traditional Sensitivity Analysis Methods

Traditional sensitivity analysis methods try to represent each input variable with a numeric value, called the sensitivity index. Sensitivity indices can be first-order indices, measuring the contribution of a single input variable to the output variance; second-, third- or higher-order indices, measuring the contribution of the interaction between two, three, or more input variables to the output variance, respectively; or total-effect indices, combining the contributions of first-order and higher-order interactions with respect to the output variance.
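For reference, the first-order and total-effect indices of an input X i for a model output Y are commonly given by the standard variance-based definitions below (general definitions, not reproduced from any one of the surveyed papers):

```latex
S_i = \frac{\mathrm{Var}_{X_i}\big(\mathbb{E}_{X_{\sim i}}[\,Y \mid X_i\,]\big)}{\mathrm{Var}(Y)},
\qquad
S_{T_i} = 1 - \frac{\mathrm{Var}_{X_{\sim i}}\big(\mathbb{E}_{X_i}[\,Y \mid X_{\sim i}\,]\big)}{\mathrm{Var}(Y)},
```

where X ∼i denotes all input variables except X i.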

An output variance sensitivity analysis based on the ANOVA decomposition was formalised by Sobol, who proposed the approximation of sensitivity indices of first and higher order using Monte-Carlo methods [ 101 ], while Saltelli [ 102 ] and Saltelli et al. [ 103 ] improved upon Sobol's approach using more efficient sampling techniques for first-order, higher-order, as well as total-effect indices.
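In practice, such variance-based indices can be estimated with the SALib Python library; the sketch below analyses a hypothetical three-input model, the sampling size is arbitrary, and newer SALib releases expose the same functionality under slightly different module names.

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

# Hypothetical model with three inputs (the Ishigami test function).
def model(x):
    return np.sin(x[:, 0]) + 7 * np.sin(x[:, 1]) ** 2 + 0.1 * x[:, 2] ** 4 * np.sin(x[:, 0])

problem = {
    "num_vars": 3,
    "names": ["x1", "x2", "x3"],
    "bounds": [[-np.pi, np.pi]] * 3,
}

# Saltelli sampling followed by Sobol analysis.
param_values = saltelli.sample(problem, 1024)
Y = model(param_values)
Si = sobol.analyze(problem, Y)

print(Si["S1"])   # first-order indices
print(Si["S2"])   # second-order indices
print(Si["ST"])   # total-effect indices
```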

Cukier et al. [ 104 ] proposed the Fourier Amplitude Sensitivity Test (FAST) method to improve the approximation of Sobol's indices. This is achieved by applying a Fourier transformation to transform a multi-dimensional integral into a one-dimensional integral, with different transformations leading to different distributions of sampled points. Saltelli et al. [ 105 ] improved upon FAST to compute the total-effect indices, while Tarantola et al. [ 106 ] extended random balance designs, applied by Satterthwaite in regression problems, to sensitivity analysis for non-linear, non-additive models by combining them with FAST (RBD-FAST). The RBD-FAST method was further improved in terms of computational efficiency by Plischke [ 107 ], while Tissot et al. introduced a bias correction method in order to improve estimation accuracy [ 108 ].

Another method for global sensitivity analysis is that of Morris [ 110 ], which is often referred to as the one-step-at-a-time method (OAT). Under this approach, the input variables are split into three groups: input variables whose contributions are insignificant, inputs that have significant linear effects of their own without any interactions, and inputs that have significant non-linear and/or interaction effects. This is achieved through discretising the input space for each variable and iteratively making a number of local changes (one at a time) at different points for the possible range of input values. The Morris method, while complete, is very costly and, as a result, in some cases, fractional factorial designs, as described in [ 109 ], need to be formulated and employed in practice, in order for sensitivity analysis to be performed more efficiently. By devising a more effective sampling strategy as well as other improvements, Campolongo et al. [ 111 ] proposed an improved version of Morris’s method.

In some cases, variance is not a good proxy for the variability of the distribution. As a result, some studies have focused on developing sensitivity indices that are not based on variance, which are often referred to as moment independent importance measures, requiring no calculation of the output moments. One example is Borgonovo’s [ 112 ] distribution or density based sensitivity indices, which measure the distance or the divergence between the unconditional output distribution and output distribution when conditioned on one or more input variables. Building on from the work of Borgonovo, Plischke et al. [ 113 ] introduced a new class of estimators for approximating density-based sensitivity measures, independent of the sampling generation method used.

Introduced by Sobol and Kucherenko [ 114 ], the method of derivative-based global sensitivity measures (DGSM) is based on averaging local derivatives using Monte Carlo or Quasi-Monte Carlo sampling methods. DGSM, which can be seen as a generalisation of the Morris method, are much easier to implement and evaluate when compared to the Sobol sensitivity indices.

3.4.2. Adversarial Example-based Sensitivity Analysis

Adversarial examples are data points whose features have been perturbed by a subtle yet sufficient amount to cause a machine learning model to make incorrect predictions about them. Adversarial examples are similar to counterfactual examples; however, they do not focus on explaining the model, but on misleading it. Adversarial example-based sensitivity analysis methods are methods that create adversarial examples for different kinds of data, such as images or text.

It was Szegedy et al. [ 115 ] who first discovered that the functions learnt by deep neural networks can be significantly discontinuous, thus their output is very fragile to certain input perturbations. The term "adversarial examples" was coined for such perturbations and it was found that adversarial examples can be shared among neural networks with different architectures, trained on different subsets, disjoint or not, of the same data: the very same input perturbations that caused one network to misclassify an instance can cause a different network to also alter its output dramatically. The problem of finding the minimal necessary perturbations was formulated as a box-constrained L 2 -norm optimisation problem and the L-BFGS optimisation algorithm was employed in order to approximate its solution. Goodfellow et al. [ 116 ] argued that high-confidence neural network misclassifications caused by small, yet intentional, worst-case datapoint perturbations were not due to nonlinearity or overfitting, but instead due to neural networks' linear nature. In addition to their findings, they also proposed a fast and simple, yet powerful, gradient-based method of generating adversarial examples using the L ∞ norm, called the fast gradient sign method (FGSM). Figure 8 illustrates the effectiveness of the FGSM method, where instances of the MNIST dataset are perturbed using different values of ϵ , resulting in the model misclassifying them.

Figure 8. Fast Gradient Sign Method (FGSM) attack on the MNIST dataset. The first row contains unperturbed images, while the images in the subsequent rows are perturbed using some ϵ value, resulting in the model misclassifying them.
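The core FGSM perturbation can be sketched in a few lines of PyTorch; the model, loss, ϵ value, and the assumed [0, 1] pixel range are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.1):
    """Generate an FGSM adversarial example: x_adv = x + epsilon * sign(grad_x loss)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()

    # One step in the direction of the sign of the input gradient.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    # Keep pixel values in their valid range (here assumed to be [0, 1]).
    return x_adv.clamp(0, 1).detach()
```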

In order to test the sensitivity of deep learning models, Moosavi-Dezfooli et al. proposed DeepFool [ 117 ], a method that generates minimum-perturbation adversarial examples optimised for the L 2 norm. By making simplifying assumptions, DeepFool employs an iterative process of classifier linearisation, producing adversaries that work well against both binary and multi-class classifiers. Moosavi-Dezfooli et al. [ 118 ] also came up with a formulation that is able to produce a single perturbation such that the classifier misclassifies most of the instances. The existence of these so-called "universal adversarial examples" exposed the inherent weaknesses of deep neural networks across all of their inputs. Papernot et al. [ 119 ] conducted a thorough investigation regarding adversarial behaviour within the deep learning framework and proposed a new class of algorithms able to generate adversarial instances. More specifically, the method exploits the mathematical relationship between the inputs and outputs of deep neural networks to compute forward derivatives and subsequently construct adversarial saliency maps. Finally, the authors pointed towards the development and utilisation of a distance metric between non-adversarial inputs and the corresponding target labels as a way to defend against adversarial examples. Kurakin et al. [ 120 ] highlighted that, although most studies regarding machine learning sensitivity assume that adversarial examples can be input directly into the classifier, this assumption does not always hold true for classifiers engaging with the physical world, such as those receiving input in the form of signals from other devices. To this end, among the other methods used, a new method that improves upon the FGSM [ 116 ] algorithm was introduced, whereby FGSM is repeated many times with a small step size, truncating the intermediate results after each step, so that the produced adversarial examples (pixels, in this case) remain within close range of the original examples. Dong et al. [ 121 ] promoted the use of momentum in order to enhance the process of creating adversarial instances using iterative algorithms, thus introducing a broad class of momentum-based iterative adversarial algorithms. Momentum is well known to help iterative optimisation algorithms, such as gradient descent, stabilise update directions and escape from poor local minima/maxima.

NATTACK: instead of seeking an optimal adversarial example, Li et al. [ 122 ] considered fitting a probability distribution in a neighbourhood centred around a given example, with the assumption being that any example generated from this distribution is a good adversary candidate. The proposed approach offers two distinct benefits: first, it can be employed to attack any model and, secondly, it does not require any knowledge of the model's internal workings.

Carlini and Wagner [ 123 ] introduced three novel adversarial attack algorithms, based on the L 0 , L 2 , and L ∞ norms, respectively, which were very effective against neural networks, even those to which the defensive distillation technique [ 124 ] had been applied. The proposed attacks aim to address the same minimal perturbation problem as Szegedy et al. [ 115 ], but formulate it using the margin loss instead of the cross-entropy loss, thus minimising the distance between adversarial and benign examples in a more direct way. In [ 125 ], Carlini et al. demonstrated how to construct a provably strongest attack, also called the ground truth attack. The problem of finding adversarial examples proven to be of minimal distortion was formulated as a linear-like optimisation problem. The deduced adversarial example, having the most similarity to the original instance, is called the ground truth adversarial example.

Spatially Transformed Attack: Xiao et al. [ 126 ] proposed perturbing images by performing a slight spatial transformation such as translating, rotating and/or distorting the image features. Such perturbations are small enough to escape human attention but are able to trick models.

One-pixel Attack: Su et al. [ 127 ] showed how neural networks can be fooled by altering the value of just a single input pixel. By constraining the L 0 norm, they enforced a limit on the number of pixels that were allowed to be perturbed.

Zeroth order optimisation based attack (ZOO): assuming that one has access to the prediction probability scores (rather than just the predicted labels) of a classifier for the respective inputs, Chen et al. [ 128 ] proposed an algorithm to infer gradient information by observing changes in the prediction scores, thus eliminating the need for a substitute model when creating adversarial examples.

In their study [ 129 ], Narodytska et al. focused on generating adversarial examples for any deep convolutional neural network without prior knowledge of the internal workings of the network in question. To this end, they proposed two pixel-perturbing methods that operate without using any gradient information: the first one randomly selects and perturbs a set of pixels, while the second one improves upon the first by incorporating a greedy local-search algorithm to efficiently locate a better set of pixels to perturb. Introduced in [ 130 ], HopSkipJumpAttack is a group of adversarial-example-generating algorithms that rely on binary information regarding the decision boundary and Monte Carlo methods in order to approximate the direction of the gradient. The method is able to produce both targeted and non-targeted examples that are optimised for the L 2 and L ∞ norms.

Liu et al. [ 131 ] performed a thorough investigation on the transferability of both non-targeted and targeted adversarial examples using models and datasets of large scale, concluding that while transferring non-targeted adversarial examples can be very effective in fooling neural networks, targeted adversarial examples do not transfer as well. To this end, they proposed new ways of producing effective, transferable adversarial examples, both targeted and non-targeted, with a high success rate when tested against a black-box image classification model. Houdini [ 132 ] is an approach proposed by Cisse et al. that is able to produce adversarial instances for any specified task, according to the respective measure of performance. Houdini's adversarial attacks were employed with success on a variety of structured prediction tasks, including the typical image classification challenge, but also extending the use of adversarial examples to other problems, such as voice recognition, pose estimation, and semantic segmentation. Finally, it should be noted that, in terms of measures of performance for the different tasks, Houdini is capable of handling complex measures, even non-decomposable ones, as well as combinations of measures. In [ 133 ], a novel approach that uses an elastic-net-based regularisation framework (the combination of the L 1 and L 2 norms) to generate adversarial instances against deep neural networks was proposed. Empirical results on three different image datasets showed that the proposed framework was able to produce adversarial examples that can break through the defensive distillation technique and have high transferability. Lastly, the inner workings of the method and its way of exploiting the L 1 norm revealed new useful insights into the relationship between the L 1 norm and the generation of effective adversarial examples. Papernot et al. [ 134 ] proposed a novel method for generating adversarial examples by examining the inputs provided to a deep neural network and the corresponding labels assigned by the network. The method consists of training a model using synthetic instances, generated by an adversary, as input and the neural network's predictions on these instances as the true labels. The trained model is subsequently used to create adversarial examples to attack the neural network. Such examples would be misclassified not only by the trained substitute model, but also by the neural network, as, by definition, the two would have similar decision boundaries.

Brendel et al. [ 135 ] highlighted the lack of scientific studies regarding decision-based adversarial attacks and pointed to the benefits and the versatility of such attacks, namely that they can be used against any black-box model, require only the observation of the model's final decisions, are easier to implement compared to transfer-based attacks, and, at the same time, are more effective against simple defences when compared to gradient-based or score-based attacks. To support their arguments, they introduced the so-called Boundary Attack, a decision-boundary-based adversarial attack, which, in principle, begins by creating adversarial instances with high-degree perturbations and subsequently decreases the level of perturbation. More specifically, through a rejection process, the method learns the decision boundary between non-adversarial and adversarial instances and, with this knowledge, is able to generate effective adversaries. Brendel et al. [ 136 ] also developed a novel family of gradient-based adversarial attacks that not only performed better than previous gradient-based attacks, but were also more effective against gradient masking, more efficient in terms of querying the defended model, and able to optimise for a variety of adversarial criteria. Unlike other methods that explore areas far away from the decision boundary and, as a result, might get stuck, the point-wise attack only stays in areas close to the boundary, where gradient signals are more reliable, in order to minimise the distance between the adversarial and original example. Koh and Liang [ 137 ] proposed an indirect method of generating adversarial examples. The proposed method is capable of explicitly calculating what the difference in the final loss would be if one training example was altered, without retraining the model. By identifying the training instances with the highest impact on the model's predictions, powerful adversaries can be deduced.

In the works of Zugner et al. [ 138 ] and Dai et al. [ 139 ], adversarial examples in graph-structured data were studied. The former method is a greedy approach that is concerned with attacking node classification models through the modification of node connections (adding/removing edges between nodes) or node features (flipping the features of nodes with a limited number of operations). Three different settings were considered: manipulation of all nodes in the graph, of a set of nodes including the node in question, and of a set of nodes excluding the node in question. The latter attack method is based on a reinforcement learning formulation of the problem and, more specifically, a Q-Learning game. Under this approach, only the addition and removal of edges is allowed when altering the graph structure.

In [ 140 ], a graph attack based on meta-learning was proposed. Meta-learning has been historically employed for fast reinforcement learning, hyperparameter tuning, and few-shot image recognition. In this scenario, the graph structure of the network was used as input to a meta-learning algorithm as the hyperparameter to be optimised.

Sharif et al. [ 141 ] proposed a method for fooling face recognition neural networks by modifying the original images through the insertion of 3D-printed sunglasses into the original face images. The colour of these glasses was optimised towards leading the neural network to misclassify the faces in question. Hayes and Danezis [ 142 ] introduced a generative universal adversarial example framework, whereby image perturbations are produced by a generative model such that, when incorporated into a normal, non-adversarial instance, they transform it into an adversarial instance. Because the generator is not conditioned on the given images, the generated perturbations can be applied to any image to transform it into an adversarial one. Schott et al. also developed a high-accuracy image classification model that is robust against adversarial attacks and utilises the analysis-by-synthesis approach [ 143 ]. More specifically, for each instance in the dataset, a lower bound of the ELBO loss given each class is calculated and, subsequently, these class-conditional ELBOs are synthesised in order to produce the final prediction. Furthermore, two new attacks were developed: one specifically tailored to work well against the proposed model by exploiting its structure, and a decision-based attack that optimises towards the smallest number of perturbed pixels.

In noise-based adversarial attacks, original examples are perturbed with the addition of some form of noise before being passed as input to a machine learning model. However, in many cases, this addition of noise can cause some input values to fall outside their originally defined domain, and therefore clipping is required if they are to be passed to the model. The clipping methods proposed prior to [ 144 ] were relatively slow and only provided approximations to the optimal solution, thus diminishing the effectiveness of the produced adversarial examples. In order to improve both the effectiveness and speed of the previously proposed clipping methods, Rauber and Bethge [ 144 ] proposed a fast and differentiable algorithm to rescale perturbation vectors, under which a perturbation with the desired norm after clipping can be analytically calculated using a closed-form solution.

Adversarial example vulnerability also exists in deep reinforcement learning modelling, as demonstrated by Huang et al. [ 145 ]. By employing the FGSM method [ 116 ], the authors created adversarial states to manipulate the network’s policy. They showed that even slight state perturbations can potentially lead to very significant differences in terms of performance and decisions.

Yang et al. [ 146 ] focussed on generating adversarial examples for discrete data, such as text. Firstly, a two-step greedy approach that locates which words in a piece of text to perturb and then alters them accordingly was implemented; secondly, they proposed a novel method, called Gumbel, in which the two steps of the first approach are parameterised and a model is trained to find the optimal perturbations. Samanta and Mehta [ 147 ], as well as Iyyer et al. [ 148 ], proposed methods for generating adversarial sentences that are both grammatically correct and in agreement with the syntax of the original sentences. To this end, the former replaced original words with synonyms and exploited words that have different meanings when used in different contexts, while the latter used paraphrasing techniques. Miyato et al. [ 149 ] proposed applying perturbations to the word embeddings in a recurrent neural network instead of to the original input. The produced word embeddings were shown to be of greater quality, while the resulting model was shown to be less prone to over-fitting. Ebrahimi et al. [ 150 ] considered replacing a single character in a sentence in order to fool character-based text classifiers. Using gradient information, the method identifies the most influential letter to be replaced. A closely related work [ 151 ] by Liang et al. creates adversaries by adding, removing, and altering words or phrases instead of single characters. Such words or phrases are identified as more or less influential based on the influence of their individual characters, similarly to [ 150 ].

In their study, Jia and Liang [ 152 ] investigated generating adversarial examples for reading comprehension tasks: given a paragraph and a related question, the model has to generate an answer. Focusing on models using the Stanford Question Answering Dataset (SQuAD), they devised two attacks, ADDSENT and ADDANY, which both try to create adversarial examples by adding words from the original question. In addition, two variants of the original attacks were developed: ADDONESENT, where a random human-approved sentence is added to the original paragraph, and ADDCOMMON, which is identical to ADDANY, except that common words are added instead. Alzantot et al. [ 153 ] proposed a method to generate adversarial examples for text using a population-based genetic algorithm. The algorithm, which operates by looping through every word in each sentence and applying perturbations based on swapping counter-fitted word embeddings, yielded very high success rates when its adversarial examples were used to attack sentiment analysis models as well as textual entailment models. A similar idea was later also proposed by Kuleshov et al. [ 154 ], who used word replacement by greedy heuristics, while later Wang et al. [ 155 ] improved upon the genetic algorithm, achieving not only higher success rates, but also lower word substitution rates and more transferable adversarial examples when compared to [ 153 ].

DeepWordBug: the basic idea behind DeepWordBug [ 156 ] is to come up with a scoring strategy that is able to determine the text pieces which, if manipulated, are most likely to force a model into misclassifications. Such manipulations include token insertions, deletions, substitutions, as well as k-nearest-neighbour token swaps based on cosine similarity. TextBugger [ 157 ] works in a similar fashion, providing improvements over DeepWordBug through the introduction of novel scoring functions.

Seq2Sick: Cheng et al. [ 158 ] considered adversarial attacks against seq2seq models, which are widely adopted in text summarisation and neural machine translation tasks. The two main challenges in producing successful seq2seq attacks are the discrete input domain and the almost infinite output domain. The former problem was addressed through the development of a projected gradient method combined with group lasso and gradient regularisation, while the latter was handled using newly proposed loss functions.

Feng et al. [ 159 ] introduced a process, called "input reduction", which can expose issues regarding overconfidence and oversensitivity in natural language processing models. Under input reduction, non-important words are removed from the input text in an iterative fashion, while the model's prediction for that input remains unchanged. The authors demonstrated that input texts can have their words removed to a degree where they make no sense to humans, without any impact on the model's output. Ren et al. [ 160 ] proposed a greedy algorithm for textual adversarial example generation, called probability weighted word saliency (PWWS), which follows the synonym substitution strategy, but replaces words based on their word saliency and classification probability. TextFooler [ 161 ] generates adversarial examples for text by utilising word embedding distance and part-of-speech matching to first identify the most important words in terms of the model's output, and subsequently greedily replaces them with synonyms that fit both semantically and grammatically until a misclassification occurs. The BERT language model was utilised in two studies in order to create textual adversarial examples: Garg and Ramakrishnan [ 162 ] and Li et al. [ 163 ] both proposed generating adversarial examples through text perturbations based on the BERT masked language model, in which part of the original text is masked and alternative text pieces are generated to replace these masks. In their work [ 164 ], Tan et al. proposed Morpheus, a method for generating textual adversarial examples by greedily perturbing the inflections of the original words in the text to find the inflected forms with the greatest loss increase, only taking into consideration inflections that belong to the same part of speech as the original word. Unlike most work on textual adversarial examples, Morpheus produces its adversaries by exploiting the morphology of the text. Zang et al. [ 165 ] suggested applying word substitutions using minimum semantic units, called sememes. The assumption was that the sememes of a word are indicative of the word's meaning and, therefore, words with the same sememes should be good substitutes for one another. To search for such words efficiently, an algorithm based on particle swarm optimization (PSO) was proposed.
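As a minimal illustration of the greedy synonym-substitution strategy shared by several of these methods, the sketch below scores each candidate substitution by the drop it causes in the model's confidence in the true class and greedily applies the best one; the predict_proba classifier interface and the get_synonyms helper are hypothetical placeholders, not any particular method's API.

```python
def greedy_synonym_attack(sentence, true_label, predict_proba, get_synonyms, max_subs=5):
    """Greedy word substitution: repeatedly replace the word whose best synonym
    most reduces the model's confidence in the true label."""
    words = sentence.split()
    for _ in range(max_subs):
        base_score = predict_proba(" ".join(words))[true_label]
        best = None  # (true-class score after substitution, word index, synonym)
        for i, word in enumerate(words):
            for synonym in get_synonyms(word):
                candidate = words[:i] + [synonym] + words[i + 1:]
                score = predict_proba(" ".join(candidate))[true_label]
                if best is None or score < best[0]:
                    best = (score, i, synonym)
        if best is None or best[0] >= base_score:
            break  # no substitution reduces confidence any further
        words[best[1]] = best[2]
        if predict_proba(" ".join(words)).argmax() != true_label:
            break  # misclassification achieved
    return " ".join(words)
```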

Studies on sensitivity analysis over recent years have focused on exposing the weaknesses of deep learning models and their vulnerability to adversarial attacks. The literature is extensive when it comes to fooling models in computer vision and natural language processing tasks. However, minimal work has been done on tabular data; in theory, some of the adversarial example generation techniques from computer vision could be applied to tabular data, but their effectiveness has not yet been clearly demonstrated.

4. Discussion and Conclusions

The main contribution of this study is a taxonomy of the existing machine learning interpretability methods that allows for a multi-perspective comparison among them. Under this taxonomy, four major categories for interpretability methods were identified: methods for explaining complex black-box models, methods for creating white-box models, methods that promote fairness and restrict the existence of discrimination, and, lastly, methods for analysing the sensitivity of model predictions.

As a result of the high attention paid by the research community to deep learning, the literature around interpretability methods has been largely dominated by neural networks and their applications to computer vision and natural language processing. Most interpretability methods for explaining deep learning models refer to image classification and produce saliency maps, highlighting the impact of the different image regions. In many cases, this is achieved by exploiting the gradient information flowing through the layers of the network, with Grad-CAM [ 35 ], a direct extension of [ 34 ], being a prime example and the most influential in terms of citations per year. Another way of creating saliency maps, and the most influential overall by the same metric, is through the adoption of deconvolutional neural networks [ 32 ]. In terms of explaining any black-box model, LIME [ 45 ] and SHAP [ 48 ] are by far the most comprehensive and dominant methods across the literature for visualising feature interactions and feature importance, while Friedman’s PDPs [ 59 ], although much older and less sophisticated, remain a popular choice. LIME and SHAP are not only model-agnostic, but they have also been demonstrated to be applicable to any type of data.
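
As an illustration of how such methods are typically applied in practice, the following sketch computes SHAP attributions and a partial dependence plot for a tree ensemble on tabular data. It assumes the shap and scikit-learn packages are installed; exact APIs and return shapes may differ between versions.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay
import shap

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# SHAP: local feature attributions, aggregated here into a global summary plot
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # per-class attributions for tree ensembles
shap.summary_plot(shap_values, X)

# PDP: marginal effect of a single feature on the model's prediction
PartialDependenceDisplay.from_estimator(model, X, features=["mean radius"])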

Highly performing white-box models are very hard to create, especially in computer vision and natural language processing, where the performance gap to deep learning models is unbridgeable. Furthermore, because models are more than ever expected to be competitive on more than one task, and knowledge transfer from one domain to another is becoming a recurring theme, white-box models, which perform well only on a single given task, are losing traction within the literature and are quickly falling further behind in terms of interest. The most notable work in this category is that of Caruana et al. [ 65 ], who proposed a version of generalized additive models with pairwise interactions (GA2Ms), originally proposed in [ 66 ], reporting state-of-the-art accuracy in two healthcare applications.
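
As a hedged illustration, the open-source interpret package provides Explainable Boosting Machines, a popular implementation in the spirit of GA2Ms; the snippet below is a usage sketch of that package rather than the exact models reported in [ 65 ].

from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
ebm = ExplainableBoostingClassifier(interactions=10)   # allow pairwise interaction terms
ebm.fit(X, y)
show(ebm.explain_global())   # per-feature and per-interaction shape functions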

A great deal of effort and progress has been made towards tackling discrimination and supporting fairness in machine learning, from which sensitive domains such as banking, healthcare, or law could benefit. However, these methods are neither commonly found nor well promoted within the dominant machine learning frameworks. In this category, the work of Hardt et al. [ 92 ], which introduced a generalised framework for quantifying and reducing discrimination in any supervised learning setting, has been a milestone and the point of reference for many other studies. That being said, only a few studies deal with fairness in non-tabular data, such as images and text, which leaves plenty of room for improvement and innovation in these unexplored areas in the coming years.
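
For illustration, a minimal check of equalized odds, the criterion introduced by Hardt et al. [ 92 ], can be written directly from the group-conditional true- and false-positive rates; the variable names below are purely illustrative.

import numpy as np

def rates_by_group(y_true: np.ndarray, y_pred: np.ndarray, group: np.ndarray) -> dict:
    """Return {group_value: (TPR, FPR)} for binary labels and predictions."""
    out = {}
    for g in np.unique(group):
        m = group == g
        tp = np.sum((y_pred[m] == 1) & (y_true[m] == 1))
        fn = np.sum((y_pred[m] == 0) & (y_true[m] == 1))
        fp = np.sum((y_pred[m] == 1) & (y_true[m] == 0))
        tn = np.sum((y_pred[m] == 0) & (y_true[m] == 0))
        out[g] = (tp / max(tp + fn, 1), fp / max(fp + tn, 1))
    return out

# Equalized odds holds (approximately) when both rates are equal across all groups.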

Sensitivity analysis, which is the last category of interpretability methods under this taxonomy, has seen tremendous growth over the past several years following the breakthrough work of Szegedy et al. [ 115 ] on adversarial examples and the weaknesses of deep learning models against adversarial attacks. Numerous methods for producing adversarial examples have been developed, with some of them addressing a more general setting, while others are tailored to specific data types, such as image, text, or even graph data, and to specific learning tasks, such as reading comprehension or text generation.
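
As one concrete example of this line of work, the sketch below applies a simple gradient-sign perturbation to the input of a differentiable classifier. It is an illustrative, minimal attack in the spirit of the gradient-based methods that followed [ 115 ], assuming a PyTorch model; it is not tied to any particular method surveyed above.

import torch
import torch.nn.functional as F

def gradient_sign_attack(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                         eps: float = 0.03) -> torch.Tensor:
    """Perturb x by eps in the direction of the loss gradient's sign."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # step in the direction that increases the loss, keeping pixel values in [0, 1]
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()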

Despite its rapid growth, explainable artificial intelligence is still not a mature and well-established field, often suffering from a lack of formality and of well-agreed-upon definitions. Consequently, although a great number of machine learning interpretability techniques and studies have been developed in academia, they rarely form a substantial part of machine learning workflows and pipelines.

The volume of studies on machine learning interpretability methods over the past years demonstrated the room for improvement that exists, by showcasing the benefits and enhancements that these methods can bring to existing machine learning workflows, but it also exposed their flaws and weaknesses and how much ground they still have to cover, performance aside. In any case, it is our belief that explainable artificial intelligence still has unexplored aspects and a lot of potential to unlock in the coming years.

Abbreviations

The following abbreviations are used in this manuscript:

Appendix A. Repository Links


Author Contributions

Conceptualization, S.K.; formal analysis, V.P. and P.L.; investigation, P.L. and V.P.; methodology, V.P. and P.L.; project administration, S.K.; resources, P.L. and V.P.; supervision, S.K.; validation, V.P. and P.L.; visualization, P.L. and V.P.; writing—original draft preparation, P.L.; writing—review and editing, V.P. All authors have read and agreed to the published version of the manuscript.

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

«1» Generally speaking, Artificial Intelligence (AI) plays two roles in Decision-Making. The first one is as an assistant to the process itself, by providing information through inference (e.g., a profile about a subject or situation) to the (human) agent responsible for the decision. The second one is as agents with actual autonomy, both in decision-making and the execution itself (e.g. deleting videos with unauthorized use of copyrighted music). In any case, AI technology is not exploited as an isolated artifact, it is imbricated in a broader treatment of information, often socio-technical systems that include humans (AI engineers, data scientists, scientific experts, stakeholders, etc.). Therefore, monitoring of the overall system where such modules are embedded becomes a critical concern. It will need some specifications of the system behavior, notoriously in several Data Science (DS) environments such as Big Data (BD), the Internet of Things, and Cloud Computing. Sometimes it is sufficient to get an explanation of the particular outcome or decision, what would be called a local explaining .

«2» The need is not global, although it is mandatory in scenarios where people’s rights are affected. Stakeholders should ask for some form of certification, traceability and evaluation of applicability and performance (van de Poel, 2020 ). Likewise, AI-based systems are used for executive decision-making advice on particular issues in Complex Systems (CS) that seem harmless and may not be so. Cases as varied as social networks, social dynamics, prediction of urban dynamics, etc. may not be critical, and still, they could impact user’s rights.

«3» AI systems are paradigmatic technical artifacts, objects with a technical function and a physical structure consciously designed, produced, and used by humans to realize such a function (Kroes et al., 2006 ). But nor every artifact that makes an automated decision is an AI system, nor every AI is Machine Learning (ML), and nor everything announced as AI is, in fact, AI-based. There is an evident hype on the subject; the term is often used within marketing strategies to justify business decisions. However, the use of AI-based modules for empowering another kind of system makes, indeed, that the latter inherit the potential explainability needs from the former. Such inheritance could happen regardless of the three main complexity levels of research in Explainable AI (XAI) (Doran et al., 2017 ): comprehensible systems that emit symbols enabling user-driven explanations of how a conclusions are reached, interpretable systems where users can mathematically analyze its algorithmic mechanisms, and opaque systems that offer no insight into its algorithmic mechanisms. The latter is the most troublemaking one. It could be stated that, in the context of XAI, such systems should be understood as High Technology , according to D. Ihde’s definition (Ihde, 2010 , p. 58):

Complex and intertwined systems that while are understood through scientifically derived theories, their components are esoteric (nor do we understand their function) although we know that they are the result of complex and scientifically determined processes and finally we concentrate some information on tolerance and internal organization.

As it is frequent in Complex Artificial Intelligent Systems (CAIS), the (formalized principles from) scientific models are hidden inside the software. Thus the high level of understanding required becomes difficult to acquire (as Ihde’s notion points out).

«4» Rather than limiting itself to the details of the technology, in some cases, the explanation should also show the outcomes that the AI-based solution can bring. Also, how would its impact on the business be as a whole? Such requirements turn the explanation into a product of an interactive scenario, a socio-technical system framed by the interaction between the data scientist -assisted by a CAIS- and the stakeholders. This is a product of the current socio-technological ecosystem where AI systems are increasingly used to support decision-making of institutions and governments. It is not a completely new setting in AI (e.g. the well-established Knowledge Engineering), only is becoming much more relevant due to the new systems and players involved. For example, it sought solutions that must satisfy the need for transparency in decision making that affects the citizenry (de Fine Licht & de Fine Licht, 2020 ). In fact, it is a challenge how to achieve the general public perceiving AI-based decision-making as a producer of legitimate and acceptable solutions. The process transparency could augment such a perception, damaged by the technology evolution. The ubiquitous deployment of increasingly complex AI systems challenges humans’ confidence in their performance, experimental correctness, and validity. In addition, there is the fact that Trust in High Technology comes in a different belief to the AI engineer than to a politician (as an extreme example). Broaching this issue by using tools transferred and adapted from Knowledge Engineering may be insufficient.

«5» As argued by Miller ( 2019 ), the creation of explainable intelligent systems requires addressing some issues. Firstly, those that come from the consideration of Explaining as the product of the interactivity between humans and the (automated or semi-automated) AI system. Secondly, the design of representations supporting the articulation of the explanations is required. To these requirements others should be added -somewhat distant from IA- that would affect other dimensions, such as the social (inter-agent). Weld and Bansal ( 2018 ) required for a good explanation to be simple, easy to understand, faithful (accurate), and conveying the true cause of the event. They shape the balancing problem between two demands: is explanation’s primary purpose to convince the explainee to accept the computer’s outcomes (perhaps by presenting a simple, plausible, but an unlikely explanation) or it aims to achieve the explainee’s literacy about the soundness of the technological solution?

«6» The impact of XAI-related issues (and their collateral effects) are not only computational and commercial in nature (Weld & Bansal, 2019 ). Systems usability (namely the democratization of their use) on the one hand, and the legal, ethical and social consequences (on the rise in public opinion) on the other hand, play a relevant role in XAI. The unstoppable advance of AI is causing a social and cultural crisis regarding the safety of the outcomes of the systems. Above all, in those touchy realms for society, with relevant media impact (and which are responsible for a significant part of XAI visibility). For instance, it seems socially unacceptable that an autonomous car does not reach a very high success rate (to safety), close to 100%. Nevertheless, they are far from approaching levels of this magnitude (Biewald, 2016 ). Not only it is necessary the disengagement decisions -when the human had to take control of a vehicle- are explainable by the technical agents. It also needs to be understandable by stakeholders. Also, it demands that these events do not cause serious accidents. This way, current systems cannot completely ensure that; they are still far from human performance. This barrier suggests the existence of a glass ceiling for autonomous car performance. It is not the only one; it could add another relevant social fact. The best known explanations for public opinion are those related to serious accidents of these autonomous systems. They usually come from diagnostic processes that, due to their importance, seek confidence in the explanation in which all the agents are involved. Thus, it is composed of technical reports, tracing data and normative documentation. The explaining demands are thus actual challenges for the socio-technical system in which autonomous cars are embodied, that require a system-level analysis (that comprises agents such as manufacturers, vendors, institutions) (Borenstein et al., 2019 ). There exist even more sensitive instances, as the development of lethal autonomous weapons systems. Whether they can determine if a target has military significance (military necessity)? Whether the target is a combatant (distinction)? Whether the military action is overkill to accomplish the task (proportionality)? To date, the U.S. military is worried about whether or not this determination can be carried out by autonomous systems without a human in the “loop” making this kind of decision (Price et al., 2018 ).

«7» Scenarios of this kind do and will persistently occur. They share an essential feature: the need to tailor the explanation (of CAIS behavior or outcomes) to be acceptable by a layman. Software/hardware is accompanied by human behavior and social institutions in what represents a socio-technical system (Kroes & Verbeek, 2014 ). This kind of scenario in mind, where convincing the explainee (encompassing its psychological dimension) may be more relevant than providing a correct explanation. Throughout the paper, it is called an Explanation-to-Layman-Explainee Scenario (ELES). ELES can be considered as a complex socio-technical system, compounded by increasing abstraction layers (Fig. 1 ), some of them hidden by the explainee, although play a relevant role when it comes to the explainer getting the explanation acceptable to the stakeholder. In such circumstances, likely, most of the understanding (e.g. of the software libraries) and the knowledge (of programming and parameter tuning practices) that the explainer deploys to find the solution is not finally represented in the explanation. This is where XAI urgently needs to broach the philosophical, social, and psychological dimensions of the challenge, beyond the pure human-computer interaction, usability issues, and user experience [being aware of the difficulties that exist in trying to reconcile the two fields (Páez, 2009 )].

In order to tackle the above challenge, we claim that two ingredients have to play a more relevant role in XAI, which are briefly outlined (and contextualized) below: The Knowledge Level envisioning of AI and the Bounded Rationality paradigm. The following is a brief description of both, but we anticipate the reader that the paper is not committed to their direct application. Instead, the aim is to take advantage of the ideas and practices of the two fields to outline and assist the XAI process.

1.1 The Knowledge Level

«8» One of the ingredients of the authors’s proposal is a proper reading of the Knowledge Level (KL) in Explaining, presented as a way to tackle the problem of building explanations in the three elements: explanans, explanandum and inference link.

«9» Newell’s KL begins with the premise that representation and reasoning are intrinsically separated, in the way that inference (e.g. automated reasoning) works with symbolic expressions without intended interpretation (Newell, 1982 ). The design of the reasoning module concerns only the formal soundness and validity of the reasoning as a mathematical apparatus. Thus one only needs to specify what the agent knows and what its goals are, a logical abstraction separate from implementation details. The behavior comes from the execution of the reasoning module on the representation of the agent’s knowledge. From the point of view of this paper, we consider KL-models as surrogate models for rational agents or CAIS (for example, rule-based systems). They appeal in the last instance to the (computational logic) soundness of the modeling. Symbolic models are surrogate insofar as, once checked its soundness, the model satisfies the general requirements of the surrogate ones. Many Data Science solutions starting from mathematical mechanisms (which are assumed to be well implemented in the software libraries) provide support for an inference. From this point of view one could say that even the Decision Layer of Fig. 1 would be susceptible to be modeled for explaining it under KL-inspired principles.

1.1.1 Unlimited Versus Bounded Reasoning in AI

«10» Davis et al. ( 1993 ) point out that one of the roles of Knowledge Representation and Reasoning (KRR) is that of being surrogate, substituting the original to reason about the world and infer the decision to be made (Fig. 1 ). It is not its only role; it is also useful to represent ontological commitments (assimilated to laws of nature) as well as theories for reasoning. KRR represents an environment where both it can organize the information, and the system can think . Other approaches as Addis’ ( 2014 ) (pp. 46) further break down these roles.

«11» Roughly, Rationality comprises five activities: Recognizing and defining a problem or opportunity, search for alternatives to follow, collection and analysis of data on each of the alternatives, evaluation of them in the light of the analysis, and finally, as the result of the latter, the selection and application of the preferred one.

«12» The starting point in KRR would be a scientific background devoted to the study of rational agents. It began with the search for computational models for Unlimited Reasoning (UR), where it was urgent to concentrate on epistemological adequacy rather than heuristic adequacy . The design of systems under (calculative) rationality arose. Possibly, this leads to inefficient systems, albeit with the hope to get as close as possible to UR using better programming and various approximation and acceleration techniques.

«13» Ideally, the adoption of an UR paradigm -ultimately based on logic inference- would reduce the problem to the choice of sound logic according to the Knowledge Level approach (see below). Heuristic adaptation would still be outside the direct scope of application. We only need insight to choose formalism, expecting that the increasing computational capacity alleviates the problems of the efficiency or the real computability Footnote 1 . How to adapt it to a framework with limited resources? There are fundamental differences between pure rational choice and bounded rational choice that should be accounted for (Table 2 summarizes the main differences between both forms of reasoning). These dissimilitudes play a role in scenarios where other rational attitudes come to play as in ELES.

1.2 Bounded Rationality as a model for human reasoning

«14» The considerations of human factors as relevant in XAI, beside the need for specification, lead to explore the suitability of approaches based on Bounded Rationality (BR). Simon ( 1957a ) [(see also Simon ( 1957b )] proposed BR as an alternative basis for the mathematical modelling of decision-making as used in Economics, Political Science and related disciplines. As such, it has to be studied as an indispensable discipline in both, the Social Sciences and AI (and their confluence), a status that it preserves even more importantly today (Gigerenzer & Selten, 2002 ; Moreira 2019 ). The basis of Simon’s BR lies in the fact that Simon ( 1957b ):

the capacity of the human mind for formulating and solving complex problems is very small compared with the size of the problems whose solution is required for objectively rational behavior in the real world-or even for a reasonable approximation to such objective rationality.

Some of the actions we perform which are not the result of a purely rational process are common, due to our intrinsic limitations in the formulation, the processing of information (reception, storage, retrieval, transmission) as well as in the synthesis of solutions itself.

«15» Our ability to work under limitations allow us to face and live within CS that we simply endure or solve using incomplete, not purely logical, knowledge (beliefs) reasoning, and yet we are effective at solving or coping with such problems (Duris, 2018 ). In BR, the decision process is seen -even for relatively simple problems- as a process that does not necessarily choose the optimal action (Hernandez & Ortega, 2019 ). People’s behavior is influenced by both available opportunities and desires, influenced in turn by other factors as their own beliefs. The use of beliefs (which are intentional in nature, and not necessarily true) means that they cannot even distinguish whether some options are viable or not, or whether they are favorable to their interests. That is, the choice could be not necessarily optimal, nor even heuristics-driven (in any case, not necessarily formalized or conscious).

1.3 From Unlimited Rationality to Bounded Resource Reasoning

«16» Let us focus for a moment on resource constraints, one of the pillars of BR, and how this is addressed in KL inspired paradigms such as Agent Theory (AT). Since engineers aim to build feasible engines, even starting from UR in AT, it is necessary to account for some limitations. Among these needs, the cost of processing should always be considered. The (modern textbook) concept of rational agent in Russell and Norvig ( 2003 ) takes into account the following idea: rational agents try to maximize the utility function according to the resources they have. However, such a definition does not limit the way of obtaining such maximization by logical deliberation.

«17» There exist some approaches, from the classical agent-based AI, to deal with the problem of resource constraints. According to Russell and Subramanian ( 1995 ), a bounded optimal agent is running the program (from its possible program space) with limited rational analysis. The authors state that it is necessary to distinguish between two types of costs: the cost of finding the optimal behavior, and that of running the associated program. It is also necessary to point out the distinction between optimizing the program and optimizing the solution that is being obtained . Russell and Subramanian further clarify the principles that should guide the design of agents under constraints. Namely, bounded optimality is desirable in reality, and it is possible to construct provably bounded optimal programs. Lastly, AI can be usefully characterized as the study of bounded optimality, particularly in the context of complex task environments and reasonably powerful computing devices. Thinking in XAI, the explainer (human or artificial) agents could be driven by these principles. Within the design of systems for XAI, according to the principles of KL, the particular data (perceptions from an event) to be explained should not play any role in the explanation producer (the explaining system). This general principle is observed by the state-of-the-art explaining systems LIME (Ribeiro et al., 2016 ) and SHAP (Lundberg & Lee, 2017 ).

1.4 Aims and Structure of the Paper

«18» The paper mainly concerns foundational issues. The aim is to discuss aspects related to the explanation versus the information available (or selected) versus general principles inspired in KRR. We investigate the fundamental issues of the role of KRR in explaining CS behavior. Likewise, the issue of applying BR solutions to achieve acceptable explanations for stakeholders, will be addressed; particularly in the case of ELES within Data Science (e.g. ML technologies and tools for inherently complex problems). Also, throughout the paper, the authors aim to convince the reader that XAI in Data Science owns particular features which should be investigated.

«19» First, we account for the psychological (Sect. 2 ) and sociological (Sect. 3 ) dimensions that frame XAI. We point out some considerations on the nature of the explanation as a product from agent interaction, a social construct. Some insights on its impact on the explanation building are discussed.

«20» Second, the question will be framed within another aim that the AI community should accept as indispensable for the promising development of XAI. Namely, the need to incorporate into XAI part of the body of work on Explaining (from Philosophy of Science) and particularly the use of the notion of mechanism (Sect. 4 ). Currently, most of engineering XAI approaches neglect many of these resources that can be useful. A paradoxical oversight, considering that the challenge links to a solid scientific and philosophical tradition -as it will be show in the paper. We do not intend, of course, to make a global review of the extensive literature on the topic, but only to point out some general considerations on the elements of explanation that should be taken advantage of, always within our vision as AI researchers.

«21» The paper is focused on whether one can consider BR for XAI, specifically, within the general question of XAI versus BR (versus logics) The aim is to present it as potential machinery to tackle the argumentative dialog driven to achieve the explanation. Starting from (Computational) Logic ideas (Sect. 5 ), the role of KL-based surrogate models is explored (Sect. 6 ). The option to formalize the different elements in Explaining will be further explored and detailed in the case of Data Science practices (Sect. 7 )

«22» As an ultimate goal, the incorporation of BR in the argument modeling for XAI (Sect. 8 ), mainly for local XAI [(argument modeling is a known resource in XAI (Rago et al., 2021 )]. We claim it could be influential for XAI in socio-technical systems for massive data processing. ELES is particularly interesting because circumscribes XAI into expert versus non-expert (e.g., stakeholder) scenarios presenting a considerable knowledge gap. A last section (Sect. 9) is devoted to point out some conclusions as well as future work.

1.4.1 Topics

«23» Throughout the paper, the relationship between some topics and XAI is mentioned. It could be classified according three dimensions of the problem: model/system used, the format of the explanation, and the context in which the work is carried out. The first concerns the system to be explained and the model on which the explanation would be based. (Table 3 ). The second one is about the three-element sequence \(\langle\) explanans, inference-link, and explanandun \(\rangle\) as a format for the explanation (Table 4 ). And lastly, the use, context, and impact as factors of the socio-technical systems involved in Explaining (Table 5 ).

Throughout the paper, it will analyze some elements from KL to be applied in XAI (besides BR or isolated in specific sections). Table 6 describes the general approach to XAI from KL, according to the three standard classifications for XAI solutions. In addition, Table 7 lists the commonly used abbreviations used in the paper.

2 On the Role of Psychological Features in Explaining

«24» The relevance of the explainee’s literacy is highlighted by considering factors beyond the technological dimension and close to the psychological. Users are more willing to accept automated decisions if the explanation is tailored to their level. Such acceptability refers both to the main features of the domain of discourse where is circumscribed the event/decision, and the use of technological tools, notably to its trust in them (Miller, 2019 ; Araujo et al., ming). The latter is related in turn to other factors; mainly, to the user recognition or internalization of particular beliefs on the capabilities of such instruments. The internalization could make the explainee accepting the explanation in the same terms as the explainer. This would be the secondary product of the interactive interplay between the two agents that aims to reach a consensus on an explanation. The benefits of internalization are several. When people generate explanations or imagine hypotheses about an event/system’s behavior (and thus internalize these), they increase their confidence in those possibilities they have synthesized. Three phenomena would support this (Koehler, 1991 ): (1) When we use a hypothesis, this benefits from an increase in confidence concerning the rest of the available options for reasoning or argument; (2) When a person is asked to provide an argument (which could be an explanation) to support a certain hypothesis, a collateral effect from that choice (or synthesis process) is that he/she tends to find that argument more plausible than others, and lastly (3) if for some reason, they believe a theory is correct, then they tend to express greater confidence both in its veracity and in the events that will occur from it. There are other factors on the goal itself (explanation acceptance) that could improve the process. Psychologists have determined that some criteria would be a priority to include in an explanation: necessary causes (vs. sufficient), intentional actions (vs. those taken without deliberation), proximal causes (vs. distant), details that allow distinguishing between fact and foil, and abnormal features among others (Weld & Bansal, 2018 ). The exploitation of such psychological features would facilitate the explainer to convince the explainee in ELES.

«25» Another important issue is the simplicity or minimalism of the explanation, mainly on the representation. According to Lombrozo ( 2007 ), humans prefer explanations that are simpler (i.e., contain fewer clauses), more general, and coherent (i.e., consistent with the human’s prior beliefs). She also highlights that our desire for simplicity goes so far that we even prefer simple (one clause) explanations to conjunctive explanations—even when the latter is likely to be more accurate than the single clause. This feature supports our idea of working with simple explanation models [(transforming if necessary, relatively more complex logical explanations (Booth et al., 2019 )] and try to find its minimalism. One can also take advantage of the study of the so-called conjunctive explanations , which are different explanations that, nevertheless, are more explanatory together than separately (Schupbach, 2019 ).

«26» These disclosures actually link two aims within XAI: understanding the event, and susceptibility to accept the explanation. According to Dudai and Evers ( 2014 ), understanding refers to the ability to generate a specific mental model (or a more comprehensive theory) that allows predictions based on the scientific reasoning about the system’s behavior. Subrahmanian and Kumar ( 2017 ) point out that the term understanding is often used in two different ways that do not imply each other. The first refers to the subjective feeling of having a given meaning to something ( we have interpreted it ). The second one refers to having perceived empirical regularities enabling us (subjectively) to predict. In some problems, it is dangerous to confuse them. The former is associated with knowledge whilst the latter could only be the source for solving a ML problem. The second notion could be considered as the descriptive understanding , according to Findl and Suárez ( 2021 ). In that paper, the authors study the case of using purely (descriptive) statistical epidemiological models as a tool for decision-making. This kind of model is quite distant from the explanatory understanding , the basis for scientific knowledge. It can be argued that the descriptive understanding could not be a solid basis for explaining insofar as they do not offer the necessary epistemological link to the Scientific Theory that supports our knowledge about the phenomena. However, what is undoubted is that this type of model (and similar) which have demonstrated their usefulness and, therefore, it is necessary to study their scope and characteristics (Findl & Suárez, 2021 ). For instance, investigating whether this kind of model enjoys of what Regt names Criterion for Understanding Phenomena and Criterion for the Intelligibility of Theories (de Regt, 2017 , Chap. 4), and how it can support its potential prediction-generating character.

All this thought is within a framework where notions as inference should not be understood as in Computer Science (Computational Logic). Rather, as a human process that does not need to be equivalent to a purely (formal) logical process. A number of non-purely (logical) rational types of information management/processing can be explained by techniques studied in BR, which cover all those reasoning techniques that we use in the face of our processing limitations. Circumscribed to ELES, it would even be necessary to study the effect of the explanation on the explainee’s beliefs. It would be necessary to reflect to what extent cognitive biases may affect human understanding of interpretable machine learning models, for example, rule systems. For instance, in Kliegr et al. ( 2021 ) the authors summarize them in the particular case of rule-based machine learning models (hence KL-based), pointing out the need for investigating human interpretability from the standpoint of Cognitive Science.

3 On the Sociological Factor

figure 1

Stratified architecture on which ELES appears

«27» As an inter-agent system, ELES owns a social dimension. One of the factors it has already been pointed out is the explainee’s trust in the explanation process itself. Two facets of trust can be studied. One is whether the explainee believes the explanation meets its beliefs about what it is sound explaining. The other one is the inter-agent level: the explainer aims to convince the explainee with the explanation. The latter is linked, as it has been argued, to the psychological dimension (and to our processing limitations). According to Cugueró-Escofet and Rosanas-Martí ( 2019 ), trust only makes sense in a BR context, where agents are not fully aware of their preferences and values. Trust allows both, the explainer and the explainee, to admit decisions that are not consistent with their beliefs, internal values, or preference systems (something that also could occur due to commercial or political interests, for instance). Trust influences the teleological understanding of Explaining activities, namely the explainee’s assumption of the explainer’s willingness to act according to the highest values. This is a possible reason to trust that its explanations are of best interest for both agents, even though it may not be the most attractive in terms of the immediate variables of both effort and results (Cugueró-Escofet & Rosanas-Martí, 2019 ). At the inter-agent level, the trust induces the explainee’s belief that the explainer will make decisions according to his current values (intentions, obligations, objectives, etc.) even when some variables push him in the opposite direction.

«28» Another social factor affecting the success of the explanation in ELES is its dependence on the audience’s degree of acceptance of the narrative presented. Combined use of several modules in the Data Science Project is time-dependent and susceptible to be explained by describing the trace of the experiments. Thus, there are narrative elements in the part dedicated to the description of the ML model and in the part devoted to justify and argue the decisions made by the CAIS [as for instance within dashboards (Jarke & Macgilchrist, 2021 )]. Their existence leads us to explore questions related to the context in which the agents work and the required explanation precision, which is related in turn to the human, social or legal acceptability. The explainer agent should transfer information about model accuracy, using if necessary metaphors to simplify the explanation. For example, employing exemplary models that share the most important variables and values involved in decision making (or the outcome) offered. Also, some strategies based on the adoption of human interpretations allow an excellent balance of performance and intelligibility (Weld & Bansal, 2019 ) [(see also Janssen et al. ( 2021 )].

figure 2

Socio-technical system where explaining issue is framed. Agents interacts to achieve consensual explanation

«29» Since the paper mainly focuses on Local Explaining (i.e., the XAI case in which the explainer agent aims to explain the result offered by the system), the socio-technical system stratifies in tree levels (Fig. 2 ). The system level concerns the system itself and the outcome to be explained, which depends on the use case: information about the CS, the system’s behavior, or the decision suggested by this, among others. The agent level comprises the agents involved in the process, with their associated characteristics. Lastly, the interaction level would be concerned with the documentation of the interaction, its trace, information about the acts of communication, and the final explanation.

3.1 On the balance between usefulness and soundness

«30» The above aspects underpin Lipton’s warning in Lipton ( 2018 ) on the need to achieve a balance between the effectiveness of models such as Deep Learning and the acceptance by humans of the results they offer. Several factors condition the agreement between agents (the explainee’s acceptance of an explanation of the AI-based system behavior) as a social construct. It is supposed that when the explainer interprets the results of the system for the explainee, he/she would be working under the basic principles of interpretive reasoning. The first and foremost is the so-called Restrictive Principle (Stern, 2005 ), that is, only reasonable explainees, who are familiar with the circumstances, understand what is at issue the way the explainer does. The principle shapes the explanation’s success but also its usefulness (as reusability).

«31» Also, the restrictive principle seems to claim that the explanation acceptance (and its internalization) also depends on the explainee’s ability to develop a (based on belief) mental model of how the tools work and what they actually do, as we have pointed out in Paragraph 24. Hinsen argued that such models are limited by the tool features that we need to know (with the advantages and limitations of that) (Hinsen, 2014 ). Thus, some of their mechanisms can be hidden or obviated, hence it may exist, to some extent, an undisclosable view of the mechanism/tool (Goebel et al., 2018 ). An extreme case appears when the system users are unable to make an informed decision between different models to choose the most convenient or efficient program, regardless of which model it implements.

«32» Other extreme scenario occurs when the explainer focuses on the goal of explainee’s acceptance, since the system is provably correct but the stakeholders demand an explanation before accepting the decision the system suggests. Among other options, fictionalization of the explanation (or the model) can contribute to the success. For example, Storytelling strategies that exploit metaphors associating explanations with explanatory traditions from other sciences are shown to be useful. The advantage of this kind of strategies is that they are focused on convincing the explainee, and therefore there is a certain relaxation of the completeness of the narrative/explanation. This approach is actually producing a narrative , not a truly explanation itself (compelling versus scientific soundness). Thus, the incompleteness of properties among fictional entities (those used to mount the metaphor) is not a simple anomaly (Margolis, 1983 ), actually is part of the strategy itself. A consequence could be that the explanation becomes doubtful for different explainees. Thus, the reason for the non-preservability of the validity would be its ontological status of the so-called embedded narrative , which are mental representations, produced by a history, that are virtual. They are not verified in the factual domain, being thus epistemologically weak insofar because they belong to the mental/subjective realm; they are susceptible to reinterpretation or transformation by another explainee. Explanation malleability represents a new source of risk. Finally, the uncontrolled modification of the explanation (and its practical consequences) across the organization can exacerbate the bad practices that XAI aims to prevent, as it already occurs in the Privacy field in DS.

4 Causal Versus Non-causal Explanations

«33» The complex relationship between logic -entailment- and causality has been largely studied in Philosophy of Science. Therefore, it is not surprising that dilemmas as causal versus non-causal explanations remain in force within XAI, even if it is blurred by the requirements of explainee’s acceptability of the explanation.

«34» Hobbes claimed that a phenomenon is explained when one assigns a cause to it. Knowing the causes of an event (in some of the various and disputed meanings of causation) is considered one of the most solid forms of explanation. Sometimes even more important than proper understanding of the event itself. The status of causality in XAI is possibly due, to the preponderant role that Physics has played and plays in Science in general, and therefore in Engineering. So much that mechanisms and causality are very useful resources in the enterprise of explaining and justifying events, theories and results. Also, the interplay between Philosophy and Science matters.

«35» This section does not intend to be a general review of the features and the causal versus non-causal dilemma in Explaining which are inherited in turn from Science [Miller’s ( 2019 ) contains a general discussion of the topic]. However, some notes on the role of causality in CAIS are necessary to frame up the following sections (several hard open problems in AI are intrinsically related to causality). Mainly, its relationship with the notion of mechanism (from Philosophy of Science and Physics), to the extent that many CAIS can be broken down into what can be considered their basic mechanisms. This idea is to support some arguments developed throughout the paper, such as our claim that logical mechanism (not understood here in a narrow sense) has sense in XAI because it shares some of the features with the classical notion of mechanism in Explaining. Although it could be far from a logical-computational paradigm, it is interesting to consider certain logical and mathematical steps as mechanisms, or at least to consider that they admit such a reading. Of course, this consideration is not free of controversy, since not all current trends in Explaining Research admit an identification of explanation with argument, and logic usually produces the latter (Huneman, 2018 ). Moreover, the mechanistic conception of Explaining usually breaks with the idea of explanation as entailment, which allows us to avoid some classic critics on this idea (Huneman, 2018 ). Although it is not our aim to rely on the mechanistic view for the discussion of logical mechanisms , we do believe it is necessary to devote some space to it. Mainly, due to three reasons that arise when we focus on ELES. The first is that ideas of Mechanicism are implicit in certain practices of ML researchers for explaining. We refer, for example, to those who analyze the numerical interaction of nodes or subnetworks within the whole network as mechanisms that, combined, explain the system. However, such explanation is not useful for the explainee (since it could be uninterpretable in practice), despite it could be a valid one from a mathematical point of view. Secondly, the observation mechanism does not imply achieving the understanding of the explainee. And lastly, these warnings may be valid even for the logical mechanisms, which makes their study necessary.

«36» Thinking about the problem in ELES, one has to recall how humans usually recognize causation. One form consists of comparing (possibly mentally) the outcome when an action is taken with the corresponding outcome when the action under study is withheld. If the two results differ, we say that the action has a causal or preventive effect on the result. Otherwise, we say that the action does not have a causal effect on the outcome. The idea also fits Craver's notion of the causally relevant variable (Craver, 2007): a variable X is causally relevant to the variable Y in the conditions W if any intervention on X in the conditions W changes the value of Y (or the probability distribution over the possible values of Y ) (Craver, 2007; Barberis, 2012). In ELES, the causation problem could be exacerbated because the explainee has to understand that it is the difference between the outcomes that makes the action or element causal, as well as understand the role of the features involved. The resource of considering mechanisms (actual, or as metaphors for system subprocesses or modules) could alleviate the knowledge gap.
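As a minimal illustration of this interventionist reading of causal relevance, the following Python sketch (the toy structural model, the variable names, and the numeric values are our own hypothetical choices, not taken from any cited system) compares the simulated outcome when an action is taken with the outcome when it is withheld, keeping the background conditions W fixed:

```python
import random

def outcome(action_taken: bool, w: float) -> float:
    """Toy structural model: the outcome depends on a background
    condition w and, causally, on whether the action is taken."""
    noise = random.gauss(0.0, 0.1)
    return 2.0 * w + (1.5 if action_taken else 0.0) + noise

def causal_effect_estimate(w: float, runs: int = 1000) -> float:
    """Average difference between intervening (action taken) and
    withholding the action, keeping the conditions w fixed."""
    diff = 0.0
    for _ in range(runs):
        diff += outcome(True, w) - outcome(False, w)
    return diff / runs

if __name__ == "__main__":
    effect = causal_effect_estimate(w=0.3)
    # A non-negligible average difference is the intervention-based
    # signal of causal relevance under conditions w (Craver's sense).
    print(f"Estimated effect of the action: {effect:.2f}")
```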

4.1 On the Role of Mechanisms

«37» Causality is often considered a mechanistic tool for explanation in Science, within the broad consensus in the Philosophy of Science about the soundness of the mechanistic conception of explanation (with reasonable discrepancies and evident weaknesses). According to this conception, to explain a phenomenon consists of displaying the relevant parts, activities, and organizational features of the mechanisms in which that phenomenon has taken place. Hereby, the search for an explanation would focus on the search for mechanisms that, combined in a certain way, produce the observed event as a final effect. Notwithstanding, the notion of mechanism itself may be subject to discussion; chiefly, the level of description can lie anywhere on a continuum from a mechanism sketch to an ideally complete description (Craver, 2006). An important observation to be taken into account is that mechanicism does not focus exclusively on the etiological explanations of the event, but rather on constitutive or componential explanations and their representation itself. In that sense it would be a purely epistemic approach, opposed to the ontic conception of the explanation as an object independent of its representation (Salmon, 1984).

«38» But what does mechanism mean here? Since different proposals exist, instead of embracing a particular definition, it is interesting for the paper to adopt Hedström and Ylikoski's vision (Hedström & Ylikoski, 2010). They argue that a mechanism can usually be identified with the kind of effect or phenomenon it produces; a mechanism is always a mechanism for something (Darden, 2006). In the authors' sense, a mechanism is an irreducibly causal notion: it refers to the entities of a causal process that produces the effect of interest. It is also necessary to take into account whether this mechanism is observable by the two agents involved (explainer and explainee) or, rather, what degree of disclosure it shows to them, since such a point affects the XAI problem.

4.2 On Mechanism Observability: Undisclosable or Explainable Ingredients

«39» The non-observability of the mechanism by some of the agents can be quite controversial. Options such as emergence-based explanations in Agent-Based Modeling of CS may suffer non-observability to some extent, producing what we could call epistemic gaps . We refer here not only to statistical-computational mechanisms but also to a certain intrinsic inaccessibility of the mechanism, such as the (inaccessible) links that connect micro and macro levels in emergence phenomena in CS and, particularly, in complex neural networks.

«40» Observability is linked to another characteristic feature of a mechanism, according to Hedström and Ylikoski: it has a structure . It should be possible to disclose it, making visible how the participating entities and their properties, activities, and relations produce the effect of interest. For example, focusing on some subnetwork of a complex neural network would allow the designer to understand its role/function within the overall system. However, embracing a KRR standpoint, a black box ML system might be undisclosable: even when it is opened, its actual logic-mathematical structure makes its behavior difficult for the layman to understand (according to Ihde's notion of High Technology). It can be concluded that, if it is required that any mechanism involved in the explanation can exhibit its structure, then some complex logic mechanisms (such as some modules within state-of-the-art SAT solvers) would not be considered as such (if one adopts the authors' vision). Finally, another feature is that the mechanism does not have to use only explainable ingredients. In fact, to build an explanation one can use non-explainable ingredients such as fundamental principles, elements that act as boundaries or limits . They would be considered explainable per se , or nomological.

4.3 Boundary Elements Versus Causality

«41» In scientific fields such as Physics, there is a growing tendency to propose non-causal models. This type of model poses serious foundational problems, because sound conditions would need to be outlined to decide whether an explanation is acceptable or not. By neglecting causality there is a risk of presenting models that provide only non-causal explanations. The risk is present even in state-of-the-art explaining systems such as LIME (Ribeiro et al., 2016) or SHAP (Lundberg & Lee, 2017), in which the role of some system features is drawn using statistical or game-theoretical tools, with causal factors remaining hidden from the user. This type of system helps the engineer to capture the generalizable patterns underlying the outputs of a system. Such patterns allow inferences to be made about the (potentially causal) connections between the inputs and outputs of the system. It would be necessary to discuss how to enrich these approaches to provide more convincing information from the explainee's viewpoint (including some representation of causality, if necessary) (Pearl, 2009; van der Waa et al., 2021), or techniques for emergent semantics [e.g. (Borrego-Díaz & Páez, 2022)].
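To make the statistical (rather than causal) character of these attributions concrete, the following sketch, in the spirit of a LIME-style local surrogate, perturbs an instance, queries an opaque model, and fits a locally weighted linear model to its responses; the black-box function, kernel width, and sampling scale are hypothetical placeholders rather than any library's actual implementation:

```python
import numpy as np

def black_box(X: np.ndarray) -> np.ndarray:
    """Hypothetical opaque model: a non-linear scoring function."""
    return np.tanh(2.0 * X[:, 0] - X[:, 1] ** 2 + 0.5 * X[:, 2])

def lime_style_weights(x: np.ndarray, n_samples: int = 500,
                       scale: float = 0.3, kernel_width: float = 0.75):
    """Fit a locally weighted linear surrogate around instance x."""
    rng = np.random.default_rng(0)
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))  # perturbations
    y = black_box(Z)
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)                 # proximity kernel
    A = np.hstack([Z, np.ones((n_samples, 1))])               # add intercept
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1]  # per-feature local attribution (intercept dropped)

if __name__ == "__main__":
    instance = np.array([0.2, -0.1, 0.4])
    print("Local feature attributions:", lime_style_weights(instance))
```

The attributions describe how the black box behaves around the instance; they say nothing, by themselves, about whether those features cause the outcome.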

«42» To analyze the issue in XAI -independently of the explaining model-, it is necessary to return to the Philosophy of Science. King ( 2020 ) proposes two conditions upon which to base the analysis of the model for XAI. The first is the Local Counterfactual Condition (LCC):

An explanatory model M provides counterfactual information that shows how the explanandum E depends on M and initial, boundary, and auxiliary conditions C.

The second King’s condition is called the Global Confirmation Addition (GCA):

An explanatory model M is a part of or may be in accordance with, a highly confirmed scientific theory T.

There is a risk of considering GCA inaccurate, although GCA could actually point to the nomological part of the explanation. It requires a careful analysis of what the three notions in GCA actually mean within the technological realm of XAI: to be a part of , to be in accordance with , and a highly confirmed theory (King, 2020). Continuing the parallel with Physics, some new models have such a degree of abstraction and epistemological richness that there may be conflicts in the model description itself, a circumstance that casts doubt on their usefulness. Overcoming the obvious distance, some CAIS may provoke similar doubts. CAIS are not only nonsymbolic systems such as complex neural networks; other systems, such as state-of-the-art SAT solvers, also fall into this category (Giráldez-Cru & Levy, 2016). Nomological aspects of explanations in the latter should be logical in nature, but not only that: others can represent data specifications, well-established algorithm schemes, etc.

«43» The assumption of the existence of limit/boundary ingredients in explanatory mechanisms (or in explanations in general) could serve to outline a frontier between causality and non-causality (Sullivan, 2019). Reutlinger (2014) defines a non-causal explanation (NC) as one that contains at least one non-causal element e and, in addition, e ensures the success of the explanation. The notion thus expressed would be problematic in XAI. If ensures means that it is a condition for an explanation, then NC is too exclusive, since any explanation that includes at least one non-causal element would be NC. We should accept the existence and use of certain boundary conditions (boundary elements) even if they are not causal in nature. Boundary conditions are necessary to construct the explanation, although no information is available about their causes (in fact, one would admit that such information is not actually necessary). In any case, by continually reducing ourselves to causes we would eventually come across some base or limit ingredient that is inherently non-causal. In CAIS, such elements would be elements of confirmed or well-established AI theories, or simple transformations of inputs (from sensors, for example). The most evident example would be the raw data provided by the perception or I/O modules. Non-causal ingredients elaborated from this kind of data could serve as a standpoint for the world representation (expressing restrictions on the features on which the explanation is based).

«44» The argument developed so far in this section seems to lead to a physicalist point of view of the explanation. For example, a backtracking analysis of causes leads to considering principles such as the so-called Causal Closure Principle (CCP): If a physical effect has a cause, then it has a sufficient physical cause (Dimitrijević, 2019; Papineau, 2001; Kim, 2005). This may seem an extreme position, but it would be necessary in safety-critical software, for example. Acceptance of CCP as a requirement would demand a rigorous treatment of all the elements involved from a mathematical, logical-computational standpoint (which could be counterproductive to the explainee's acceptance of the explanation). In any case, one would have to ask which the boundary elements are and whether they represent the laws of nature of the specific Computer Science or XAI ground theory, or even consensual beliefs in ELES. Notwithstanding, these kinds of principles can sometimes be hidden by the causal roles; this seems to be one of the more controversial points about Explaining practices in the new mechanicism (Huneman, 2018). If we want to take as much advantage as possible of the analogy between mechanism and logical mechanism, then it is necessary to determine such laws. Additionally, the boundary elements, concerning causality in the system to be explained, should be addressed. Both aims could greatly help the design of models for explaining beyond the statistical ones.

4.4 Laws of Nature and the Structure of Logic-Based Explanations

«45» The consideration of the Laws of Nature as ground ingredients for explanations is ubiquitous in the Philosophy of Science. Consider, for instance, Hempel's so-called Deductive-Nomological (DN) concept of explanation (Hempel, 1970). The first condition is that an explanation is an argument or inference equipped with propositions as premises and conclusions, together with the relationship between them (which is the expectation of obtaining the conclusion based on lawlike connections). Hempel's second condition states that the explanans must contain at least one law of nature (the nomological component), so that the derivation of the explanandum would not be valid if that premise were removed. One might wonder what these laws would look like when one wishes to transfer the requirements to XAI.

«46» The KL laws of nature can emerge if the explainer imposes some kind of minimalism on the elements of the explanation and grounds it on the background knowledge, to which such laws may already belong. These laws of the domain where the CAIS applies would be represented within the background knowledge. This way, Hempel's condition is satisfied.

«47» Even so, following the DN model means inheriting its associated issues, such as relevance and asymmetry (Woodward, 2019). Other authors are also concerned with pragmatics, such as van Fraassen (1980), who defends a pragmatic and intentional view of explanation. Among other proposals, he relies on the purely logical idea of explanation, although the relation is understood here as going from question to why-question, and the answer would be an essentially contrastive explanation. There are also other proposals to solve these difficulties, for example by extending the explanandum domain specification to represent richer representation frames [cf. Díez (2014)].

5 Grounding on (Computational) Logic Principles

«48» At the logical level, the explainer can exploit implication as a causality relation (although it sometimes actually represents correlation). It is a well-known and popular option with a longstanding history in AI (e.g., in Expert Systems) and represents a successful approach to the problem.
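A minimal sketch of this classic rule-based reading (the rules and facts are invented purely for illustration): a forward-chaining engine that records which implications fire, so that the chain of applied rules can be offered as the explanation of the derived conclusion.

```python
# Each rule is (premises, conclusion); the trace of fired rules
# doubles as a candidate explanation for the derived facts.
RULES = [
    ({"fever", "cough"}, "flu_suspected"),
    ({"flu_suspected", "high_risk_patient"}, "recommend_test"),
]

def forward_chain(facts: set):
    """Apply rules until no new fact is derived, recording the trace."""
    facts = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                trace.append((sorted(premises), conclusion))
                changed = True
    return facts, trace

if __name__ == "__main__":
    derived, explanation = forward_chain({"fever", "cough", "high_risk_patient"})
    for premises, conclusion in explanation:
        print(f"{' & '.join(premises)} -> {conclusion}")
```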

«49» The adoption of a paradigm from Computational Logic does not exempt us from encountering difficulties. The first is whether the right approach has been chosen. One of the risks that should be avoided when building explanatory models is the so-called Heuristic Fallacy (HF) (Gabbay & Woods, 2003 ):

Let H be a body of heuristics with respect to the construction of some theory T. If P is a belief from H which is indispensable to the construction of T,  then the unqualified inference that T is incomplete, unless it sanctions the derivation of P, is a fallacy.

A Computer Science vision would be to claim that HF deals -in XAI- with the problematic relationship between the approximation to a theory and its applicability. Accepting that the belief P is indispensable is very useful to ensure that the explanation (considered at one time as a specific theory) is acceptable. Gabbay and Woods aim higher (Gabbay & Woods, 2003); they argue that if the theorist avoids the fallacy, then it is likely that the procedures derived from the designed theory will be inapplicable (for example, due to computational complexity). Therefore, one could claim that theorists must avoid HF but, at the same time, adjust the theoretical models roughly enough that they admit realistic executions.

In ELES, another complexity ingredient comes from the difficulty of translating some logical features of the explanations into a language acceptable to, and intelligible by, a non-expert. Two elements frequently need to be translated: first, of course, the explanandum, but also the part of the Knowledge Base (KB) used to entail it (boundary elements, nomological components or laws of nature), that is, the initial hypotheses besides the inference links. The explanation link should likewise be translated or adapted when it is not comprehensible to the explainee (Booth et al., 2019).

5.1 Explaining by Using Knowledge Level-Based Surrogate Models

«50» The undertaking of building models to support explanation -especially for intelligent systems based on ML- covers various techniques, ranging from those specialized in Deep Learning (cf. Townsend et al., 2019) to logical causal models (in the tradition of classical Expert Systems). To approach the issue from the KL, the need to reconcile two levels of reasoning (the explainer's and the explainee's) through some accepted (consensual) model becomes more pressing. What Newell's Knowledge Level (KL) (Newell, 1982) paradigm can offer to meet the Explaining challenge is mainly explanation, interpretation, and justification. These are research practices deeply rooted in AI, as they provide reliability in autonomous systems for the decision-making process. However, what would be the status of a KL-based model within XAI? Can one consider it a surrogate ?

«51» The starting point is the simplest and currently most common use of surrogate models, namely, obtaining results at a lower (computational) cost than that of expensive (actual) experiments. The basic idea is that the surrogate model acts as a curve fitted to the available data so that it can predict results without resorting to the original. At a higher level of abstraction -and thinking of CS- the idea of building a surrogate model is very appealing.

«52» Surrogate models (cf. Forrester et al., 2008) represent a common approach in Engineering. They are a natural approach when attempting to understand a system's behavior or to explain a complex event. For instance, the model can approximate the system/event behavior under certain guidelines of rationality, allowing the engineer to work with the model to achieve plausible explanations of the original. An example of a surrogate model for the explanation itself is LIME (Ribeiro et al., 2016). Focusing on general XAI, one should consider some nuances, since the problem also aims to preserve correctness. Explaining a CAIS -its outcome or behavior- by working with approximate models blurs the border between the errors present in the original system and those produced by the approximative nature of the surrogate model itself. In other words, the source of explanations can at the same time be a source of new errors or misunderstandings, due to the granular view of the system's behavior that the surrogate model represents. Of course, this issue is not specific to XAI (it also occurs in Agent-Based Modeling of CS), although it is true that representational fidelity is decisive here.

«53» In contrast to the usual surrogate models in Engineering, within the KL the agents or systems mainly work with variants of symbolic (logic) reasoning. They seek to represent information from the environment in order to obtain conclusions by means of mechanized symbolic manipulations, without any intended meaning. In this way, it is only necessary to specify agent knowledge, beliefs, and goals. When considering KL-based surrogate models, it is necessary to assume the separation between the logical abstraction and the algorithms and implementation details of the inference/decision process itself. The separation of representation and reasoning modules aims to study the KRR's own problems without ambiguity and separately: representation on the one hand and reasoning on the other.

«54» Although KL-based models, such as rule-based ones, represent a sound solution for XAI, in general, providing a KL model for explaining does not necessarily solve the problem itself. Such a model could present complex behavior that is hard both to specify and to prove correct. The logic exploited in the model is not necessarily helpful for explaining the output. KRR technologies, those used at the internal level (on data) and at the external one (on the system), can be ontologically separated ; that is, they might be based on principles that are distinct or mutually uninterpretable (e.g. Evans et al. (2021)). This issue is frequent when working with the already mentioned emergence-based explanations, where the inference link can end up hidden within the ontological gap between description levels.

«55» Within the KL perspective, the models providing outcomes closer to the explainees' reasoning practices would be more likely to be accepted. Thus, the explainer will aim to synthesize an explanation similar to one a human would provide. If one wants to build KL models for surrogating, a proper reading of the factors that make surrogate models in Engineering (quantitative in nature) useful will be necessary. Roughly, these come from training an explainable model on the original inputs and the predictions of the complex model. Here we refer to models that are explainable for the stakeholder, such as linear regression or decision trees, which are accepted as explainable in the Engineering community. Just as classic surrogate models are useful for explaining non-linear, non-monotonic models, the ideal KL-based surrogate model would rely on basic inference mechanisms to explain the event, the explanandum. It is in this respect that the authors think BR solutions could aid in building KL models. Another aspect that reinforces this position is that KL-based models can provide natural solutions to the interactive phase preceding the acceptance of the explanation, by providing arguments from system behavior, as, for example, DARPA proposes (DARPA, 2016).
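The quantitative reading of this idea is the familiar global surrogate: an interpretable model is trained on the original inputs and the complex model's predictions, and the fitted surrogate is then inspected. A minimal scikit-learn sketch follows (the synthetic dataset, the random forest standing in for the complex model, and the tree depth are placeholder assumptions; a KL-based surrogate would replace the tree with symbolic inference):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                       # synthetic inputs
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.1, 1000)

black_box = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Global surrogate: an explainable model fitted to the black box's
# predictions (not to the ground truth), so that it mimics the complex model.
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how well does the surrogate reproduce the black box?
fidelity = surrogate.score(X, black_box.predict(X))
print(f"Surrogate fidelity (R^2 vs. black box): {fidelity:.2f}")
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(4)]))
```

The printed tree is the explanation artifact; its fidelity score makes explicit how much of the original behavior the surrogate actually captures, the very blurring of errors discussed in Paragraph 52.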

6 Explaining in Data Science. Curation and Perspectivism

«56» The assistance of data scientists to stakeholders studying/taming the behavior of a CS (for example, explaining or assisting particular decision-making on a social network) raises the question of what would be a plausible explanation acceptable to the latter. For example, general CS -such as Urban Mobility Systems or Smartgrids- have multiple levels, which are open , that is, they are influenced by the outside and interact with it. Thus, an explanation focusing on a restrictive view does not necessarily provide the best answer (or even a correct answer), so it is reasonable to expect that it would not offer general solutions to the explaining problem. An alternative path to mitigate this deficiency would be to address its adequacy from the perspective data scientists are led to by the selective access to massive data, as well as by the inevitably biased selection of dimensions, features, and the datasets themselves. That is the perspective that emerges from the data, from their curation and exploration, compromising the desired independence of the observer's view. Such a standpoint -underpinning the human factor that configures the perspective- is a particular instance of a more general concern. In research areas such as human cognition and quantum physics, traditional science is being questioned as an independent-observer approach, inspiring a Second-Order Science (SOS) (Müller & Riegler, 2014) that analyzes the challenge from a meta-science level.

«57» Perspectivism emerges from the premise that all perception and ideation take place from a particular perspective (i.e., from a particular cognitive point of view). It assumes the existence of different conceptual schemes -arising from perspectives- which influence how the phenomenon will be understood, as well as the judgment of its veracity. Although it is assumed that there is no single true perspective from which to explain the world, it does not follow that all perspectives are equally valid. In a perspectivist view, Science would be primarily observer-dependent. We are witnessing a growing recognition in scientific studies that most scientific knowledge is perspectival (Alrøe & Noe, 2014). The context established by a scientific discipline is decisive for the kind of observations, and hence for the results. Transferring the idea to a more modest environment, the same ground phenomenon occurs intra-theory, that is, with different contextual observations grounded on the same theory.

«58» What is claimed here is that explanations are inherently perspectivist artifacts when working with CS in ELES. From an inter -theory standpoint, Perspectivism claims the existence of different scientific perspectives from which to analyze a complex problem, all of which can bring value to the study, knowing that, in the case of general CS, a single scientific discipline cannot provide sound solutions to their complex behavior. This limitation reinforces the role of scientific models in the absence of models controversy in Data Science (Sect. 6.6 below).

6.1 Towards Perspectivism in Data Science. Curation

«59» The fast development of our digital ecosystem confronts us with data and information on extremely complex problems. The Internet provides an astonishing amount of information about all kinds of phenomena and events (social, humanitarian, economic crises, etc.). Within the DS universe, an interesting and valuable kind of information comes from the very effects of important socio-technological problems. Paradoxically, although this could help to understand and manage them, the wealth of data on the event could prevent its use. It is unavoidable to select, to curate , data. Consequently, the solution proposed for the problem will depend on how the data scientist has approached it (profiling the starting conditions and the concept of solution) and vice versa. The ad hoc parameters -with which the soundness of the solution will be evaluated- are also specified. Once the aim has been profiled, data scientists resume data curation practices [this stage would represent the foraging loop in the sensemaking process (Pirolli & Card, 2005)]. Therefore, the definition of the problem ultimately depends on the sketch of the solution (and thus on the explanation as well). This loop can make it challenging to obtain a proper formal specification of the problem in BD, which leads to opting for a description of some similar well-behaved problem that scientists can solve, and to claiming that this is the problem to solve (Rittel & Webber, 1973). The problem has been extensively studied and contextualized in other domain fields [e.g. Leonelli (2016)].

«60» One could reasonably conclude that this kind of practice amounts to an actual perspectivist approach to data processing in BD, that is, a perspectival Data Science that can be interpreted as the translation to DS of sensemaking in Intelligent Systems (Klein et al., 2006). In DARPA (2016), the DARPA agency motivates the Explainability Challenge approach in Data Science with -among other arguments- the observation that decisions assisted by BD analytics need a selection of which resources will be the target of study to support evidence in the analysis. Curation itself may induce failures or errors that need to be analyzed, in order to refine both the procedure and the content curation. The provision of effective explanations obtained from robustly curated data would greatly aid XAI (DARPA, 2016). From a philosophical standpoint, one can dare to suggest that what DARPA proposes goes beyond solving the challenges of data curation. The agency seems to point out the need for the design and practice of data hermeneutics (Romele et al., 2020; Gerbaudo, 2020), which would cover the overall process, the result, and the extraction and curation policies.

«61» In spite of the problems, data scientists perform data curation even for finding explanations. Moreover, they actually curate based on BR practices. While exploring data, the explainer uses intuition or heuristics, as in other processes of Knowledge Management [for which several features have been identified (Hvoreckỳ et al., 2013)]. Since several of these skills affect the results, the overall outcomes should be considered the product of a certain perspective, built to explain the system's behavior in the philosophical sense (see Sect. 7.3 below). The existence of several explanations (supported by different theories, experiments, tools, concepts, or categories) supporting a CAIS-based decision leads us to revisit the question of what a correct explanation would be. The question reminds us of the persistent challenge of how to work with a non-unifiable plurality of partial knowledge (Longino, 2006) [see also (Alrøe & Noe, 2014)].

«62» Turning back to ELES, another circumstance that strengthens the idea of explaining as an outcome of perspectival DS practices is that many BD problems come from well-known classical social or technological phenomena. Data are useful, for example, when scientists aim to engineer a socio-technical system producing similar data from its observed performance (Jones et al., 2013). But this information also becomes a new ingredient to be used when tackling old and well-known challenges [e.g. the wicked problems (Rittel & Webber, 1973)]. The nature of this kind of information suggests some interesting questions beyond XAI: Can AI aid experts in taming these problems, on which they have a large amount of data? Is it possible to address the problem by reasoning with knowledge extracted from that source? These are two relevant questions, because technological solutions supporting affirmative answers have to be explained. For ELES, AI tools can provide (statistical, logical, or other) support to the decisions or information received, although with particular nuances. Here the explanation of the decision does not aim to convince of its correctness, only of its satisfiability.

6.2 What Role Does the Explainee's Understanding Play in Data Science?

«63» It has been argued that cognitive factors play a relevant role in the explainee's acceptance of the model. In ELES, it is necessary to keep in mind that a relaxation of the -logical, mathematical, or statistical- requirements should not lead to the emergence of misconceptions, such as statistical fallacies concealed within High Technology. One such risk is considering a particular phenomenon significant simply because a large amount of data is available. An example is the Gambler's fallacy : if something has happened more frequently than usual, then it is now less likely to happen in the future. Techniques based on the Bonferroni principle can help to identify such (random) occurrences and avoid treating them as actual phenomena. In addition, during the data exploration phase, the data scientist can decide to focus on particular feature sets and study the relationships among them. There are many more risks, different illusions of validity associated with the processing and exploration of data [cf. (Aronson, 2011), chapter 2]. This leads to the necessity of reconsidering both the features and the data dependencies (Gajdoš & Snášel, 2014). These risks are shared by the explainer and the explainee agents.
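As an illustration of the kind of safeguard meant here, the following sketch (with purely random, simulated data; the sample sizes and significance level are arbitrary choices) shows how a Bonferroni-corrected threshold removes the spurious "findings" that naive multiple testing produces:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_features, alpha = 200, 500, 0.05

X = rng.normal(size=(n_samples, n_features))   # purely random features
y = rng.normal(size=n_samples)                 # outcome unrelated to X

# Test every feature against the outcome.
p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_features)])

naive_hits = int((p_values < alpha).sum())
bonferroni_hits = int((p_values < alpha / n_features).sum())

# With 500 tests at alpha=0.05 we expect roughly 25 spurious "discoveries";
# the Bonferroni-corrected threshold removes virtually all of them.
print(f"Uncorrected 'significant' features: {naive_hits}")
print(f"Bonferroni-corrected:               {bonferroni_hits}")
```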

«64» Even when using intelligible models for the stakeholder, it is likely that the process of explaining remains an interactive and contrastive task, something deeply analyzed in XAI (Stepin et al., 2021). It involves questions such as What happens if a condition is altered or eliminated? or What happens if the condition used in the explanation does not occur? The explainee could also ask for different model instances (even different models). Lieto, Lebiere and Oltramari (Lieto et al., 2018) refer to two further problems, common to most cognitive architectures (CA), affecting their representation level: the limited size and the homogeneous typology of coded and processed knowledge. Such issues would be inherited in perspectival DS. It is worth noting that they are not purely technological problems, but also epistemological in nature. For example, they could limit the plausibility of comparing mechanisms of representation and processing of knowledge with those executed by humans in their daily activities.

6.3 The Data Scientist Within the Loop

«65» An extreme case of perspectival DS is that of the promising citizen data science projects, in which citizens will handle auto-ML systems (e.g. implemented in their smartphones). The user/observer would be embedded in the context itself, hence the observations and data sources would be curated by him/her. Such circularity should be taken into consideration, since it can even affect the desired causality (Füllsack, 2014). Something similar occurs in XAI for Neuroscience, in particular when assisting the closed-loop approach to treatments or neurostimulation (Fellous et al., 2019). Another similar case of an embodied explainer would be the challenge of self-explanatory machines, for instance those with self-diagnosis capabilities. In the event of an incident, an autonomous car with diagnostic capacity would check whether it bears the responsibility (leaving aside for the moment that the idea of a full autonomy obviating the need for human-machine collaboration is very arguable (Bradshaw et al., 2013)). To carry out the task, the agent must work within a very complex system of responsibility relationships and role-taking modules (Kridalukmana et al., 2020). Self-diagnosis becomes a true challenge, since the agent is located in the environment where data are collected and hence can influence it. Recently, the National Transportation Safety Board Office of Public Affairs (NTSB) provided factual information via a public docket for two Tesla accident investigations. Part of the information is retrieved from the vehicle, which represents the ground documentation for synthesizing the explanation (the diagnosis of the accident). In this case, the diagnosis must be not only rigorous; it must also be intelligible to political and business leaders (as the stakeholders they are in an ELES), beyond the simplified explanation to be provided to the public media. Another factor that is not discussed in the paper -and that plays its role in the case of unmanned vehicles- is the compulsory requirement to offer morally acceptable explanations from machines (with morality learned from ourselves (Awad et al., 2020)). Such a requirement is only a particular case of the general challenge of deciding whether the system has embedded certain values, such as fairness, transparency, explainability, and accountability (van de Poel, 2020). Their compliance can, in turn, be a source of new explanation needs.

«66» A similar sort of circularity occurs in monitoring, personalization, or recommender systems. Their daily use involves explainees (clients or users) who observe and confront their observations with the explanation offered by the explainer. The observations -and their evaluation in the light of the explanation- may be affected by biases (in fact, manipulation of the model can lead to accepting biased models, one of the vulnerabilities of XAI solutions (Slack et al., 2021)). Other limitations also arise, such as slowness, imprecision, subjectivity, and the need for granularity, which are typical of human perception and cognition (Anderson & Perona, 2014).

«67» Working within a massive data framework exacerbates some of the above issues. In BD, data scientists are dealing with software that needs to manage thousands of features [among other issues (Li & Liu, 2017)]. Moreover, performance requirements are likely to force the adoption of methods that are notoriously difficult (or impossible) to explain; unconscious human skills play a relevant role. This is the case with complex deep neural networks or enhanced decision forests (Weld & Bansal, 2019). It is often the case that post-hoc explanations are the only way to facilitate human understanding.

6.4 On the Role of Semantics

«68» Focusing on the potential use of the KL ideas in DS (which could be considered a limited case of technical explainability), problems of similar magnitude arise. Even focusing on the Semantic Web vision or the application of semantic technologies, one faces the classic KL challenges but on a larger scale. The treatment of inconsistent/incompatible features is only one example among many difficulties (Alonso-Jiménez et al., 2006). There are other similarly complex problems, for example issues related to incompleteness, or those associated with the complexity of the involved ontologies that have become actual standards, such as the Gene Ontology (GO) in Biotechnology. Another issue would be the lack of relevant metadata, which is unknown and unknowable due to the impossibility of inferring it by means of some intelligent method (a typical case when working with knowledge graphs).

«69» The incorporation of Semantic Technologies into massive data analysis -such as those applied to knowledge graphs (Nickel et al., 2016)- is promoting AI systems that deal with elements closer to the user's mental models than purely numerical ones. They could even offer complementary information about the result that could serve as an approach to explaining. In this case, powerful tools to represent the reasoning followed by the algorithm -closer to the explainee's literacy- would be available (Wang et al., 2019; Borrego-Díaz & Chávez-González, 2006; Aranda-Corral & Borrego-Díaz, 2010). Furthermore, rather than attempting to confirm the explanation through purely deductive approaches, semantic resources such as Linked Data can facilitate the search for and analysis of counterfactuals (Janssen & Kuk, 2016), instead of simply collecting a representative sample of data to confirm our theory.

6.5 Explaining and Predicting

«70» So far, the section has mainly focused on factors concerning the quality of the explainer's argument (i.e., the explanation). We have claimed that these can be variable, context-dependent, and driven by ultimate aims, but their reusability has not been mentioned yet. Often the aim is not only to explain other events but also to provide an aid to prediction. A post-hoc explanation that does not enjoy some predictive usefulness may not be enough when facing similar events. The scale of the problem is once again a determining factor that conditions such an expectation.

«71» The emphasis on prediction from learning is what actually endows several AI-based solutions with meaning and utility. It is assumed that the knowledge (belief) recovered must allow meaningful predictions to be made; it is not enough to explain what happened. The requirement should be similar to the one Karl Popper proposed for scientific theories. By demanding meaningful predictions, we are implicitly admitting that experiments or scenarios can be put forward that call into question the causes and explanations. In this way, we strengthen our model through the contrastive explanation of the phenomena.

«72» The enormous empirical (and theoretical) uncertainty in massive data processing tends to overwhelm attempts at reliable prediction in many socio-technological realms. Its use to talk about the future is limited by foundational (teleological) issues, and it should be so in order to adhere to best practices. In these cases, predictive modeling may be more useful as a heuristic tool for generating possible scenarios than as a producer of specific policy advice in ELES. In other fields, such as Computational Social Science, researchers urge combining explanation and prediction in order to tackle data challenges (Hofman et al., 2021).

«73» Popper’s requirements admit a particular reading in the case of massive data. The vertiginous advance in algorithms and technology in DS opens a significant gap between the safety of Science and experimental results on the one hand, and the use of algorithms (considered as) valid or useful on the other. For instance, there is a new need of analyzing the sensitivity of the inference in BD to changes in the initial hypotheses, to understand the degree of robustness of the results (either decisions or explanations) concerning certain features. Also, in BD, the problem of causality from data worsening due to -among others- its multidisciplinary nature (Wong, 2020 ).The contrastive dialogue among DS and Scientific Theories is not an easy undertaking. One of the reasons is the arguable role that part of the DS community endorses in the domain theory, that is, the scientific counterpart of DS practices.

6.6 The Absence of Models and Purely Empiricism-Based Explanations

«74» Although the requirements for the explanation could be mandatory for our confidence in critical AI-based systems, one should not expect these artifacts to dispel the confusion between information about an object and knowledge about it. This issue has proven to be very dangerous for our society today. The Scientific Community is very attentive to, and concerned about, the challenges inherent to data processing and the impressive deployment of AI techniques. In a BD context, there are additional problems that come from the hype itself. One of them is the overestimation of the capabilities of High Technology, which is grounded neither on a universally accepted explanation of the decisions (or the solutions provided by massive data processing systems) nor on the existence of a scientific or social model to support the explanation displayed by the AI engineer. This overestimation is an actual social belief, fed by Social Media, which distorts a proper level of confidence in such systems.

«75» However, these are not the only issues. Sometimes the social belief in the validity and usefulness of the system is wrongly supported by the large volume and heterogeneity of the data it can digest. Volume , as an isolated dimension, does not characterize BD by itself. The scale jump must also affect other dimensions, such as Velocity and Variety , in such a way that a change of paradigm is needed because the classical solutions are no longer suitable. We are also facing new problems related to Veracity issues, such as the so-called absence of models .

«76» In 2008, Chris Anderson, editor of Wired magazine, published an article on the data tsunami and Science, within a special issue on DS in the face of the huge amount of data that was already flooding the technological and scientific landscape (Anderson, 2008). The main thesis the text supports is that the application of techniques for massive data makes one of scientists' fundamental activities unnecessary, namely the construction of models that explain the associated reality. In Anderson's article, George Box's famous maxim ( All models are wrong, but some are useful ) is confronted with Peter Norvig's All models are wrong, and increasingly you can succeed without them .

«77» Anderson’s thesis partly justifies the data scientist’s temptation to work with systems without worrying about whether there exists a (domain) theory to support that their commercially valid products are correct . They do not care about models because they do not really need them. Furthermore, engineers do not need an explanation of the validity of their decisions (mainly justification) because it actually does not add value to the product. It is not a mild concern; systems of this type will make (or are making) decisions that will seriously affect our rights and daily lives. The absence of models (to get scientific explanations) can cause serious defenselessness, especially if the systems are used in sensitive fields such as Predicitive Policing (Hung & Yen, 2020 ).

«78» Therefore -according to Anderson- we are faced with the surprising conclusion that mere correlation is enough; we can forget about causality. Consequently, socially and psychologically oriented systems are exempted from providing causal/scientific explanations to justify engineering decisions (or events). For instance, we do not need to know why people behave as they do if we can measure their behavior with accuracy and draw valid consequences from those measurements by applying mathematics ( the numbers speak for themselves ).

«79» In conclusion, in the PetaByte Age, DS teams seem condemned to refrain from building models and validating them, because massive data mining would suffice for their purposes. They use ML to offer solutions as oracles , implicitly solving a foundational dilemma: would we rather know reliably that something happens, or understand why it happens at the expense of losing experimental reliability? The first option allows us to act, while the second allows us to design strategies to adapt. It is clear that this dilemma has a strong impact on XAI efforts. We advance here a serious drawback to Anderson's thesis, which impacts Data Science practice. As de Regt points out (de Regt, 2017), even in the hypothetical case of having a perfect oracle that guarantees empirical correctness, our system could be epistemologically weak. The availability of the oracle would not exempt the data scientist from the need to open that black box. The scientist needs to understand (apprehend a general scheme of the theory T governing that oracle) in order to be able to qualitatively recognize characteristic consequences of T without performing exact calculations (de Regt, 2017).

6.6.1 Supporting Anderson’s Thesis

«80» Alternative reasons can be provided to support the basis on which Anderson builds his argument. It has already been noted that in BD one of the first problems is, not uncommonly, that data scientists themselves are unable to state the specific hypotheses to be tested until they perform a first exploration stage (which is indeed based on data curation). This peculiarity leads us to conclude that the classical approach to explaining (validating the hypotheses obtained from models) could be inadequate. Another way to justify this would be to argue that a model is not defined because the dataset is the digital mirror of a CS. Since the data scientist does not know how to reconstruct an image (a model) of the complete CS, he/she merely applies ML to find interesting features of the dataset. This is how the data scientist explains why a particular event occurs. Even if an explanatory ML technique is selected, it does not guarantee the soundness of the explanation, due to previous decisions such as those made in the curation phase (as discussed in Sect. 6.1), which in turn outline the starting conditions.

«81» Even if the reader acknowledges the great misgivings, one should accept that Anderson's thesis is partly right about DS practice. His statement We don't define the conditions of the experiments, so we don't know what we're capturing is true insofar as exploration and analysis in BD do not always start from a specific goal. There is no actual knowledge to validate; rather, the aim is to find the reasons why the data are as they are, and also to infer properties of reality from this analysis.

«82» At this point, and limiting ourselves to XAI, one could add an (intermediate) third proposal to elucidate the dilemma of Paragraph 79. The idea is to achieve acceptability by finding the justification of the decision taken by the AI-based system in each case, the local solution to XAI (where the explanation is synonymous with outcome justification). Moreover, it would be necessary to opt for an interpretation of the concept of justification as something that makes belief objectively more likely to be valid, as opposed to another interpretation of explanation as something that adequately points to belief in the truth. Belief is justified by the fact that it is properly held or based on an adequate method, to the extent that truth is the objective or the norm, a proper-aim justification (Graham, 2010).

6.6.2 Absence of Models Versus Veracity

«83» Anderson’s controversial thesis was widely contested [see e.g. Barrowman ( 2014 )]. Indeed his arguments are weakened when one needs particular requirements. For example, when it desires data processing systems allow the data scientist to extract some solid methodology. This is something that scientific theories should do quite well, according to Popper’s insights: may serve to predict like a predictor or also as a key, to tell us what would happen if some important factor is changed in our model.

«84» The absence of models in DS affects four essential scientific dimensions: the causality mentioned above, confidence in the results, the applicability of the model to data other than those used in the training phase, and, finally, its ability to explain what is happening. Following the idea sketched in Paragraph 82, the explanation becomes a simple -but limited- account of the particular event (local XAI), even accepting the risk that it may be defective . However, the challenge that comes from data curation persists: how do we estimate the suitability of the dataset used (once the Extraction-Transformation-Loading process is applied) to synthesize the solution? The question to address is, in fact, the veracity of the dataset.

«85» Working under the absence of models, the dependence on data curation, and the scale-versus-semantics challenge suggest that Veracity becomes a fundamental dimension to consider in BD, besides Volume, Variety and Velocity. In some sense, Veracity is the notion built to tackle the problem of the gap between Science and Technology in massive data processing.

«86» When an engineer works with traditional databases, he/she assumes that the domain is soundly represented by the data, the data themselves being a model. In contrast, in BD it is usual to work with untrue datasets: heterogeneous and unstructured data, missing data, data distortion, incompleteness, noise, etc. These are actual shortcomings that cause the loss of the security in the inferred results offered by traditional databases, damaging the link between the data and the actual entities they model.
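A rough, minimal sketch of the kind of veracity audit this implies (the dataset, columns, and plausibility ranges are hypothetical): check missingness, duplication, and out-of-range values before any explanation is synthesized from the data.

```python
import pandas as pd

def veracity_report(df: pd.DataFrame, valid_ranges: dict) -> dict:
    """Crude data-veracity audit: missingness, duplication and
    values outside the ranges the domain declares plausible."""
    report = {
        "missing_ratio": df.isna().mean().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    out_of_range = {}
    for col, (lo, hi) in valid_ranges.items():
        # NaN entries also count as "not within the plausible range".
        out_of_range[col] = int((~df[col].between(lo, hi)).sum())
    report["out_of_range"] = out_of_range
    return report

if __name__ == "__main__":
    df = pd.DataFrame({"age": [34, 290, None, 41],
                       "income": [30e3, 42e3, 39e3, -5]})
    print(veracity_report(df, {"age": (0, 120), "income": (0, 1e7)}))
```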

«87» We adopt the meaning of Veracity as referring to how precise or valid a dataset is, that is to say, to the fidelity of the data with respect to the reality they represent. However, in the context of BD, the term has an additional meaning: Veracity would also encompass the question of the reliability of the data source and the confidence in the data processing. These are issues to be studied, as they play a relevant role in several questions: biases, anomalies, inconsistencies, and others associated with the processing itself. It becomes a critical issue to study in the new systems (AA, 2015), and a mandatory one if the agents aim to abandon the idea that ML is data alchemy that exempts the explainer from being accountable.

«88» Particularly relevant is the distinction between a rough meaning of veracity, associated with the confidence in the digital picture that data represent in DS practices (quality, safety, accuracy, completeness of the information, etc.), and more formalized concepts related to the correctness and validity of the results. Regarding the latter, databases can be understood as formal models of the set of definition schemes that govern them [a foundational principle accepted in Deductive Database Theory, and also in the Philosophy of Science (Leonelli, 2016)]. The latter is, in turn, a formal theory that represents knowledge about the universe from which the data were extracted. If the data scientist identifies both notions, Veracity is closely related to the well-known problems of Knowledge Engineering (or, if a certain standard in database definition schemes is required, to those of the Semantic Web (Alonso-Jiménez et al., 2006)). Since the explanation synthesized is based on the data, it depends on truthfulness in both directions.

6.7 Foundational Issues

«89» Bearing in mind the warnings about veracity and the absence (of use) of scientific models in DS discussed above, the reader may agree that solving the explaining problem may hide the already mentioned problem of confusing data with reality. There is a common foundational issue suffered by solutions assisted by CAIS. Data scientists do not work with the problem they intend to solve (which may be sociological in nature, for example), but with the data that the problem leaves as a digital footprint. From this source they resort to the effectory capacity of DS, and thus the solutions remain framed by it. Accordingly, it is reasonable to conclude that the solution offered -the explanation shown about the phenomenon of CAIS behavior- will be intrinsically limited by that datified image, which we call the digital shadow . Even if agents attempt to build a scientific model using data and AI solutions, they could doubt whether the source, the digital shadow, has distorted the informational structure of reality.

«90» A canonical example is the identification of an individual with the collection of information about him or her available in BD repositories. The question is not whether the identification is valid (an issue for Privacy researchers), but how identities and experience (dis)appear from BD (Ricker, 2017). The overall digital shadow of a CS is only the source from which to provide plans and actions to be applied to the CS. In the era of hyper-connectivity and ubiquity, data scientists are still chained at the bottom of the cave envisioned by Plato, from which we only perceive the (digital) shadow of the events or objects, ideas and concepts that move through technological reality. The shadow is made up of an unmanageable amount of data that reflect the dynamics or the form of these entities. It also reflects the (social) customs and attitudes of our fellow human beings and serves to nurture AI-based systems. And, as in the cave myth, it is with the shadow that we try to extract properties and understand the original object/event. Note that the issue is not a sort of perspectivism; it is rather a kind of technological solipsism, the source on which perspectival DS practices are nurtured.

«91» Unfortunately, the risk of confusing properties of the shadow with properties of the object of study will thus be persistent; it could even be considered a transcription of Anderson's thesis. Following the metaphor, massive data also suffer distortions, as shadows do. The detailed information provided by massive data availability can give the data scientist the illusion of owning a faithful representation of the object/event, although it is only information about it and, as such, subject to diverse constraints: on the one hand, to interpretation (particularly, to its very intentionality); on the other hand, and related to the former, to the context from which it was extracted and to the perspectival labor of the data scientist. Moreover, it is constrained by the engineer's ability to work properly with the available data. Such factors limit the capacity of the AI system to extract knowledge from that information, particularly when KL-inspired explanations are sought.

«92» M. Janssen and G. Kuk claim that the greatest risk is that data become reality (Janssen & Kuk, 2016). One could assert that a defective or incomplete treatment goes unnoticed if reality is not observed. Hence, it becomes essential to compare the results with data from reality whenever possible; the results from each step of the Data Science project should be contrasted with it. It is this confrontation that may require deeper knowledge of the situation and, at the same time, where another opportunity for scientific theories reappears (including for Explaining).

«93» A conjecture in this regard is based on the impression -shaped by practice- that the closer the data are to representing the intentionality of the system/problem under study (e.g. the study of meme diffusion within a social network can be addressed as the study of a social network devoted to spreading memes), the more successful the actions are. We also claim that this is possibly due to the human perception of epistemological similarity between the digital shadow and the ingredients of the phenomenon it observes. An example would be the provision of powerful AI-based tools for monitoring social media opinion/sentiment, which are constantly improving and addressing new challenges (Cambria et al., 2013). Nevertheless, when the application scope is more anchored in the CS's physical reality, the shortcomings of AI-based systems are more clearly observed. These are concerns -wicked problems- such as humanitarian crises (Meier, 2015), political/sectarian violence or physical sociological phenomena (Subrahmanian & Kumar, 2017), terrorism (Johnson et al., 2015), food crises, climate change or sustainable development. It is no wonder they belong to a class of problems that need major interdisciplinary collaboration to address many relevant aspects, such as disagreements about what the problem actually is, or even the existence of contradictory solutions (Rittel & Webber, 1973). This kind of complex problem requires more than massive data processing, even challenging interdisciplinarity itself. Due to the hype phenomenon around AI, the idea that AI-based processing suffices to solve such challenges is so widespread that stakeholders rely on (and finance) this type of application of dubious effectiveness. This is a way of falling into the absence of models issue.

«94» It remains to be discussed whether some of these foundational issues can be tackled using weapons of a similar level of abstraction in selected case studies. A potential option along this line would be the design of (phenomenological in nature) scientific models having inputs/outputs similar to those of the event/system, to both explain and provide the causal dependencies between input and output. These are truly XAI-driven surrogate models, like those used for modeling explaining in complex neural networks. The preventive objection would be that, even if mathematically sound, a phenomenological model like this is not necessarily a purely explanatory artifact. Genuine explanatory models attempt to describe something more, namely the mechanism responsible for the various regularities in the phenomenon. The difference is well known. For example, the explanation -by means of equations- of certain relationships between different variables may be phenomenological in the first instance but not necessarily explanatory. Equations establish the appropriate bridge between input/output data but, in principle, do not describe the parts of any mechanism; equations do not disclose . Since, in phenomenologically-based models, the parts the designers postulate have no ontic counterpart in the mechanism, one should refrain from speaking about them as a successful representational relationship . Therefore, there is no reason to avoid speaking of mere useful fictions (Barberis, 2012). Recall that this had already been raised in Paragraph 28, regarding the need for metaphors and simplifications when trying to offer the explainee accessible versions of the explanation [see e.g. the visual metaphor for the analysis of arguments with ontologies (Borrego-Díaz & Chávez-González, 2006; Aranda-Corral & Borrego-Díaz, 2010)]. Paradoxically, this kind of model for explanation is not purely explanatory in XAI. They are more like extensionally appropriate models to support another explanatory theory [an example would be (Evans et al., 2021)], a more useful theory providing more descriptive answers for a wider range of counterfactual questions (what would happen if things were different) (Barberis, 2012). Moreover, other models approximating the system behavior could be built using the program itself (for example, employing data mining with Inductive Logic Programming). Other facts confirm the fickleness of these models for Explaining, particularly from the sociological point of view. For example, the adequacy of the explanation from the model may depend on whether its outcomes match the explainee's expectations (Sect. 2). When it is confirmatory -there is a match- factual explanations (which might be closer to an input/output explanation) are accepted, whereas it seems that, when there is a mismatch, counterfactuals can aid, although they are necessary but not sufficient (Riveiro & Thill, 2021).

7 Bounded Rationality, Explaining and Logics

«95» It is common to be satisfied with limited explanations of events, lacking a rigorous and plausible inference. We accept an explanation that is possibly not the best one but is, for example, socially admitted or aligned with certain cognitive preferences. Among the decisions humans make, non-optimal ones abound, and some are not even rational. Consequently, our explanations may suffer from similar limitations, as may our perception itself. For example, human performance on perceptual classification seems to approach that of an ideal observer, but economic factors (time spent, details of perception, or inconsistent and intransitive preferences) cause it to deteriorate (Summerfield & Tsetsos, 2015 ).

«96» Extensionally appropriate models, as proposed in Paragraph 94, allow deploying (logically) sound tools for reasoning, particularly those based on solid mathematical theories [such as Logic Programming (Evans et al., 2021 )]. Despite the essential differences separating the two approaches, retrieving ideas from KRR for BR (and its application in XAI) is very appealing. In particular, the construction of bridges between the two becomes a promising research line. There are modeling proposals for some BR techniques, such as (Lipman, 1999 ), which are axiomatic in nature. Another example is Glazer and Rubinstein's proposal (Glazer & Rubinstein, 2012 ) of a logic-oriented formulation of a persuasion model. The latter can be adapted by identifying the listener with the explainee and imposing a series of conditions that the speaker (explainer) must satisfy for the listener to be persuaded (i.e., to accept the explanation).
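
As a purely illustrative sketch of that adaptation (the data structure and the acceptance conditions below are hypothetical, not taken from Glazer and Rubinstein), the explainee can be modeled as accepting an explanation only when a set of explicit conditions on the explainer's offer is satisfied:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Explanation:
    explanans: List[str]        # premises offered by the explainer (speaker)
    explanandum: str            # what is being explained
    evidence: Dict[str, float]  # supporting observations and their strength

# Acceptance conditions play the role of the persuasion rules imposed on the speaker.
Condition = Callable[[Explanation], bool]

def persuaded(expl: Explanation, conditions: List[Condition]) -> bool:
    """The explainee (listener) accepts the explanation only if every condition holds."""
    return all(cond(expl) for cond in conditions)

# Toy conditions: a non-empty explanans and a minimal evidential support for every item.
conditions: List[Condition] = [
    lambda e: len(e.explanans) > 0,
    lambda e: min(e.evidence.values(), default=0.0) >= 0.5,
]

expl = Explanation(["sensor drift detected"], "the alarm was raised", {"drift score": 0.8})
print(persuaded(expl, conditions))  # True under these toy conditions
```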

7.1 Logics for the Inference Link in BR-Inspired Explaining

«97» So far, we have sketched the epistemological distance separating three fundamental elements: the scientific theory supporting the explaining, the explainee's literacy and preferences, and the phenomenon to be explained itself. In this section, we discuss how to bridge them through the inference link of the explanation, thinking of representations of inference closer to the explainee's; that is, reasoning mechanisms that provide successful real-world performance, that do not need to satisfy the requirements of rational inference (Pachur & Biele, 2007 ; Gigerenzer & Goldstein, 1996 ), and that are useful for Explaining.

«98» Human inference, in a general sense, is our essential tool for designing plans and actions to cushion the effects of a CS. If explainers are also human, the question arises as to whether these coping skills can be exploited to synthesize successful explanations and, moreover, whether it is possible to simulate explanations similar to those accepted by humans. The problem of simulating human explanations covers both the search for the explanation and its acceptability. Regarding the search, mechanisms such as the discovery of similarities, relations or associations, generalization, abstraction, intuition, and context-sensitivity (Duris, 2018 ) are involved. Likewise, the weakening of the requirements for accepting something as an explanation is one of the human flexibility skills. Approaching its formalization from the KL would thus be the first step.

«99» To guide the choice of an adequate logic for BR, we can adhere to Gabbay and Woods' Logic Limitation Rule (LLR), intended to prevent unlimited reasoning (Gabbay & Woods, 2003 ):

A logic is inappropriate for actual agents of type \(\tau\) to the extent to which factors which make for agency of type \(\tau\) are indiscernible in the behavior of the logic’s ideal agents.

According to the LLR, a logical formalism for working with realistic (bounded) agents will be inadequate if it does not induce properties that essentially distinguish agents designed under the new paradigm from UR-based agents. For example, to comply with the LLR, it is usual to mirror cognitive limitations through syntactic restrictions on the logic, which results in an effective limitation of both expressivity and reasoning. The strategy suffices if it actually affects the behavior of the agent by limiting the inferential process, as is the case for BR-based agents. Regardless of the logic supporting the particular BR technique, we should discuss how to read Explaining as a BR activity.
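
A minimal sketch of this strategy (our own toy illustration, not a formalism from the paper): a propositional forward-chaining reasoner whose closure computation is cut off after a fixed number of rounds, so that the bounded agent's observable behavior genuinely differs from that of an ideal, unbounded reasoner, as the LLR demands.

```python
from typing import List, Set, Tuple

Rule = Tuple[Set[str], str]  # (body: set of atoms, head: atom)

def bounded_closure(facts: Set[str], rules: List[Rule], max_rounds: int) -> Set[str]:
    """Propositional forward chaining limited to max_rounds iterations (the BR constraint)."""
    known = set(facts)
    for _ in range(max_rounds):
        new = {head for body, head in rules if body <= known and head not in known}
        if not new:
            break
        known |= new
    return known

rules: List[Rule] = [({"a"}, "b"), ({"b"}, "c"), ({"c"}, "d")]
print(bounded_closure({"a"}, rules, max_rounds=1))   # bounded agent: only 'b' is derived
print(bounded_closure({"a"}, rules, max_rounds=10))  # ideal agent: the full closure
```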

7.2 Explaining as BR Activity

«100» ELES is related to the sociotechnical realm where the CAIS is applied. This fact would be reflected within the explanandum through nomological components, for instance, through knowledge that has been considered as the laws of nature of the problem. In BR, the link should be even stronger, affecting the inference link as well. Simon claimed that the first consequence of BR is that the agent's intended rationality requires constructing a simplified model of the real situation in order to deal with it. The agent behaves rationally with respect to such a model, so this behavior is not even approximately optimal for the real world. Focusing on XAI, the question translates into whether the explanation is acceptable according to the simplified model, or whether it is even just an explanation .

«101» Original Simon’s BR approach can be applied to the task of obtaining acceptable explanations. The problem to be solved would be to explain -or even convince as required in ELES- the decision/observation, working with notions as satisfactory, sufficient or convincing explanation. That is the idea of conformity, meaning that the explanation given is sufficient according to some criteria. The explanation would preserve the basic structure from the KRR point of view (explanans, inference link, explanandum) and should base on consensual knowledge by both agents (that comes from observation beside the knowledge shared by both, eg. laws of nature in the KRR sense discussed above). From the BR approach, one has to prevent the use of abstract models or AI optimization techniques where the solution is reliable if it has the whole universe. For instance, a BR-inspired explanation should (implicitly or explicitly) contain data description and the inference processes used by the explainer, since this circumscribes the context where agents worked.

«102» Explanation synthesis under BR will have specific characteristics. To study the feasibility of an ideal provably optimal explainer agent , the following tasks must be carried out, which come from a refinement of the theoretical framework designed by Lewis et al. ( 2014 ). Firstly, specify the environmental properties in which the explanation will be built. Secondly, design the utility function on behaviours (which should consider factors that influence the explainee's acceptance of, and confidence in, the explanation itself). Thirdly, specify the type of representation and processing models that will be used. Lastly, the model must be constructible (according to bounded-agent guidelines).

«103» Instantiating the ideas of Lewis et al. ( 2014 ), different theoretical scenarios in which the explainer may work can be distinguished, according to different BR constraints. Optimality explanations represent those produced by an explainer with no (machine) limitations. In Ecological-optimality explanations , the environment in which actions are decided is governed by a given distribution, but there are no limitations on information processing; that is, some distribution inherent to the input data is consensual between explainer and explainee, but no bounds are imposed on processing. In Bounded-optimality explanations there is some limitation on information processing, which reduces the repertoire of accessible solutions, and hence the associated explanations, the policies . Lastly, in Ecological-bounded-optimality explanations both the policy space and information processing are constrained. Acceptability would depend on whether the expected behaviour resulting from the analysis corresponds to the observed behaviour.
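
A schematic reading of these scenarios (our own encoding, not the authors' or Lewis et al.'s formalization) treats each one as a combination of two constraints: whether an environment distribution is fixed and consensual, and whether information processing is bounded.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ExplainerScenario:
    name: str
    environment_distribution_fixed: bool  # the "ecological" constraint
    processing_bounded: bool              # the "bounded" constraint

SCENARIOS: List[ExplainerScenario] = [
    ExplainerScenario("Optimality", False, False),
    ExplainerScenario("Ecological-optimality", True, False),
    ExplainerScenario("Bounded-optimality", False, True),
    ExplainerScenario("Ecological-bounded-optimality", True, True),
]

for s in SCENARIOS:
    print(f"{s.name}: ecological={s.environment_distribution_fixed}, "
          f"bounded processing={s.processing_bounded}")
```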

«104» A phenomenological factor tied to the accessibility of environmental information should also be considered, mainly the limitation of observation and/or classification of the event itself. The role of the inherent limitations of perception has already been commented on (Paragraph 14). If the explainer observes abundant information, shortcomings will naturally be imposed (giving rise to curation and perspectivism practices). These will also affect the efficient codification of information relevant to decision-making, which in turn affects the choice of the best action or strategy (Summerfield & Tsetsos, 2015 ). This would complement Lewis et al.'s framework by including the inherent limitations of perception itself in Explaining.

7.2.1 Variety, BR and Ecological Rationality

«105» The psychological factors sketched in Sect. 2 already justify the need to consider human (even Human-Computer Interaction, HCI) factors. They are not necessarily linked to the usefulness/goodness of the explanation; rather, they would be related to evaluating the utility or acceptability of BR-based explanations. One of them lies in the fact that an example set may be accepted as justification by the explainee if it offers variety . For instance, an example collection covering very different situations (Landes, 2020 ), with some appearance of completeness, would enhance the explanation. Our preference for diversity, associated with the perceived completeness of the case set, can play in favor of the explanation's acceptance.
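
As an illustration of how an explainer could assemble a varied set of exemplary cases, the following is a minimal sketch using a standard greedy max-min (farthest-point) selection; the feature vectors and the choice of Euclidean distance are assumptions, not prescribed by the paper.

```python
import numpy as np

def diverse_examples(cases: np.ndarray, k: int) -> list:
    """Greedy max-min (farthest-point) selection of k case indices."""
    chosen = [0]                                   # start from an arbitrary case
    dists = np.linalg.norm(cases - cases[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))                # the case farthest from those already chosen
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(cases - cases[nxt], axis=1))
    return chosen

rng = np.random.default_rng(0)
cases = rng.normal(size=(100, 4))                  # toy feature vectors of candidate examples
print(diverse_examples(cases, k=5))                # indices of a deliberately varied example set
```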

«106» Nevertheless, the variety requirement hides a balance problem: more diversity in exemplary cases requires more computational resources, or more knowledge about the environment than the explainer has or can recover. This need is, in practice, opposed to an intriguing phenomenon in Ecological Rationality (ER) (Goldstein & Gigerenzer, 2002 ): how can more knowledge be no better, or even worse, than significantly less knowledge? ER is a particular case of BR practices that contrasts with the classical notion in the social and behavioral sciences, such as economics and psychology. The theory of rational choice holds that practical rationality consists of making decisions according to fixed rules, regardless of the context. In contrast, ER asserts that rationality is essentially context-based. Studies on ER show how humans use what they know in an environment under limited resources, and they focus on the match between a heuristic and the structure of the information in a particular environment. Whilst one of the priorities of Rational Choice Theory is internal logical consistency, ER focuses on (external) performance in the world. This aspect moves this conception further away from any notion of logical validity (in the KL sense). We could view ER as the BR counterpart of the theory of situated agents (Suchman, 1987 ), where explaining under variety constraints would be an ER-based practice. Understanding an expert's behaviour in the presence of data, an ER topic, helps to support BR-inspired explanations.

«107» Epistemological variety represents a psychological factor (and a prospective object of study in BR) that strengthens confidence in the explained hypothesis (Landes, 2020 ). The variety of tests that can be considered in a BR model for Explaining can be grouped into two levels: the variety of explanations in a given context, and the variety of contexts that validate the explanation. Such variety would also be affected by BR-based selection techniques; we do not discuss this topic here. According to Landes ( 2020 ), the solution may not be sound in general and must be handled with care.

«108» Analogous features have to be considered for the inference link. Psychological research on heuristics in human inference reveals a compendium of skills, such as the Recognition Heuristic (Goldstein & Gigerenzer, 2002 ; Todd, 2007 ), whose success the classic (computational) logic paradigm cannot usefully explain. Hopefully, these skills could be both usable and acceptable in the explanation process. For example, the idea of applying BR techniques to tame the CS has already been considered. This is done by analyzing the expert's behaviour and reflecting on the process itself; from there, the selection of attributes/characteristics in the decision-making process is justified. For example, in the management of electrical networks (e.g., smartgrids ), imitating the behavior of engineers in current management is being considered (National Academies of Sciences, Engineering, and Medicine, 2016 ), an ER activity. One of the techniques is the so-called Principle of effective simplicity (from BR): experts can select a relatively small number of variables and observations to diagnose, explain, predict, and make decisions. An adequate modeling of the principle could speed up these activities, find useful explanations in the future, and even automate them. However, the main limitation of the explanation lies in the fact that its support is strongly human-dependent; it is a sort of argument from authority due to the incorporation of expert pragmatics into the explanans (Vassiliades et al., 2021 ).
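
A minimal sketch of the principle of effective simplicity (illustrative only; the synthetic data, the mutual-information criterion, and the choice of three variables are assumptions): out of many candidate variables, keep the handful that carry most of the information about the quantity to be diagnosed or explained, mimicking the expert's restriction to a few observations.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                            # 20 candidate grid variables
y = 2.0 * X[:, 3] - X[:, 7] + 0.1 * rng.normal(size=500)  # target actually driven by two of them

scores = mutual_info_regression(X, y, random_state=0)
top = np.argsort(scores)[::-1][:3]                        # the "relatively small number" of variables
print("variables an effectively simple explanation would rely on:", top)
```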

7.2.2 Fast-and-Frugal Techniques for Explaining

«109» One of the BR challenges is modeling the human expertise of selecting one or two causes, from a sometimes infinite number of them, to build the explanation (Miller, 2019 ). Similarly, explanations are selected (in a biased manner) based on the idea that people do not usually expect a complete and faithful causal explanation.

«110» Fast-and-Frugal (FaF) methods (Gigerenzer & Goldstein, 1996 ) specify how information is searched ( search rule ), when the information search ends ( stop rule ), and how the processed information is integrated into a decision ( decision rule ). These approaches work soundly thanks to their simplicity, and they provide regularity in the face of the heterogeneity of the available data. FaF techniques produce explanations that can benefit from tools to model and evaluate the strategy followed (Phillips et al., 2017 ), selected from the adaptive toolbox , in order to design transparent assistance systems for decision-making (Raab & Gigerenzer, 2015 ). In Table 8 some FaF techniques are adapted for use in Explaining. On the negative side, explanation strategies based on FaF heuristics run the risk of falling into Cherry Picking , False Causality or Sampling Bias fallacies, all of them related to the initial constraints imposed in FaF.
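
A minimal sketch of a fast-and-frugal strategy used for Explaining (the cues, thresholds, and decisions below are hypothetical): cues are inspected in a fixed order (search rule), inspection stops at the first cue that fires (stop rule), and that single cue yields both the decision and a one-reason explanation (decision rule).

```python
from typing import Dict, List, Tuple

Cue = Tuple[str, float, str]  # (cue name, threshold, decision if the cue fires)

# Hypothetical cue hierarchy for a grid-operation assistant, ordered by assumed validity.
FFT: List[Cue] = [
    ("load_imbalance", 0.8, "shed load"),
    ("line_temperature", 90.0, "reroute power"),
    ("voltage_drop", 0.15, "dispatch maintenance"),
]

def fft_decide(observation: Dict[str, float]) -> Tuple[str, str]:
    """Return a decision plus the single reason (one cue) that produced it."""
    for name, threshold, decision in FFT:        # search rule: inspect cues in a fixed order
        if observation[name] > threshold:        # stop rule: stop at the first cue that fires
            return decision, f"{name}={observation[name]} exceeds {threshold}"   # decision rule
    return "no action", "no cue exceeded its threshold"

decision, reason = fft_decide({"load_imbalance": 0.6, "line_temperature": 95.0, "voltage_drop": 0.02})
print(decision, "because", reason)               # a one-reason decision and its explanation
```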

7.2.3 Generalization by Abstraction

«111» The generalization of explanations (for their reusability) depends on the availability of more data from multiple sources, which also allows the development of richer models and a greater understanding. However, when more data are available and curation is absent or deficient, models can become more complex and too detailed to be understandable by the explainee.

«112» Another risk for the reusability of explaining models that comes from (BR-based) data curation is that data bias may lead to the inability to replicate studies of similar problems. This inability undermines the acceptability of the explanation itself (Janssen & Kuk, 2016 ). A solution could be generalization, although the level of abstraction of the explanation can condition it. Premises or conclusions that are too abstract or general could compromise both the explainee's actual understanding and the explanation's practical value. Abstractions may simplify explanations, but discovering sound abstractions (as well as sharing their understanding) is very challenging (Gunning et al., 2019 ). Such difficulties could lead to a greater gap between scientific rigor and practical relevance. We could claim that BR and generalization can play opposite roles in the explanation of CAIS behavior.

7.3 Perspectivism and Curation as the Basis for BR-Based Explaining Strategies

«113» It has been emphasized that the agent's understanding of the environment is a key factor in BR approaches such as ER. The selection of features and the available background knowledge about the environment lead to working within a particular context to build the explanation. What is more, the application in Explaining of BR techniques such as contextual selection or effective simplicity leads us to consider that the explanations coming from them are also perspectival in this sense. By framing a context/perspective (induced by the understanding of the phenomena), it can be assumed that the mental space where the agent's reasoning occurs is circumscribed to it. Among other consequences, explainers will select from the available data those that their scientific training indicates are causal , employing BR skills (possibly unconsciously) conditioned by the perspective. The advantage of achieving a consensual perspective lies in its status as an ontological commitment about the information ecosystem, which strengthens the explanation and the results and favors their acceptance (because it facilitates internalization, Paragraph 24). However, as already mentioned, there exists the temptation to shape the perspective by explaining through statistical-computational relationships and principles alone (a sort of extreme empiricism), by means of estimations, bounds, thresholds, etc. The adoption of BR practices increases the risk of perspectives emerging according to non-explicit principles.

«114» An example of the sound (techno-)perspectivism we refer to is the explanation of the so-called Arab Spring (2010–2013) by the Western media as a social movement claiming social and political rights [admitting other political factors (Korotayev, 2014 )]. Wikipedia presents a concrete incident in Tunisia as the spark for the mobilizations (see https://en.wikipedia.org/wiki/Arab_Spring ). One could ask whether this interpretation is not too tight a corset. The thesis is the product of a perspective taken by the analysts (perhaps based on political wishful thinking). There should exist a spark if the system is in an unstable state, but it might not be the cause, even if we admit it as a causal explanation. Lagi et al. ( 2011 ), through an analysis of the available data, propose another (contributing?) cause. Their analysis shows that the timing of the violent protests in North Africa and the Middle East in 2011 (as well as the earlier riots in 2008) coincides with large increases in the global prices of the basic foodstuffs of the most vulnerable populations. They even provide an estimate of the price threshold above which riots break out. The example clearly shows that data curation by experts, rather than massive data analysis, is necessary to provide acceptable explanations, and also how these can be confronted with, or need to be reconciled with, others supported by Social Science. It is also an example of the perspectival application of Data Science to produce alternative explanations.

8 Conclusions and Future Work

«115» This work reflects the authors' conceptual journey from AI to a framework in which XAI is observed as multidisciplinary in nature. We have discussed the need to adopt particular viewpoints within XAI on two problems: XAI practices within the Data Science (and BD) universe, and the peremptory need to transmit the explanation of (and even trust in) the CAIS to the stakeholder.

«116» Regarding the first problem, the risk of an ahistorical research and development of XAI has been pointed out. The issue is worrying when DS teams cling to extreme empiricism and fall into the temptation of working without (scientific) models of the reality they study; this is particularly problematic when the issues they deal with are sensitive for citizens.

«117» The second problem is intimately linked to the proper use of CAIS as a decision-making assistant, but also as a tool for monitoring or managing CS issues. Our thesis (formed from our standpoint as researchers in KRR-based AI) is that it is necessary to incorporate the astonishing corpus on Explaining from the Philosophy of Science and Technology. The claim is not limited to general principles; it also covers their use to drive the implementation of new technology for XAI. Concerning this issue, some guidelines have been outlined for the case of exploiting BR techniques in XAI. This is of interest for ELES, which represents a socio-technical system in which the explaining challenge can become a problem rooted in several issues (for instance, the stakeholder's literacy).

«118» The development of XAI is hampered by the incipient and ongoing problems that the widespread use of CAIS is causing in society. The urgent need for explanation (which frequently hides other needs, such as verification, validation, or certification) means that engineers do not have the time to devote the effort needed to achieve actual interdisciplinarity in XAI. The authors hope the paper can convince AI colleagues that purely technological development can be fast but suffers from real shortcomings (affecting usefulness and safety) that are also rooted in foundational issues, not only in purely pragmatic ones.

«119» The suitability of some formal and philosophical conditions under which BR ideas can be applied in XAI has been investigated. This issue has been treated in a general way, emphasizing its philosophical, computational, and particularly AI dimensions in the fields of DS and CS. We have focused on DS socio-technical systems, in contrast to other studies more focused on computational aspects of the decision itself (cf. Främling ( 2020 )). Due to its socio-technical complexity, ELES is a paradigmatic case: the systems and the engineers could not offer an adequate explanation to stakeholders, and it may even be doubtful that the technical explanations actually correspond to what happened, due to perspectival principles. Moreover, we have also analyzed the convenience of considering BR techniques to synthesize explanations that may be acceptable to the explainee, although they may suffer from deficiencies derived from BR itself.

«120» Likewise, the relationship between explainability and replicability has not been discussed in depth, even though we recognize that the latter represents a good option for achieving explanation acceptance (Guttinger, 2020 ). We could claim that the explanation of CAIS outcomes for managing CS could mirror some features of the reproducibility crisis, which are becoming more common in modern Physics. However, identifying the fields to which the replicability standard applies (or not) is a challenge. Guttinger ( 2020 ) argues that (at least) three different aspects of scientific practice could be used to properly answer this question: the type of questions addressed, the setup used, and the nature of the objects analyzed. From the analysis of CS and the nature of the concept of an acceptable explanation for the stakeholder , we can conclude that XAI for working with CS seems to fall into that grey zone of research practices where there might not be a clear answer to the replicability issue. A case-by-case analysis might be the only sensible way forward, in the same vein as Guttinger ( 2020 ). Also due to space limitations, an analysis of the status of emergence-based explanation for CS has been avoided. Techniques from Agent-Based Modeling can be combined with the macro-vision provided by proposals that exploit its epistemological nature [see, e.g., our papers Aranda-Corral et al. ( 2013a , 2013b , 2018 )]. This is a promising topic for further research.

«121» Finally, another long-term goal to be tackled is the design of an ontology of the analytical elements that precisely define the notions associated with the limitations of the agents involved in XAI. It should include concepts such as goals, behaviors, and the different ecological and evaluation environments (Lewis et al., 2014 ). Understanding explaining as a BR task, any XAI practice of this kind would be representable by specifying the elements playing a relevant role in the case of (bounded) optimal explainer agents. In this way, external agents could contextualize explanations produced in a particular socio-technical system.

«122» In addition, we think that a proper reading of critical works on Data Management practices in particular fields [e.g., Leonelli ( 2016 )] can provide very useful ideas for understanding the data curation process. Lastly, there also exists the possibility of implementing some of these ideas in software for XAI in Data Science. By asking what can be learnt from these practices in Data Science, one could extract those that overcome the epistemic losses that data curation can cause.

See Dick ( 2015 ) for more information on the history and discussions about the origins and difficulties of implementing BR for the first rational agents, and the consideration of heuristics.

https://www.ntsb.gov/news/press-releases/Pages/NR20200211.aspx .

See the link: Data shows Tesla owner experienced repeated glitch days before deadly 2018 crash

AA, V. (2015). The Field Guide to Data Science (2nd ed.). Booz Allen Hamilton.

Addis, T. (2014). Natural and artificial reasoning—an exploration of modelling human thinking. Advanced information and knowledge processing . Springer.

Alonso-Jiménez, J. A., Borrego-Díaz, J., Chávez-González, A. M., & Martín-Mateos, F. J. (2006). Foundational challenges in automated semantic web data and ontology cleaning. IEEE Intelligent Systems, 21 (1), 42–52.

Alrøe, H. F., & Noe, E. (2014). Second-order science of interdisciplinary research: A polyocular framework for wicked problems. Constructivist Foundations, 10 (1), 65–76.

Anderson, C. (2008). The petabyte age: Because more isn’t just more—more is different. Retrieved from http://www.wired.com/2008/06/pb-intro/ .

Anderson, J. D., & Perona, P. (2014). Toward a science of computational ethology. Neuron, 84 (1), 18–31.

Aranda-Corral, G. A. & Borrego-Díaz, J. (2010). Mereotopological analysis of formal concepts in security ontologies. In Herrero, Á., Corchado, E., Redondo, C., & Alonso, Á (Eds.), Computational Intelligence in Security for Information Systems 2010—Proceedings of the 3rd International Conference on Computational Intelligence in Security for Information Systems (CISIS’10), León, Spain, November 11–12, 2010 , Vol. 85 of Advances in Intelligent and Soft Computing (pp. 33–40). Springer.

Aranda-Corral, G. A., Borrego-Díaz, J., & Galán-Páez, J. (2013a). Qualitative reasoning on complex systems from observations. In Hybrid Artificial Intelligent Systems (pp. 202–211). Springer .

Aranda-Corral, G. A., Borrego-Díaz, J., & Giráldez-Cru, J. (2013b). Agent-mediated shared conceptualizations in tagging services. Multimedia Tools Applications, 65 (1), 5–28.

Aranda-Corral, G. A., Borrego-Díaz, J., & Galán-Páez, J. (2018). Synthetizing qualitative (logical) patterns for pedestrian simulation from data. In Bi, Y., Kapoor, S., & Bhatia, R., (Eds.), Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016 (pp. 243–260). Springer.

Araujo, T., Helberger, N., Kruikemeier, S., & Vreese, C. H. D. (forthcoming). In AI we trust? perceptions about automated decision-making by artificial intelligence. AI and Society 1–13.

Aronson, D. R. (2011). The illusory validity of subjective technical analysis, chapter 2 (pp. 33–101). Wiley.

Awad, E., Dsouza, S., Bonnefon, J.-F., Shariff, A., & Rahwan, I. (2020). Crowdsourcing moral machines. Communications of ACM, 63 (3), 48–55.

Barberis, S. D. (2012). Un análisis crítico de la concepción mecanicista de la explicación. Revista Latinoamericana de Filosofia, 38 (2), 233–265.

Barrowman, N. (2014). Correlation, causation, and confusion. The New Atlantis, 1 (43), 23–44.

van Fraassen, B. C. (1980). The Scientific Image . Oxford University Press.

Biewald, L. (2016). The machine learning problem of the next decade . Retrieved from https://www.computerworld.com/article/3023708/the-machine-learning-problem-of-the-next-decade.html .

Booth, S., Muise, C., & Shah, J. (2019). Evaluating the interpretability of the knowledge compilation map: Communicating logical statements effectively. In Kraus, S., (Eds.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10–16, 2019 (pp. 5801–5807).

Borenstein, J., Herkert, J. R., & Miller, K. W. (2019). Self-driving cars and engineering ethics: The need for a system level analysis. Science and Engineering Ethics, 25 (2), 383–398.

Borrego-Díaz, J., & Chávez-González, A. M. (2006). Visual ontology cleaning: Cognitive principles and applicability. Lecture Notes in Computer Science. In Y. Sure & J. Domingue (Eds.), The Semantic Web: Research and Applications, 3rd European Semantic Web Conference, ESWC 2006, Budva, Montenegro, June 11–14, 2006, Proceedings (Vol. 4011, pp. 317–331). Springer.

Borrego-Díaz, J., & Páez, J. G. (2022). Knowledge representation for explainable artificial intelligence. Complex & Intelligent Systems 1–23.

Bradshaw, J. M., Hoffman, R. R., Woods, D. D., & Johnson, M. (2013). The seven deadly myths of autonomous systems. IEEE Intelligent Systems, 28 (3), 54–61.

Cambria, E., Schuller, B., Xia, Y., & Havasi, C. (2013). New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, 28 (2), 15–21.

Craver, C. (2007). Explaining the Brain: Mechanisms and the Mosaic Unity of Neuroscience. Oxford scholarship online: Philosophy module . Clarendon Press.

Craver, C. F. (2006). When mechanistic models explain. Synthese, 153 (3), 355–376.

Cugueró-Escofet, N., & Rosanas-Martí, J. (2019). Trust under bounded rationality: Competence, value systems, unselfishness and the development of virtue. Intangible Capital, 15, 1–21.

Darden, L. (2006). Reasoning in biological discoveries: Essays on mechanisms, interfield relations, and anomaly resolution . Cambridge Studies in Philosophy and Biology. Cambridge University Press.

DARPA. (2016). Explainable Artificial Intelligence (XAI) Program . Defense Advanced Research Projects Agency: Technical report.

Davis, R., Shrobe, H., & Szolovits, P. (1993). What is a knowledge representation? AI Magazine, 14 (1), 17.

de Fine Licht, K., & de Fine Licht, J. (2020). Artificial intelligence, transparency, and public decision-making. AI Society, 35 (4), 917–926.

de Regt, H. (2017). Understanding Scientific Understanding . Oxford Studies in Philosophy of Science. Oxford University Press.

Dick, S. (2015). Of models and machines: Implementing bounded rationality. Isis, 106 (3), 623–634.

Díez, J. (2014). Scientific w-explanation as ampliative, specialized embedding: A neo-hempelian account. Erkenntnis, 79 (S8), 1413–1443.

Dimitrijević, D. R. (2019). Causal closure of the physical, mental causation, and physics. European Journal for Philosophy of Science, 10 (1), 1.

Doran, D., Schulz, S., & Besold, T. R. (2017). What does explainable AI really mean? A new conceptualization of perspectives. In T. R. Besold, & O. Kutz, (Eds.), Proc. First Int. Workshop on Comprehensibility and Explanation in AI and ML , Volume 2071 of CEUR Workshop Proceedings (pp. 1–8). CEUR-WS.org.

Dudai, Y., & Evers, K. (2014). To simulate or not to simulate: What are the questions? Neuron, 84 (2), 254–261.

Duris, F. (2018). Arguments for the effectiveness of human problem solving. Biologically Inspired Cognitive Architectures, 24, 31–34.

Evans, R., Bošnjak, M., Buesing, L., Ellis, K., Pfau, D., Kohli, P., & Sergot, M. (2021). Making sense of raw input. Artificial Intelligence, 299, 103521.

Fellous, J.-M., Sapiro, G., Rossi, A., Mayberg, H., & Ferrante, M. (2019). Explainable artificial intelligence for neuroscience: Behavioral neurostimulation. Frontiers in Neuroscience, 13, 1346.

Findl, J., & Suárez, J. (2021). Descriptive understanding and prediction in Covid-19 modelling. History and Philosophy of the Life Sciences, 43 (4), 1–31.

Forrester, A. I. J., Sobester, A., & Keane, A. J. (2008). Engineering design via surrogate modelling—a practical guide . Wiley.

Främling, K. (2020). Decision theory meets explainable AI. In D. Calvaresi, A. Najjar, M. Winikoff, & K. Främling (Eds.), Explainable, transparent autonomous agents and multi-agent systems (pp. 57–74). Springer.

Füllsack, M. (2014). The circular conditions of second-order science sporadically illustrated with agent-based experiments at the roots of observation. Constructivist Foundations, 10 (1), 46–54.

Gabbay, D. M., & Woods, J. (2003). Chapter 3—logic as a description of a logical agent. In D. M. Gabbay & J. Woods (Eds.), Agenda Relevance, Volume 1 of A Practical Logic of Cognitive Systems (pp. 41–68). Elsevier.

Gajdoš, P., & Snášel, V. (2014). A new FCA algorithm enabling analyzing of complex and dynamic data sets. Soft Computing, 18 (4), 683–694.

Gerbaudo, P. (2020). From data analytics to data hermeneutics. Online political discussions, digital methods and the continuing relevance of interpretative approaches. Digital Culture & Society, 2 (2), 95–112.

Gigerenzer, G., & Goldstein, D. G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103 (4), 650–669.

Gigerenzer, G., Martignon, L., Hoffrage, U., Rieskamp, J., Czerlinski, J., & Goldstein, D. G. (2008). One-reason decision making, Chapter 108 , (Vol. 1, pp. 1004–1017). Elsevier.

Gigerenzer, G., & Selten, R. (2002). Bounded rationality: The adaptive toolbox . MIT Press.

Giráldez-Cru, J., & Levy, J. (2016). Generating SAT instances with community structure. Artificial Intelligence, 238, 119–134.

Glazer, J., & Rubinstein, A. (2012). A model of persuasion with boundedly rational agents. Journal of Political Economy, 120 (6), 1057–1082.

Goebel, R., Chander, A., Holzinger, K., Lecue, F., Akata, Z., Stumpf, S., et al. (2018). Explainable AI: The new 42? In A. Holzinger, P. Kieseberg, A. M. Tjoa, & E. Weippl (Eds.), Machine learning and knowledge extraction (pp. 295–303). Springer.

Goldstein, D., & Gigerenzer, G. (2002). Models of ecological rationality: The recognition heuristic. Psychological Review, 109, 75–90.

Graham, P. J. (2010). Theorizing justification. In Knowledge and skepticism (pp. 45–71). MIT Press.

Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., & Yang, G.-Z. (2019). Xai-explainable artificial intelligence. Science Robotics, 4 (37), 7120.

Guttinger, S. (2020). The limits of replicability. European Journal for Philosophy of Science, 10 (2), 10.

Hedström, P., & Ylikoski, P. (2010). Causal mechanisms in the social sciences. Annual Review of Sociology, 36 (1), 49–67.

Hempel, C. (1970). Aspects of scientific explanation: And other essays in the philosophy of science . Number v. 2 in Aspects of Scientific Explanation: And Other Essays in the Philosophy of Science. Free Press.

Hernandez, J., & Ortega, R. (2019). Bounded rationality in decision-making. MOJ Research Review, 2 (1), 1–8.

Hinsen, K. (2014). Computational science: Shifting the focus from tools to models. F1000Research, 3 (101), 1–15.

Hofman, J., Watts, D. J., Athey, S., Garip, F., Griffiths, T. L., Kleinberg, J., et al. (2021). Integrating explanation and prediction in computational social science. Nature, 595 (7866), 181–188.

Huneman, P. (2018). Outlines of a theory of structural explanations. Philosophical Studies, 175 (3), 665–702.

Hung, T. & Yen, C. (2020). On the person-based predictive policing of AI. Ethics and Information Technology .

Hvoreckỳ, J., Šimúth, J., & Lichardus, B. (2013). Managing rational and not-fully-rational knowledge. Acta Polytechnica Hungarica, 10 (2), 121–132.

Ihde, D. (2010). Heidegger’s technologies: Postphenomenological perspectives . Fordham University Press.

Janssen, M., Hartog, M., Matheus, R., Ding, A. Y., & Kuk, G. (2021). Will algorithms blind people? The effect of explainable AI and decision-makers’ experience on AI-supported decision-making in government. Social Science Computer Review , 0894439320980118.

Janssen, M., & Kuk, G. (2016). Big and open linked data (bold) in research, policy, and practice. Journal of Organizational Computing and Electronic Commerce, 26 (1–2), 3–13.

Jarke, J., & Macgilchrist, F. (2021). Dashboard stories: How narratives told by predictive analytics reconfigure roles, risk and sociality in education. Big Data & Society, 8 (1), 20539517211025560.

Johnson, N. F., Restrepo, E. M., & Johnson, D. E. (2015). Modeling human conflict and terrorism across geographic scales, Chapter 10 (pp. 209–233). Springer.

Jones, A. J., Artikis, A., & Pitt, J. (2013). The design of intelligent socio-technical systems. Artificial Intelligence Review, 39 (1), 5–20.

Kim, J. (2005). Physicalism, or something near enough . Princeton University Press.

King, M. (2020). Explanations and candidate explanations in physics. European Journal for Philosophy of Science, 10 (1), 7.

Klein, G., Moon, B., & Hoffman, R. (2006). Making sense of sensemaking 2: A macrocognitive model. IEEE Intelligent Systems, 21, 88–92.

Kliegr, T., Bahník, Štěpán, & Fürnkranz, J. (2021). A review of possible effects of cognitive biases on interpretation of rule-based machine learning models. Artificial Intelligence, 295, 103458.

Koehler, D. (1991). Explanation, imagination, and confidence in judgment. Psychological Bulletin, 110, 499–519.

Korotayev, A. (2014). The Arab spring: A quantitative analysis. Arab Studies Quarterly, 36, 149–169.

Kridalukmana, R., Lu, H. Y., & Naderpour, M. (2020). A supportive situation awareness model for human-autonomy teaming in collaborative driving. Theoretical Issues in Ergonomics Science , 1–26.

Kroes, P., Franssen, M., Poel, I., & Ottens, M. (2006). Treating socio-technical systems as engineering systems: Some conceptual problems. Systems Research and Behavioral Science, 23, 803–814.

Kroes, P., & Verbeek, P. (2014). The moral status of technical artefacts. Philosophy of Engineering and Technology . Springer.

Lagi, M., Bertrand, K. Z., & Bar-Yam, Y. (2011). The food crises and political instability in North Africa and the middle east. SSRN, 20 (1), 1–15.

Landes, J. (2020). Variety of evidence and the elimination of hypotheses. European Journal for Philosophy of Science, 10 (2), 12.

Leonelli, S. (2016). Data-centric biology: A philosophical study . University of Chicago Press.

Lewis, R. L., Howes, A. D., & Singh, S. (2014). Computational rationality: Linking mechanism and behavior through bounded utility maximization. Topics in Cognitive Science, 6 (2), 279–311.

Li, J., & Liu, H. (2017). Challenges of feature selection for big data analytics. IEEE Intelligent Systems, 32 (2), 9–15.

Lieto, A., Lebiere, C., & Oltramari, A. (2018). The knowledge level in cognitive architectures: Current limitations and possible developments. Cognitive Systems Research, 48, 39–55.

Lipman, B. L. (1999). Decision theory without logical omniscience: Toward an axiomatic framework for bounded rationality. The Review of Economic Studies, 66 (2), 339–361.

Lipton, Z. C. (2018). The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16 (3), 31–57.

Lombrozo, T. (2007). Simplicity and probability in causal explanation. Cognitive Psychology, 55 (3), 232–257.

Longino, H. E. (2006). Theoretical pluralism and the scientific study of behavior, Chapter 6 (Vol. 19, pp. 102–131). University of Minnesota Press.

Lundberg, S. M. & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems , NIPS’17 (pp. 4768–4777). Curran Associates Inc.

Margolis, J. (1983). The logic and structures of fictional narrative. Philosophy and Literature, 7 (2), 162–181.

Meier, P. (2015). Digital humanitarians: How big data is changing the face of humanitarian response . CRC Press Inc.

Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 1–38.

Moreira, C. (2019). Unifying decision-making: A review on evolutionary theories on rationality and cognitive biases, Chapter 19 (pp. 235–248). Springer.

Müller, K. H., & Riegler, A. (2014). Second-order science: A vast and largely unexplored science frontier. Constructivist Foundations, 10 (1), 7–15.

National Academies of Sciences, Engineering, and Medicine. (2016). Refining the Concept of Scientific Inference When Working with Big Data: Proceedings of a Workshop—in Brief . The National Academies Press.

Newell, A. (1982). The knowledge level. Artificial Intelligence, 18 (1), 87–127.

Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2016). A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104 (1), 11–33.

Pachur, T., & Biele, G. (2007). Forecasting from ignorance: The use and usefulness of recognition in lay predictions of sports events. Acta Psychologica, 125 (1), 99–116.

Páez, A. (2009). Artificial explanations: The epistemological interpretation of explanation in AI. Synthese, 170 (1), 131–146.

Papineau, D. (2001). The rise of physicalism, Chapter 1 (pp. 3–36).

Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3 (none), 96–146.

Phillips, N., Neth, H., Woike, J., & Gaissmaier, W. (2017). Fftrees : A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making, 12, 344–368.

Pirolli, P. & Card, S. (2005). The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In Proceedings of International Conference on Intelligence Analysis (pp. 2–4).

Price, M., Walker, S., & Wiley, W. (2018). The machine beneath: Implications of artificial intelligence in strategic decision making. PRISM, 7 (4), 92–105.

Raab, M., & Gigerenzer, G. (2015). The power of simplicity: A fast-and-frugal heuristics approach to performance science. Frontiers in Psychology, 6, 1672.

Rago, A., Cocarascu, O., Bechlivanidis, C., Lagnado, D., & Toni, F. (2021). Argumentative explanations for interactive recommendations. Artificial Intelligence, 296, 103506.

Reutlinger, A. (2014). Why is there universal macrobehavior? Renormalization group explanation as non-causal explanation. Philosophy of Science, 81 (5), 1157–1170.

Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “why should i trust you?”: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , KDD ’16 (pp. 1135–1144). Association for Computing Machinery.

Ricker, B. (2017). Reflexivity, positionality and rigor in the context of big data research, Chapter 4 (pp. 96–118). University of Iowa Press.

Rittel, H. W. J., & Webber, M. M. (1973). Dilemmas in a general theory of planning. Policy Sciences, 4 (2), 155–169.

Riveiro, M., & Thill, S. (2021). “that’s (not) the output i expected!’’ on the role of end user expectations in creating explanations of AI systems. Artificial Intelligence, 298, 103507.

Romele, A., Severo, M., & Furia, P. (2020). Digital hermeneutics: From interpreting with machines to interpretational machines. AI and Society , 1–14.

Russell, S. J., & Norvig, P. (2003). Artificial Intelligence: A modern approach (2nd ed.). Pearson Education.

Russell, S. J., & Subramanian, D. (1995). Provably bounded-optimal agents. The Journal of Artificial Intelligence Research, 2 (1), 575–609.

Salmon, W., & Press, P. U. (1984). Scientific explanation and the causal structure of the world. LPE Limited Paperback Editions . Princeton University Press.

Schupbach, J. N. (2019). Conjunctive explanations and inference to the best explanation. Teorema: Revista Internacional de Filosofía, 38 (3), 143–162.

Simon, H. (1957a). A behavioural model of rational choice. In H. Simon (Ed.), Models of man: Social and rational; mathematical essays on rational human behavior in a social setting (pp. 241–260). Wiley.

Simon, H. A. (1957b). Models of Man: Social and rational: Mathematical essays on rational human behavior in a social setting . Garland Publishing, Incorporated: Continuity in Administrative Science. Ancestral Books in the Management of Organizations.

Slack, D., Hilgard, S., Singh, S., & Lakkaraju, H. (2021). Feature attributions and counterfactual explanations can be manipulated. CoRR .

Stepin, I., Alonso, J. M., Catala, A., & Pereira-Fariña, M. (2021). A survey of contrastive and counterfactual explanation generation methods for explainable artificial intelligence. IEEE Access, 9, 11974–12001.

Stern, L. (2005). Interpretive reasoning . Cornell University Press.

Subrahmanian, V. S., & Kumar, S. (2017). Predicting human behavior: The next frontiers. Science, 355 (6324), 489–489.

Suchman, L. A. (1987). Plans and situated actions: The problem of human-machine communication . Cambridge University Press.

Sullivan, E. (2019). Universality caused: The case of renormalization group explanation. European Journal for Philosophy of Science, 9 (3), 36.

Summerfield, C., & Tsetsos, K. (2015). Do humans make good decisions? Trends in Cognitive Sciences, 19 (1), 27–34.

Todd, P. M. (2007). How much information do we need? The European Journal of Operational Research, 177 (3), 1317–1332.

Townsend, J., Chaton, T., & Monteiro, J. M. (2019). Extracting relational explanations from deep neural networks: a survey from a neural-symbolic perspective. IEEE Transactions on Neural Networks and Learning Systems (pp. 1–15).

van de Poel, I. (2020). Embedding values in Artificial Intelligence (AI) systems. Minds and Machines .

van der Waa, J., Nieuwburg, E., Cremers, A. H. M., & Neerincx, M. A. (2021). Evaluating XAI: A comparison of rule-based and example-based explanations. Artificial Intelligence, 291, 103404.

Vassiliades, A., Bassiliades, N., & Patkos, T. (2021). Argumentation and explainable artificial intelligence: A survey. The Knowledge Engineering Review, 36, e5.

Wang, X., Wang, D., Xu, C., He, X., Cao, Y., & Chua, T. (2019). Explainable reasoning over knowledge graphs for recommendation. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019 (pp. 5329–5336). AAAI Press.

Weld, D. S. & Bansal, G. (2018). Intelligible artificial intelligence. CoRR .

Weld, D. S., & Bansal, G. (2019). The challenge of crafting intelligible intelligence. Communications of the ACM, 62 (6), 70–79.

Wong, J. C. (2020). Computational causal inference.

Woodward, J. (2019). Scientific explanation. In Zalta, E. N., (Eds.) The Stanford Encyclopedia of Philosophy . Metaphysics Research Lab, Stanford University, winter 2019 edition.

Acknowledgements

This work is supported by Spanish State Investigation Agency (Agencia Estatal de Investigación), Project PID2019-109152GB-I00/AEI/10.13039/501100011033. We are very grateful to the reviewers for their suggestions, and for guidance on additional references that have enriched our work.

Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Author information

Authors and Affiliations

Departamento de Ciencias de la Computación e Inteligencia Artificial, E.T.S. Ingeniería Informática – Universidad de Sevilla, Avda. Reina Mercedes s.n., 41013, Sevilla, Spain

Joaquín Borrego-Díaz & Juan Galán-Páez

Corresponding author

Correspondence to Joaquín Borrego-Díaz .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Borrego-Díaz, J., Galán-Páez, J. Explainable Artificial Intelligence in Data Science. Minds & Machines 32 , 485–531 (2022). https://doi.org/10.1007/s11023-022-09603-z

Download citation

Received : 25 October 2021

Accepted : 17 April 2022

Published : 12 May 2022

Issue Date : September 2022

DOI : https://doi.org/10.1007/s11023-022-09603-z

Keywords

  • Explainable Artificial Intelligence
  • Data science
  • Complex systems
  • Bounded rationality
  • Symbolic Artificial Intelligence
https://www.nist.gov/artificial-intelligence

Artificial intelligence

NIST aims to cultivate trust in the design, development, use and governance of Artificial Intelligence (AI) technologies and systems in ways that enhance safety and security and improve quality of life. NIST focuses on improving measurement science, technology, standards and related tools — including evaluation and data.

With AI and Machine Learning (ML) changing how society addresses challenges and opportunities, the trustworthiness of AI technologies is critical. Trustworthy AI systems are those demonstrated to be valid and reliable; safe, secure and resilient; accountable and transparent; explainable and interpretable; privacy-enhanced; and fair with harmful bias managed. The agency’s AI goals and activities are driven by its statutory mandates, Presidential Executive Orders and policies, and the needs expressed by U.S. industry, the global research community, other federal agencies, and civil society.

NIST’s AI goals include:

  • Conduct fundamental research to advance trustworthy AI technologies.
  • Apply AI research and innovation across the NIST Laboratory Programs.
  • Establish benchmarks, data and metrics to evaluate AI technologies.
  • Lead and participate in development of technical AI standards.
  • Contribute technical expertise to discussions and development of AI policies.

NIST’s AI efforts fall in several categories:

Fundamental AI Research

NIST’s AI portfolio includes fundamental research to advance the development of AI technologies, including software, hardware, architectures, and the ways humans interact with AI technology and AI-generated information.

Applied AI Research

AI approaches are increasingly an essential component in new research. NIST scientists and engineers use various machine learning and AI tools to gain a deeper understanding of and insight into their research. At the same time, NIST laboratory experiences with AI are leading to a better understanding of AI’s capabilities and limitations.

Test, Evaluation, Validation, and Verification (TEVV)

With a long history of working with the community to advance tools, standards and test beds, NIST increasingly is focusing on the sociotechnical evaluation of AI.  

Voluntary Consensus-Based Standards

NIST leads and participates in the development of technical standards, including international standards, that promote innovation and public trust in systems that use AI. A broad spectrum of standards for AI data, performance and governance are a priority for the use and creation of trustworthy and responsible AI.

A fact sheet describes NIST's AI programs.

Artificial Intelligence Topics

  • AI Test, Evaluation, Validation and Verification (TEVV)
  • Fundamental AI
  • Hardware for AI
  • Machine learning
  • Trustworthy and Responsible AI

The Research: Projects & Programs

  • Deep Learning for MRI Reconstruction and Analysis
  • Emerging Hardware for Artificial Intelligence
  • Embodied AI and Data Generation for Manufacturing Robotics
  • Deep Generative Modeling for Communication Systems Testing and Data Sharing
  • JARVIS-ML

Additional Resources

  • NIST Launches Trustworthy and Responsible AI Resource Center (AIRC): a one-stop shop offering industry, government and academic stakeholders knowledge of AI standards, measurement methods and metrics, data sets, and other resources.
  • Minimizing Harms and Maximizing the Potential of Generative AI
  • NIST Reports First Results From Age Estimation Software Evaluation
  • NIST Launches ARIA, a New Program to Advance Sociotechnical Testing and Evaluation for AI
  • U.S. Secretary of Commerce Gina Raimondo Releases Strategic Vision on AI Safety, Announces Plan for Global Cooperation Among AI Safety Institutes
  • Bias in AI
  • 2024 Artificial Intelligence for Materials Science (AIMS) Workshop

A new future of work: The race to deploy AI and raise skills in Europe and beyond

At a glance.

Amid tightening labor markets and a slowdown in productivity growth, Europe and the United States face shifts in labor demand, spurred by AI and automation. Our updated modeling of the future of work finds that demand for workers in STEM-related, healthcare, and other high-skill professions would rise, while demand for occupations such as office workers, production workers, and customer service representatives would decline. By 2030, in a midpoint adoption scenario, up to 30 percent of current hours worked could be automated, accelerated by generative AI (gen AI). Efforts to achieve net-zero emissions, an aging workforce, and growth in e-commerce, as well as infrastructure and technology spending and overall economic growth, could also shift employment demand.

By 2030, Europe could require up to 12 million occupational transitions, double the prepandemic pace. In the United States, required transitions could reach almost 12 million, in line with the prepandemic norm. Both regions navigated even higher levels of labor market shifts at the height of the COVID-19 period, suggesting that they can handle this scale of future job transitions. The pace of occupational change is broadly similar among countries in Europe, although the specific mix reflects their economic variations.

Businesses will need a major skills upgrade. Demand for technological and social and emotional skills could rise as demand for physical and manual and higher cognitive skills stabilizes. Surveyed executives in Europe and the United States expressed a need not only for advanced IT and data analytics but also for critical thinking, creativity, and teaching and training—skills they report as currently being in short supply. Companies plan to focus on retraining workers, more than hiring or subcontracting, to meet skill needs.

Workers with lower wages face challenges of redeployment as demand reweights toward occupations with higher wages in both Europe and the United States. Occupations with lower wages are likely to see reductions in demand, and workers will need to acquire new skills to transition to better-paying work. If that doesn’t happen, there is a risk of a more polarized labor market, with more higher-wage jobs than workers and too many workers for existing lower-wage jobs.

Choices made today could revive productivity growth while creating better societal outcomes. Embracing the path of accelerated technology adoption with proactive worker redeployment could help Europe achieve an annual productivity growth rate of up to 3 percent through 2030. However, slow adoption would limit that to 0.3 percent, closer to today’s level of productivity growth in Western Europe. Slow worker redeployment would leave millions unable to participate productively in the future of work.


Demand will change for a range of occupations through 2030, including growth in STEM- and healthcare-related occupations, among others

This report focuses on labor markets in nine major economies in the European Union along with the United Kingdom, in comparison with the United States. Technology, including most recently the rise of gen AI, along with other factors, will spur changes in the pattern of labor demand through 2030. Our study, which uses an updated version of the McKinsey Global Institute future of work model, seeks to quantify the occupational transitions that will be required and the changing nature of demand for different types of jobs and skills.

Our methodology

We used methodology consistent with other McKinsey Global Institute reports on the future of work to model trends of job changes at the level of occupations, activities, and skills. For this report, we focused our analysis on the 2022–30 period.

Our model estimates net changes in employment demand by sector and occupation; we also estimate occupational transitions, that is, the net number of workers who need to change occupations, based on which occupations face declining demand by 2030 relative to current employment in 2022. We included ten countries in Europe: nine EU members—the Czech Republic, Denmark, France, Germany, Italy, the Netherlands, Poland, Spain, and Sweden—and the United Kingdom. For the United States, we built on estimates published in our 2023 report Generative AI and the future of work in America.

We included multiple drivers in our modeling: automation potential, net-zero transition, e-commerce growth, remote work adoption, increases in income, aging populations, technology investments, and infrastructure investments.

Two scenarios are used to bookend the work-automation model: “late” and “early.” For Europe, we modeled a “faster” scenario and a “slower” one. For the faster scenario, we use the midpoint, the arithmetic average of our late and early adoption scenarios. For the slower scenario, we use a “mid late” trajectory, the arithmetic average of the late adoption scenario and the midpoint. For the United States, we use the midpoint scenario, based on our earlier research.
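
Because the scenario construction described above amounts to simple averaging of adoption trajectories, it can be written down directly. The sketch below is illustrative only: the late and early adoption shares are placeholder values, not figures from the model.

    # Scenario construction as described in the methodology: the "faster"
    # (midpoint) scenario averages the late and early adoption trajectories,
    # and the "slower" ("mid late") scenario averages the late trajectory
    # with that midpoint. The adoption shares below are placeholders.

    def midpoint(late: float, early: float) -> float:
        return (late + early) / 2

    def mid_late(late: float, early: float) -> float:
        return (late + midpoint(late, early)) / 2

    # Hypothetical shares of hours worked that could be automated by 2030.
    late_adoption, early_adoption = 0.15, 0.40

    print(f"Faster (midpoint) scenario: {midpoint(late_adoption, early_adoption):.0%}")
    print(f"Slower (mid late) scenario: {mid_late(late_adoption, early_adoption):.0%}")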

We also estimate the productivity effects of automation, using GDP per full-time-equivalent (FTE) employee as the measure of productivity. We assumed that workers displaced by automation rejoin the workforce at 2022 productivity levels, net of automation, and in line with the expected 2030 occupational mix.
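
The productivity measure itself is straightforward to express. The sketch below assumes GDP per full-time-equivalent (FTE) employee as the productivity metric, as described above, and derives the implied compound annual growth rate over 2022–30; the GDP and FTE figures are placeholders rather than estimates from the report.

    # Productivity as GDP per FTE, and the implied compound annual growth
    # rate between 2022 and 2030. All input numbers are illustrative.

    def productivity(gdp: float, ftes: float) -> float:
        return gdp / ftes

    def annual_growth(p_2022: float, p_2030: float, years: int = 8) -> float:
        return (p_2030 / p_2022) ** (1 / years) - 1

    p22 = productivity(gdp=17.0, ftes=185.0)   # hypothetical 2022 economy
    p30 = productivity(gdp=21.5, ftes=185.0)   # hypothetical 2030 economy
    print(f"Implied annual productivity growth: {annual_growth(p22, p30):.1%}")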

Amid tightening labor markets and a slowdown in productivity growth, Europe and the United States face shifts in labor demand, spurred not only by AI and automation but also by other trends, including efforts to achieve net-zero emissions, an aging population, infrastructure spending, technology investments, and growth in e-commerce, among others (see sidebar, “Our methodology”).

Our analysis finds that demand for occupations such as health professionals and other STEM-related professionals would grow by 17 to 30 percent between 2022 and 2030 (Exhibit 1).

By contrast, demand for workers in food services, production work, customer services, sales, and office support—all of which declined over the 2012–22 period—would continue to decline until 2030. These jobs involve a high share of repetitive tasks, data collection, and elementary data processing—all activities that automated systems can handle efficiently.

Up to 30 percent of hours worked could be automated by 2030, boosted by gen AI, leading to millions of required occupational transitions

By 2030, our analysis finds that about 27 percent of current hours worked in Europe and 30 percent of hours worked in the United States could be automated, accelerated by gen AI. Our model suggests that roughly 20 percent of hours worked could still be automated even without gen AI, implying a significant acceleration.

These trends will play out in labor markets in the form of workers needing to change occupations. By 2030, under the faster adoption scenario we modeled, Europe could require up to 12.0 million occupational transitions, affecting 6.5 percent of current employment. That is double the prepandemic pace (Exhibit 2). Under the slower scenario we modeled for Europe, the number of occupational transitions needed would amount to 8.5 million, affecting 4.6 percent of current employment. In the United States, required transitions could reach almost 12.0 million, affecting 7.5 percent of current employment. Unlike in Europe, this magnitude of transitions is broadly in line with the prepandemic norm.
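
As a quick consistency check, the transition counts and employment shares quoted above imply the size of the underlying employment base. The sketch below uses only the numbers quoted in the text.

    # Implied employment base = occupational transitions / share of current
    # employment affected, using the figures quoted above.

    def implied_employment(transitions_millions: float, share: float) -> float:
        return transitions_millions / share

    print(f"Europe, faster scenario: {implied_employment(12.0, 0.065):.0f} million employed")
    print(f"Europe, slower scenario: {implied_employment(8.5, 0.046):.0f} million employed")
    print(f"United States:           {implied_employment(12.0, 0.075):.0f} million employed")

Both European scenarios point to roughly the same base of about 185 million employed, as they should, since they describe the same ten countries.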

Both regions navigated even higher levels of labor market shifts at the height of the COVID-19 period. While these were abrupt and painful to many, given the forced nature of the shifts, the experience suggests that both regions have the ability to handle this scale of future job transitions.


Businesses will need a major skills upgrade

The occupational transitions noted above herald substantial shifts in workforce skills in a future in which automation and AI are integrated into the workplace (Exhibit 3). Workers use multiple skills to perform a given task, but for the purposes of our quantification, we identified the predominant skill used.

Demand for technological skills could see substantial growth in Europe and in the United States (increases of 25 percent and 29 percent, respectively, in hours worked by 2030 compared to 2022) under our midpoint scenario of automation adoption (which is the faster scenario for Europe).

Demand for social and emotional skills could rise by 11 percent in Europe and by 14 percent in the United States. Underlying this increase is higher demand for roles requiring interpersonal empathy and leadership skills. These skills are crucial in healthcare and managerial roles in an evolving economy that demands greater adaptability and flexibility.

Conversely, demand for work in which basic cognitive skills predominate is expected to decline by 14 percent. Basic cognitive skills are required primarily in office support or customer service roles, which are highly susceptible to being automated by AI. Activities characterized by these basic cognitive skills that face significant drops in demand include basic data processing as well as basic literacy, numeracy, and communication.

Demand for work in which higher cognitive skills predominate could also decline slightly, according to our analysis. While creativity is expected to remain highly sought after, with a potential increase of 12 percent by 2030, work activities characterized by other advanced cognitive skills such as advanced literacy and writing, along with quantitative and statistical skills, could decline by 19 percent.

Demand for physical and manual skills, on the other hand, could remain roughly level with the present. These skills remain the largest share of workforce skills, representing about 30 percent of total hours worked in 2022. Growth in demand for these skills between 2022 and 2030 could come from the build-out of infrastructure and higher investment in low-emissions sectors, while declines would be in line with continued automation in production work.

Business executives report skills shortages today and expect them to worsen

A survey we conducted of C-suite executives in five countries shows that companies are already grappling with skills challenges, including a skills mismatch, particularly in technological, higher cognitive, and social and emotional skills: about one-third of the more than 1,100 respondents report a shortfall in these critical areas. At the same time, a notable number of executives say they have enough employees with basic cognitive skills and, to a lesser extent, physical and manual skills.

Within technological skills, companies in our survey reported that their most significant shortages are in advanced IT skills and programming, advanced data analysis, and mathematical skills. Among higher cognitive skills, significant shortfalls are seen in critical thinking and problem structuring and in complex information processing. About 40 percent of the executives surveyed pointed to a shortage of workers with these skills, which are needed for working alongside new technologies (Exhibit 4).


Companies see retraining as key to acquiring needed skills and adapting to the new work landscape

Surveyed executives expect significant changes to their workforce skill levels and worry about not finding the right skills by 2030. More than one in four survey respondents said that failing to capture the needed skills could directly harm financial performance and indirectly impede their efforts to leverage the value from AI.

To acquire the skills they need, companies have three main options: retraining, hiring, and contracting workers. Our survey suggests that executives are looking at all three, with retraining the most widely reported tactic for addressing the skills mismatch: executives at companies that cited retraining said they would retrain 32 percent of their workforce on average. The scale of retraining needs varies by sector; for example, respondents in the automotive industry expect 36 percent of their workforce to be retrained, compared with 28 percent in financial services. Executives who cited hiring or contracting said they would hire an average of 23 percent of their workforce and contract an average of 18 percent.

Occupational transitions will affect high-, medium-, and low-wage workers differently

All ten European countries we examined for this report may see increasing demand for top-earning occupations. By contrast, workers in the two lowest-wage-bracket occupations could be three to five times more likely to have to change occupations than the top wage earners, our analysis finds. The disparity is much higher in the United States, where workers in the two lowest-wage-bracket occupations are up to 14 times more likely to face occupational shifts than the highest earners. In Europe, middle-wage workers could be twice as affected by occupational transitions as their counterparts in the United States, with 7.3 percent of the working population potentially facing occupational transitions.

Enhancing human capital at the same time as deploying the technology rapidly could boost annual productivity growth


Organizations and policy makers have choices to make; the way they approach AI and automation, along with human capital augmentation, will affect economic and societal outcomes.

We have attempted to quantify at a high level the potential effects of different stances to AI deployment on productivity in Europe. Our analysis considers two dimensions. The first is the adoption rate of AI and automation technologies. We consider the faster scenario and the late scenario for technology adoption. Faster adoption would unlock greater productivity growth potential but also, potentially, more short-term labor disruption than the late scenario.

The second dimension we consider is the level of automated worker time that is redeployed into the economy. This represents the ability to redeploy the time gained by automation and productivity gains (for example, new tasks and job creation). This could vary depending on the success of worker training programs and strategies to match demand and supply in labor markets.

We based our analysis on two potential scenarios: either all displaced workers fully rejoin the economy at a productivity level similar to 2022, or only about 80 percent of the automated workers’ time is redeployed into the economy.

Exhibit 5 illustrates the various outcomes in terms of annual productivity growth rate. The top-right quadrant illustrates the highest economy-wide productivity, with an annual productivity growth rate of up to 3.1 percent. It requires fast adoption of technologies as well as full redeployment of displaced workers. The top-left quadrant also demonstrates technology adoption on a fast trajectory and shows a relatively high productivity growth rate (up to 2.5 percent). However, about 6.0 percent of total hours worked (equivalent to 10.2 million people not working) would not be redeployed in the economy. Finally, the two bottom quadrants depict the failure to adopt AI and automation, leading to limited productivity gains and translating into limited labor market disruptions.
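
The scenario grid in Exhibit 5 can be summarized as a simple mapping from adoption pace and redeployment level to productivity outcomes. The sketch below uses only the growth rates quoted in the text; the two late-adoption cells are not quantified separately there, so they are left unspecified rather than guessed.

    # Exhibit 5 scenario grid: (adoption pace, redeployment of automated
    # worker time) -> annual productivity growth rate. None marks cells the
    # text does not quantify individually.

    productivity_growth = {
        ("faster adoption", "full redeployment"): 0.031,     # top-right quadrant (up to 3.1%)
        ("faster adoption", "partial redeployment"): 0.025,  # top-left quadrant (up to 2.5%, ~80% redeployed)
        ("late adoption", "full redeployment"): None,        # limited gains, per the text
        ("late adoption", "partial redeployment"): None,     # limited gains, per the text
    }

    for (adoption, redeployment), growth in productivity_growth.items():
        label = f"{growth:.1%}" if growth is not None else "not quantified"
        print(f"{adoption:16s} | {redeployment:22s} | {label}")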


Four priorities for companies

The adoption of automation technologies will be decisive in protecting businesses’ competitive advantage in an automation and AI era. To ensure successful deployment at a company level, business leaders can embrace four priorities.

Understand the potential. Leaders need to understand the potential of these technologies, notably including how AI and gen AI can augment and automate work. This includes estimating both the total capacity that these technologies could free up and their impact on role composition and skills requirements. Understanding this allows business leaders to frame their end-to-end strategy and adoption goals with regard to these technologies.

Plan a strategic workforce shift. Once they understand the potential of automation technologies, leaders need to plan the company’s shift toward readiness for the automation and AI era. This requires sizing the workforce and skill needs, based on strategically identified use cases, to assess the potential future talent gap. From this analysis will flow details about the extent of recruitment of new talent, upskilling, or reskilling of the current workforce that is needed, as well as where to redeploy freed capacity to more value-added tasks.

Prioritize people development. To ensure that the right talent is on hand to sustain the company strategy during all transformation phases, leaders could consider strengthening their capabilities to identify, attract, and recruit future AI and gen AI leaders in a tight market. They will also likely need to accelerate the building of AI and gen AI capabilities in the workforce. Nontechnical talent will also need training to adapt to the changing skills environment. Finally, leaders could deploy an HR strategy and operating model to fit the post–gen AI workforce.

Pursue the executive-education journey on automation technologies. Leaders also need to undertake their own education journey on automation technologies to maximize their contributions to their companies during the coming transformation. This includes empowering senior managers to explore the implications of automation technologies and then act as role models for others, as well as bringing all company leaders together to create a dedicated road map to drive business and employee value.

AI and the toolbox of advanced new technologies are evolving at a breathtaking pace. For companies and policy makers, these technologies are highly compelling because they promise a range of benefits, including higher productivity, which could lift growth and prosperity. Yet, as this report has sought to illustrate, making full use of the advantages on offer will also require paying attention to the critical element of human capital. In the best-case scenario, workers’ skills will develop and adapt to new technological challenges. Achieving this goal in our new technological age will be highly challenging—but the benefits will be great.

Eric Hazan is a McKinsey senior partner based in Paris; Anu Madgavkar and Michael Chui are McKinsey Global Institute partners based in New Jersey and San Francisco, respectively; Sven Smit is chair of the McKinsey Global Institute and a McKinsey senior partner based in Amsterdam; Dana Maor is a McKinsey senior partner based in Tel Aviv; Gurneet Singh Dandona is an associate partner and a senior expert based in New York; and Roland Huyghues-Despointes is a consultant based in Paris.


