In high-stakes applications, explainability may even be counter-productive
In his 2006 classic The Shock of the Old, David Edgerton argues that historians’ understanding of the history of technology is too dominated by invention. We rightly remember and admire great inventors and scientists. Yet, as Edgerton details, the adoption and use of a technology is often as important as, if not more important than, the invention itself. The way technology is put to use has mattered a great deal historically. Within this context, it is no surprise that ML explainability and interpretability have become hugely significant and debated concepts, sometimes straying into overused slogans. ML is a critical technology currently being integrated into key parts of infrastructure and decision-making processes, so the way in which that adoption takes place is undoubtedly important. Specifically, the extent to which a deployed ML system is interpretable or explainable decisively shapes the human’s role in the operation of that system.
Explainability versus Interpretability
‘Explainability’ refers to the ability of a user or recipient to justify a prediction made by an AI model, and is typically achieved through techniques that offer insight into a complex model. For example, humans may not be able to understand the transformations taking place on the data (though they will understand how the process works at a high level) due to the complexity of the algorithm being used. In this case explainability techniques offer some suggestion of why a complex prediction was made. ‘Interpretability’, by contrast, refers to the ability to causally explain why a prediction has been made.
In this sense, interpretability is a stronger version of explainability (a more thorough, causality-based account of a model’s outputs). Explainability is often used to justify predictions made by black-box models, which are not interpretable. For example, by permuting the inputs or fitting a surrogate model to the predictions of a black-box model we can perhaps better explain what is going on in the prediction process, but we cannot causally prove why a decision was made.
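The permutation idea can be sketched in a few lines of Python. This is a toy illustration, not any particular library’s implementation: the ‘black-box’ model and dataset here are invented for the example, and the model secretly relies only on its first feature.

```python
import random

# A toy "black-box" model: we can query predictions but not inspect internals.
# Unknown to the analyst, it relies only on feature 0 and ignores feature 1.
def black_box_predict(row):
    return 1 if row[0] > 0.5 else 0

# Toy dataset: feature 0 is informative, feature 1 is pure noise.
random.seed(0)
data = [[random.random(), random.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in data]

def accuracy(rows):
    return sum(black_box_predict(r) == y for r, y in zip(rows, labels)) / len(labels)

baseline = accuracy(data)

def permutation_importance(feature_idx):
    # Shuffle one feature's column, breaking its link to the labels,
    # and record the resulting drop in accuracy.
    shuffled_col = [row[feature_idx] for row in data]
    random.shuffle(shuffled_col)
    permuted = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                for row, v in zip(data, shuffled_col)]
    return baseline - accuracy(permuted)

print(permutation_importance(0))  # large accuracy drop: feature 0 matters
print(permutation_importance(1))  # zero drop: the model ignores feature 1
```

The shuffle tells us which inputs the model is sensitive to, but nothing about why it combines them as it does; that gap is exactly the explainability/interpretability distinction.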
Is ‘explainability’ enough?
However, some models may never be interpretable, especially deep learning (DL) models, because their inputs are transformed unrecognisably through the training process. One of the central tenets of DL is ‘representation learning’: the model iteratively transforms the input it receives into new representations as it passes through successive layers of a neural network. These transformations aim to maximise the signal in the data, giving the algorithm more traction to predict accurately. In other words, the input transformation process allows the machine to gain more purchase on the input while restricting a human analyst’s ability to understand that same input. This trade-off is inherent to neural networks, and it is one of the reasons why this powerful set of models is problematic. These are the black-box models to which ad hoc explainability tools are attached (e.g. saliency maps in computer vision, SHAP values for tabular data) in order to offset the inherent inability of humans to understand these transformed inputs.
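To make the point concrete, here is a toy sketch of representation learning’s effect on legibility. The random weights stand in for learned parameters, and the four-value ‘input’ is invented for the example; the point is only that after a few rounds of weighted sums and non-linearities, the vector bears no obvious relation to the original input.

```python
import random

random.seed(1)

def relu(x):
    return max(0.0, x)

def layer(vec, weights):
    # One dense layer: weighted sums followed by a ReLU non-linearity.
    return [relu(sum(w * v for w, v in zip(row, vec))) for row in weights]

# A human-readable input, e.g. normalised pixel intensities.
x = [0.1, 0.9, 0.3, 0.7]

# Three random 4x4 weight matrices standing in for learned parameters.
weights = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(4)]
           for _ in range(3)]

representation = x
for w in weights:
    representation = layer(representation, w)
    # Each successive representation drifts further from the raw input.
    print(representation)
```

A human can read the original intensities; the final representation is only meaningful to the layers downstream of it.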
Indeed, it has been shown that some popular explainability techniques for justifying the predictions of deep learning models are not reliable. Saliency maps, a common method for understanding the predictions of convolutional neural networks, are meant to reveal which image pixels were most important in making a prediction. However, it has been demonstrated that these methods do not reliably identify the key areas of an image used for classification, leading us to question their utility. Is there any use in an explainability method that may be incorrect? It might, in fact, give the user a false sense of confidence.
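For intuition about what a saliency map is measuring, a sensitivity map can be sketched with finite differences on a toy scoring function. This is a simplified stand-in, not the gradient-based CNN methods critiqued above; the ‘model’ and its reliance on the centre pixel are invented for the example.

```python
# Toy "image classifier": in a real CNN, saliency uses the gradient of the
# class score with respect to each pixel; here we approximate that
# sensitivity with finite differences on a hand-written scoring function.
def score(image):
    # This toy model mostly attends to the centre pixel (index 4).
    return 3.0 * image[4] + 0.01 * sum(image)

def saliency(image, eps=1e-4):
    base = score(image)
    sal = []
    for i in range(len(image)):
        perturbed = list(image)
        perturbed[i] += eps  # nudge one pixel
        sal.append(abs(score(perturbed) - base) / eps)
    return sal

image = [0.5] * 9  # a flat 3x3 "image", flattened
print(saliency(image))  # the centre pixel dominates the map
```

Even here the map only says which pixels the score is locally sensitive to; it does not causally explain the decision, which is why such maps can mislead when read as explanations.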
How do we define ‘high-stakes’?
A key question, therefore, must be when interpretability is a necessary prerequisite for implementing an AI system. Yoshua Bengio’s influential distinction between ‘System 1’ and ‘System 2’ DL may be helpful here in understanding the defences that could be made for still using black-box models with explainability. He argues that System 1 DL (‘fast’, perception-like thinking) has been achieved by ML systems, for example in computer vision. However, System 2 DL (‘slow’ logical reasoning that may involve generalising outside the distribution of the training data) has not yet been achieved. Bengio does not make this argument himself, but on this basis some may argue that System 1 DL does not require interpretability: most of us are not able to explain to our friends why we saw a certain object or how we were able to smell something in a certain way.
However, when the System 1 perceptual power of DL is implemented within applications (in David Edgerton’s terms, the innovation part of technological change), human reasoning and logic is sometimes replaced even if the model itself is not performing logic or reasoning. For example, a computer vision model that takes chest x-rays as input and predicts whether the corresponding patient has an acute disease replaces the reasoning a radiologist might use to make a diagnosis from an x-ray. At the same time, applications such as this can significantly improve patient outcomes by ruling out scans where the model predicts with a high level of confidence that the scan is normal, giving radiologists more time to diagnose tricky cases.
This is clearly an example of a high-stakes decision made using a black-box model. However, some other implementations are more difficult to classify. Is using Google Search a high-stakes decision? Are Netflix recommendations high-stakes decisions? A rigorous definition of ‘high stakes’ may be needed to reach a consensus on what level of interpretability is required for each use case. In the meantime, we should be very careful about calling for explainability methods, especially when they confer unwarranted confidence in the reasoning behind model predictions.
C. Rudin, ‘Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead’ (2019), Nature Machine Intelligence, vol. 1, 206–215.
Saporta et al., ‘Deep learning saliency maps do not accurately highlight diagnostically relevant regions for medical image interpretation’ (2021), medRxiv.