The seminar is an international online event focused on exploring the theoretical foundations of interpretable and explainable AI. Its goal is to exchange ideas and form a supportive community for those interested in the topic.
Practicalities
- Monthly seminar, 16.00 Central European Time (CET) / 10.00 am Eastern Standard Time (EST). NB: One hour later than before!
- Zoom link: https://uva-live.zoom.us/j/87120549999
- Sign up:
- YouTube channel
Organizers:
Schedule
2024/2025
February 4 | Jeremias Sulam | Testing Semantic Importance via Conditional Independence and Betting
Abstract: In recent years, the demand for interpretable machine learning models has surged, particularly in high-stakes domains, and a growing collection of methods attempt to ‘explain’ off-the-shelf predictors. Yet it often remains unclear how to interpret the reported importance of features or concepts. This talk reexamines interpretability through the lens of conditional independence tests, offering answers with precise statistical guarantees, including control over Type I error and false discovery rate. We introduce methods based on online testing (also known as testing by betting) to rank importance across both input features and semantic constructs, and explore connections to game-theoretic approaches to explanation.
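The "testing by betting" idea can be made concrete with a small sketch. The snippet below is not the procedure from the talk or paper: the model, the data, the payoff, and the resampling step are all invented for illustration. It only shows the wealth-process mechanism: evidence that a feature matters multiplies a wealth variable, and the null of unimportance is rejected once the wealth crosses 1/alpha.

```python
# Toy "testing by betting" sketch for feature importance (not the paper's method).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model and data, invented for illustration: feature 0 drives the output.
def model(X):
    return (X[:, 0] + 0.1 * X[:, 1] > 0).astype(float)

n, d = 2000, 5
X = rng.normal(size=(n, d))
y = model(X)  # labels are the model's own outputs, so the unperturbed loss is zero

def betting_test(j, alpha=0.05, bet=0.5):
    """Bet that replacing feature j by another sample's value increases the 0-1 loss."""
    wealth = 1.0
    for t in range(0, n - 1, 2):
        x = X[t].copy()
        x_perm = X[t].copy()
        x_perm[j] = X[t + 1, j]                  # crude stand-in for a conditional resample
        loss = float(model(x[None])[0] != y[t])
        loss_perm = float(model(x_perm[None])[0] != y[t])
        payoff = loss_perm - loss                # in [-1, 1]; stays ~0 if feature j is unimportant
        wealth *= 1.0 + bet * payoff             # wealth stays positive since |bet * payoff| < 1
        if wealth >= 1.0 / alpha:                # e-value style rejection threshold
            return True, wealth
    return False, wealth

for j in range(d):
    rejected, w = betting_test(j)
    print(f"feature {j}: important={rejected}, final wealth={w:.1f}")
```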
January 9 | Yishay Mansour | A Theory of Interpretable Approximations
Abstract: Can a deep neural network be approximated by a small decision tree based on simple features? This question and its variants are behind the growing demand for machine learning models that are *interpretable* by humans. In this work we study such questions by introducing *interpretable approximations*, a notion that captures the idea of approximating a target concept c by a small aggregation of concepts from some base class H. In particular, we consider the approximation of a binary concept c by decision trees based on a simple class H (e.g., of bounded VC dimension), and use the tree depth as a measure of complexity. Our primary contribution is the following remarkable trichotomy. For any given pair of H and c, exactly one of these cases holds: (i) c cannot be approximated by H with arbitrary accuracy; (ii) c can be approximated by H with arbitrary accuracy, but there exists no universal rate that bounds the complexity of the approximations as a function of the accuracy; or (iii) there exists a constant k that depends only on H and c such that, for *any* data distribution and *any* desired accuracy level, c can be approximated by H with a complexity not exceeding k. This taxonomy stands in stark contrast to the landscape of supervised classification, which offers a complex array of distribution-free and universally learnable scenarios. We show that, in the case of interpretable approximations, even a slightly nontrivial a priori guarantee on the complexity of approximations implies approximations with constant (distribution-free and accuracy-free) complexity. We extend our trichotomy to classes H of unbounded VC dimension and give characterizations of interpretability based on the algebra generated by H.
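For readers who prefer symbols, here is one possible way to state the trichotomy from the abstract; the notation is ours and the paper's formal definitions may differ in detail.

```latex
% One possible formalization of the trichotomy (our notation; the paper's
% definitions may differ). Write err_D(f, c) for the error of f against the
% target concept c under distribution D, and T_k(H) for decision trees of
% depth at most k whose internal nodes query concepts from the base class H.
\[
  \mathrm{err}_D(f,c) = \Pr_{x \sim D}\bigl[f(x) \neq c(x)\bigr], \qquad
  \mathcal{T}_k(H) = \{\text{decision trees of depth} \le k \text{ over } H\}.
\]
For every pair $(H, c)$, exactly one of the following holds:
\begin{enumerate}
  \item[(i)] there are a distribution $D$ and an $\varepsilon > 0$ such that no tree
    over $H$, of any depth, achieves error at most $\varepsilon$;
  \item[(ii)] every $(D, \varepsilon)$ admits some tree over $H$ with error at most
    $\varepsilon$, but no bound $k(\varepsilon)$ on the required depth holds uniformly
    over all distributions $D$;
  \item[(iii)] there is a constant $k = k(H, c)$ such that for every $D$ and every
    $\varepsilon > 0$ some $f \in \mathcal{T}_k(H)$ satisfies
    $\mathrm{err}_D(f, c) \le \varepsilon$.
\end{enumerate}
```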
December 5 | Depen Morwani | Feature emergence via margin maximization: case studies in algebraic tasks
Abstract: Understanding the internal representations learned by neural networks is a cornerstone challenge in the science of machine learning. While there have been significant recent strides in some cases towards understanding how neural networks implement specific target functions, this paper explores a complementary question: why do networks arrive at particular computational strategies? Our inquiry focuses on the algebraic learning tasks of modular addition, sparse parities, and finite group operations. Our primary theoretical findings analytically characterize the features learned by stylized neural networks for these algebraic tasks. Notably, our main technique demonstrates how the principle of margin maximization alone can be used to fully specify the features learned by the network. Specifically, we prove that the trained networks utilize Fourier features to perform modular addition and employ features corresponding to irreducible group-theoretic representations to perform compositions in general groups, aligning closely with the empirical observations of Nanda et al. and Chughtai et al. More generally, we hope our techniques can help to foster a deeper understanding of why neural networks adopt specific computational strategies.
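The role of Fourier features in modular addition can be seen with a short calculation: averaging cosines over all frequencies yields an exact indicator of a + b ≡ c (mod p). The snippet below is a generic numerical check of that identity, not the paper's construction or a trained network.

```python
# Why Fourier features compute modular addition: averaging cos(2*pi*k*(a+b-c)/p)
# over all frequencies k = 0, ..., p-1 equals 1 when a + b ≡ c (mod p) and 0 otherwise.
# Generic numerical check only; not the construction analyzed in the paper.
import numpy as np

p = 13
a, b = 5, 11                      # arbitrary inputs
k = np.arange(p)                  # all frequencies 0, ..., p-1

logits = np.array([
    np.mean(np.cos(2 * np.pi * k * (a + b - c) / p))   # "output logit" for answer c
    for c in range(p)
])

print(np.round(logits, 6))        # 1.0 at c = (a + b) % p, ~0.0 elsewhere
print("argmax:", int(np.argmax(logits)), " (a + b) % p:", (a + b) % p)
```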
November 18 | Andrej Risteski | Discernible patterns in trained Transformers: a view from simple linguistic sandboxes
Abstract: Inspecting learned models for intelligible patterns has become a common way to try to reverse-engineer the algorithm that a model is implementing. In this talk, I will present two case studies, in which we analyze the patterns that emerge when training Transformer-based models on data isolating simple linguistic abstractions of semantics and syntax: topic models and context-free grammars. Concretely, a topic model is a simple bag-of-words model in which co-occurrence patterns of words capture a simple notion of semantic correlation; a context-free grammar is a formal language encoding the parsing structure induced by grammatical rules. In the former case, we show that simple, one-layer attention Transformers trained with gradient descent learn to encode the co-occurrence structure in a natural way in the attention patterns, as well as the value matrix. In the latter case, we show that the set of optima of the standard autoregressive loss, even for very simple Transformer models and even with "natural" choices of token embeddings, is qualitatively rich. In particular, the attention pattern of a single layer can be “nearly randomized” while preserving the functionality of the network, rendering "myopic" methods of inspecting individual heads or weight matrices in the Transformer misleading.
Video recording
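For readers unfamiliar with the first sandbox: a topic model here means a bag-of-words generative process in which words co-occur because they are drawn from shared topics. The snippet below is only a toy data generator with invented parameters, showing the co-occurrence structure the abstract refers to; it is not the training setup from the talk.

```python
# Toy bag-of-words topic-model sandbox: each document mixes a few topics, and words
# co-occur when their topics do. All parameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size, doc_len, n_docs = 3, 30, 20, 500

# Each topic puts most of its mass on a disjoint slice of the vocabulary.
topic_word = np.full((n_topics, vocab_size), 1e-3)
for t in range(n_topics):
    topic_word[t, t * 10:(t + 1) * 10] = 1.0
topic_word /= topic_word.sum(axis=1, keepdims=True)

docs = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha=np.full(n_topics, 0.3))    # topic mixture per document
    topics = rng.choice(n_topics, size=doc_len, p=theta)   # topic per token
    words = [rng.choice(vocab_size, p=topic_word[t]) for t in topics]
    docs.append(words)

# Word co-occurrence counts: a block structure along topics emerges.
co = np.zeros((vocab_size, vocab_size))
for words in docs:
    for i in words:
        for j in words:
            co[i, j] += 1
print(co[:10, :10].astype(int))   # words 0-9 (topic 0) co-occur heavily with each other
```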
October 17 | Robi Bhattacharjee | Auditing Local Explanations is Hard
Abstract: In sensitive contexts, providers of machine learning algorithms are increasingly required to give explanations for their algorithms' decisions. However, explanation receivers might not trust the provider, who could potentially output misleading or manipulated explanations. In this work, we investigate an auditing framework in which a third-party auditor or a collective of users attempts to sanity-check explanations: they can query model decisions and the corresponding local explanations, pool all the information received, and then check for basic consistency properties. We prove upper and lower bounds on the number of queries that are needed for an auditor to succeed within this framework. Our results show that successful auditing requires a potentially exorbitant number of queries, particularly in high-dimensional cases. Our analysis also reveals that a key property is the "locality" of the provided explanations, a quantity that has so far received little attention in the explainability literature. Looking forward, our results suggest that for complex, high-dimensional settings, merely providing a pointwise prediction and explanation could be insufficient, as there is no way for the users to verify that the provided explanations are not completely made up.
Video recording
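The role of locality and dimension can be illustrated with a back-of-the-envelope experiment (numbers and setup invented, not the paper's bounds): an auditor can only cross-check two queries whose locality regions overlap, and such pairs become vanishingly rare as the dimension grows.

```python
# Why auditing local explanations gets hard in high dimension: with a fixed locality
# radius, the fraction of query pairs close enough to cross-check collapses as the
# dimension grows. Toy illustration with invented numbers, not the paper's analysis.
import numpy as np

rng = np.random.default_rng(1)
n_queries, radius = 500, 0.5      # hypothetical query budget and claimed locality radius

for d in (2, 5, 10, 20):
    X = rng.uniform(-1, 1, size=(n_queries, d))              # queries in [-1, 1]^d
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    overlapping = (dists < radius).sum() - n_queries          # exclude self-pairs
    print(f"d={d:2d}: {overlapping // 2} checkable pairs out of "
          f"{n_queries * (n_queries - 1) // 2}")
```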
September 5 | Lesia Semenova | On the existence of simpler-yet-accurate machine learning models
Abstract: In high-stakes decision domains such as healthcare, lending, and criminal justice, the predictions made by deployed AI systems can significantly impact human lives. Understanding why models make specific predictions is as crucial as ensuring their good performance, making interpretability a key component of a trustworthy decision-making process. However, there has been a longstanding belief in the community that there is a trade-off between accuracy and interpretability. In this talk, I will formally demonstrate that such a trade-off does not hold for many datasets in high-stakes decision domains and that simpler models often perform as well as black-box models. Specifically, I will present a mechanism of the data generation process, combined with the choices typically made by analysts during the learning process, that leads to the existence of simpler-yet-accurate models.
Bio: Lesia Semenova is a postdoctoral researcher at Microsoft Research, NYC. She completed her PhD at Duke University and, prior to that, worked at the Samsung R&D Institute Ukraine. Her research interests span interpretable machine learning, responsible and trustworthy AI, reinforcement learning, reasoning, and AI in healthcare. The student teams she has coached won the ASA Data Challenge Expo twice and placed third in a competition on scholarly document processing. She was selected as one of the 2024 Rising Stars in Computational and Data Sciences.
Video recording
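A quick synthetic experiment conveys the flavour of the claim (our own toy setup, not the talk's analysis): when labels are noisy and the underlying rule is simple, a small decision tree matches a boosted ensemble on held-out data.

```python
# Toy illustration of "simpler-yet-accurate": with 20% label noise and a simple
# underlying rule, a depth-3 tree and a gradient-boosted ensemble both sit near the
# noise ceiling. Synthetic data invented for illustration, not the talk's mechanism.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, d = 5000, 10

X = rng.normal(size=(n, d))
signal = (X[:, 0] > 0) & (X[:, 1] > 0)        # simple axis-aligned rule
noise = rng.random(n) < 0.2                    # 20% of labels flipped
y = np.where(noise, ~signal, signal).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

simple = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)
black_box = GradientBoostingClassifier().fit(X_tr, y_tr)

print("depth-3 tree  accuracy:", simple.score(X_te, y_te))
print("boosted trees accuracy:", black_box.score(X_te, y_te))
```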
2023/2024
July 11 | Sanjoy Dasgupta | Recent progress on interpretable clustering
Abstract: The widely-used k-means procedure returns k clusters that have arbitrary convex shapes. In high dimension, such a clustering might not be easy to understand. A more interpretable alternative is to constrain the clusters to be the leaves of a decision tree with axis-parallel splits; then each cluster is a hyper-rectangle given by a small number of features. Is it always possible to find clusterings that are interpretable in this sense and yet have k-means cost that is close to the unconstrained optimum? A recent line of work has answered this in the affirmative and moreover shown that these interpretable clusterings are easy to construct. I will give a survey of these results: algorithms, methods of analysis, and open problems.
Video recording
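The object of study is easy to prototype. The sketch below uses a generic surrogate, fitting a shallow classification tree to k-means labels, rather than the specific algorithms surveyed in the talk, and compares the k-means cost of the tree-induced clustering with the unconstrained one.

```python
# Interpretable clustering prototype: restrict clusters to leaves of a small
# axis-parallel tree and compare the k-means cost with unconstrained k-means.
# Generic surrogate (tree fit to k-means labels), not the algorithms from the survey.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=2000, centers=5, n_features=8, random_state=0)
k = 5

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# A tree with at most k leaves: each leaf is a hyper-rectangle, i.e. a cluster
# described by a handful of axis-parallel threshold rules.
tree = DecisionTreeClassifier(max_leaf_nodes=k, random_state=0).fit(X, km.labels_)
tree_labels = tree.predict(X)

def kmeans_cost(X, labels):
    """Sum of squared distances of points to their cluster means."""
    cost = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        cost += ((pts - pts.mean(axis=0)) ** 2).sum()
    return cost

print("unconstrained k-means cost:  ", round(kmeans_cost(X, km.labels_), 1))
print("tree-induced clustering cost:", round(kmeans_cost(X, tree_labels), 1))
```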
June 20 | Blair Bilodeau | Impossibility Theorems for Feature Attribution
Abstract: Despite a sea of interpretability methods that can produce plausible explanations, the field has also empirically seen many failure cases of such methods. In light of these results, it remains unclear for practitioners how to use these methods and choose between them in a principled way. In this paper, we show that for moderately rich model classes (easily satisfied by neural networks), any feature attribution method that is complete and linear (for example, Integrated Gradients and SHAP) can provably fail to improve on random guessing for inferring model behaviour. Our results apply to common end-tasks such as characterizing local model behaviour, identifying spurious features, and algorithmic recourse. One takeaway from our work is the importance of concretely defining end-tasks: once such an end-task is defined, a simple and direct approach of repeated model evaluations can outperform many other complex feature attribution methods.
Paper: https://arxiv.org/abs/2212.11870
Video recording
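The methods covered by the result are those that are complete and linear. As a reminder of what "complete" means, the snippet below computes exact Shapley values for a tiny invented model by brute force and checks that the attributions sum to f(x) minus the baseline value; it illustrates the definition, not the impossibility result itself.

```python
# "Complete" feature attributions sum to f(x) - f(baseline). Exact Shapley values
# for a tiny hypothetical model, computed by brute force over feature subsets.
from itertools import combinations
from math import factorial

def f(x):
    # tiny hypothetical model on 3 features, invented for illustration
    return 2.0 * x[0] + x[1] * x[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
d = len(x)

def value(subset):
    # features in `subset` take their true value, the rest take the baseline value
    z = [x[i] if i in subset else baseline[i] for i in range(d)]
    return f(z)

def shapley(j):
    others = [i for i in range(d) if i != j]
    phi = 0.0
    for r in range(d):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
            phi += weight * (value(set(S) | {j}) - value(set(S)))
    return phi

attributions = [shapley(j) for j in range(d)]
print("attributions:", attributions)
print("sum:", sum(attributions), " f(x) - f(baseline):", f(x) - f(baseline))
```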
May 7 | Hidde Fokkema | Attribution-based Explanations that Provide Recourse Cannot be Robust
Abstract: Since most machine learning systems are not inherently interpretable, a class of explainable machine learning methods tries to attribute the importance of the input features to the outcome of the model. We show that two often-proposed requirements of good attribution-based explanations are actually mathematically incompatible. The first requirement is to provide recourse to users: if the user is unhappy with the decision, the explanation should tell them what they would need to change to improve the decision. The second requirement is robustness: small changes in a user's features (e.g. due to rounding or measurement errors) should not cause large changes in the explanations. We show that no method can always provide recourse and be robust, even though both properties can be guaranteed individually. For some restricted sets of models, it is still possible for an attribution method to be robust and provide recourse, and I will discuss some examples where this occurs. However, the message will be that these classes are often simple enough that they do not warrant an explanation. I will further illustrate our findings with counterexamples to at least one of the requirements for popular explanation methods such as SHAP, LIME, Integrated Gradients, and SmoothGrad.
Video recording
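A one-dimensional toy example conveys the tension (our own illustration; the paper's definitions are more general): for a threshold classifier, any explanation that provides recourse must point toward the decision boundary, so it flips sign across the boundary and cannot be robust there.

```python
# Toy version of the recourse-vs-robustness tension: for f(x) = 1[x > 0], an
# attribution that provides recourse must point toward the decision boundary,
# so it flips sign across x = 0 and cannot be robust to tiny input changes.
# One-dimensional illustration only; the paper's setting is more general.

def f(x):
    return 1.0 if x > 0 else 0.0

def recourse_explanation(x):
    """Direction the user should move to change the decision (provides recourse)."""
    return 1.0 if f(x) == 0.0 else -1.0

eps = 1e-6
x_minus, x_plus = -eps, +eps

print("inputs differ by:        ", x_plus - x_minus)
print("explanation at x = -eps: ", recourse_explanation(x_minus))
print("explanation at x = +eps: ", recourse_explanation(x_plus))
# The explanations differ by 2 even though the inputs are 2e-6 apart:
# recourse forces a discontinuity at the boundary, violating robustness there.
```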
April 4 | Damien Garreau | A Sea of Words: An In-Depth Analysis of Anchors for Text Data
Abstract: Anchors (Ribeiro et al., 2018) is a post-hoc, rule-based interpretability method. For text data, it proposes to explain a decision by highlighting a small set of words (an anchor) such that the model to be explained has similar outputs when they are present in a document. In this talk, I will present a first attempt to theoretically understand Anchors, considering that the search for the best anchor is exhaustive. I will give explicit results on shortcut models and linear models when the vectorization step is TF-IDF and word replacement is a fixed out-of-dictionary token.
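To make the setup concrete, here is a heavily simplified Anchors-style search in the spirit of Ribeiro et al. (2018), with an invented corpus, a TF-IDF plus logistic regression pipeline, and a fixed out-of-dictionary replacement token; it is not the exact algorithm or the models analyzed in the talk.

```python
# Simplified Anchors-style search for text: candidate anchors are small word sets,
# and a perturbation keeps anchor words while replacing each other word, at random,
# by a fixed out-of-dictionary token. Corpus and parameters invented for illustration.
from itertools import combinations

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

docs = ["great movie loved it", "terrible movie hated it",
        "loved the acting great film", "hated the plot terrible film",
        "great fun loved every minute", "terrible boring hated every minute"]
labels = [1, 0, 1, 0, 1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(docs), labels)

def predict(text):
    return clf.predict(vec.transform([text]))[0]

def precision(words, anchor, n_samples=200, oov="qqqq"):
    """Fraction of perturbations (non-anchor words randomly replaced by an
    out-of-dictionary token) on which the prediction is unchanged."""
    original = predict(" ".join(words))
    hits = 0
    for _ in range(n_samples):
        perturbed = [w if i in anchor or rng.random() < 0.5 else oov
                     for i, w in enumerate(words)]
        hits += predict(" ".join(perturbed)) == original
    return hits / n_samples

doc = "great movie loved it"
words = doc.split()
# Exhaustive search over small candidate anchors, preferring high precision then small size.
candidates = [set(c) for r in (1, 2) for c in combinations(range(len(words)), r)]
best = max(candidates, key=lambda a: (precision(words, a), -len(a)))
print("anchor:", [words[i] for i in sorted(best)])
```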