Liberating from Correlations: An Interventional Benchmark for Concept-Based Explanations
Tue 18.11 11:00 - 11:30
- Graduate Student Seminar
Bloomfield 527
Abstract: Understanding the causal influence of high-level concepts on NLP model decisions is central to interpretability, yet current evaluation practices often fall short of capturing true causal effects. We introduce LIBERTy, a novel benchmark designed to rigorously evaluate concept-based explanation methods under controlled interventions. Unlike earlier resources that rely on simplified tasks and isolated edits, LIBERTy simulates realistic multi-domain scenarios, including CV screening, workplace violence prediction, and disease diagnosis, using LLM-generated texts grounded in complex structured causal graphs. Each dataset is constructed from pairs of test examples and their counterfactuals, enabling fine-grained quantitative measurement of each concept's individual causal effect. We evaluate leading explanation techniques, such as Matching, LEACE, ConceptShap, TCAV, and counterfactual generation, applied to multiple NLP models, including fine-tuned models and zero-shot LLMs. Our results reveal significant gaps in causal faithfulness across explanation methods and evaluated models, highlighting persistent limitations in their ability to accurately estimate causal effects.
