Understanding and Enhancing Deep Neural Networks with Automated Interpretability
Sun 02.02 10:30 - 11:30
- Faculty Seminar
Bloomfield 527
Abstract:
Deep neural networks are becoming remarkably capable: they can generate realistic images, engage in complex dialogues, analyze intricate data, and perform tasks in an almost human-like way. But how do these models achieve such abilities?
In this talk, I will present a line of work that aims to explain the behaviors of deep neural networks. This includes a new approach for evaluating cross-domain knowledge encoded in generative models, tools for uncovering core mechanisms in large language models, and an analysis of how those mechanisms change under fine-tuning. I will show how to automate and scale the scientific process of interpreting neural networks with the Automated Interpretability Agent, a system that autonomously designs experiments on models’ internal representations to explain their behaviors. I will then demonstrate how such understanding enables mitigating biases and improving model performance. The talk will conclude with a discussion of future directions, including developing universal interpretability tools and extending interpretability methods to automate scientific discovery.