publications
2025
- Extracting Interpretable Task-Specific Circuits from Large Language Models for Faster Inference. AAAI Conference on Artificial Intelligence (In Press), 2025
2024
- How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
- Detecting and understanding vulnerabilities in language models via mechanistic interpretability. In International Joint Conference on Artificial Intelligence (IJCAI), 2024