Anthropic has published what it calls “the most detailed map yet of a large language model’s internal representations,” using sparse autoencoders (SAEs) to identify and catalog individual “features” that fire in response to specific concepts. An SAE rewrites a model’s internal activations as a combination of a small number of active features drawn from a much larger dictionary.
The research team applied SAEs to Claude 3 and identified over 10 million distinct features, each corresponding to a specific concept, from “the Eiffel Tower” to “ironic sarcasm” to “the concept of recursion.” Critically, they demonstrated that many of these features are causally linked to model outputs: amplifying or suppressing a feature changes the model’s behavior in predictable ways.
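To make the idea concrete, here is a minimal sketch of what an SAE does to a single activation vector, and what "activating" a feature means. It is a toy under stated assumptions, not Anthropic's implementation: the weights are hand-picked rather than learned, the encoder and decoder share weights, the activation has only two dimensions, and the helper names (`encode`, `decode`, `clamp_feature`) are hypothetical.

```python
# Toy sparse autoencoder (SAE) over a 2-dimensional "activation" vector.
# All weights are hand-picked for illustration; a real SAE learns them
# from model activations and has far more features than dimensions.

def encode(x, w_enc, b_enc):
    """Feature activations: f[i] = ReLU(w_enc[i] . x + b_enc[i])."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]

def decode(f, w_dec, b_dec):
    """Reconstruction: x_hat = b_dec + sum_i f[i] * w_dec[i]."""
    return [b_dec[j] + sum(fi * row[j] for fi, row in zip(f, w_dec))
            for j in range(len(b_dec))]

def clamp_feature(f, index, value):
    """Return a copy of f with one feature forced to a chosen value."""
    steered = list(f)
    steered[index] = value
    return steered

# Three features, each a direction in activation space (assumed values).
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W_ENC = W_DEC          # tied weights, an assumption made for brevity
B_ENC = [0.0, 0.0, 0.0]
B_DEC = [0.0, 0.0]

x = [2.0, 0.5]                                   # stand-in for a model activation
features = encode(x, W_ENC, B_ENC)               # only some features are active
reconstruction = decode(features, W_DEC, B_DEC)  # rebuilt from active features
steered = decode(clamp_feature(features, 1, 5.0), W_DEC, B_DEC)

print(features, reconstruction, steered)
```

In this toy, the third feature stays at zero because the input points away from its direction, which is the sparsity the technique relies on. Forcing feature 1 to a high value and decoding yields an activation pushed along that feature's direction, which is the basic move behind steering a model with a feature.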
Why This Matters
Interpretability has long been one of AI’s hardest problems. Understanding why a model produces a specific output, rather than just predicting what it will produce, is essential for building AI systems that are safe, reliable, and aligned with human intentions.
This research suggests a path toward “surgery” on AI models: identifying and removing specific capabilities or biases by targeting the corresponding features. Anthropic demonstrated this by suppressing the model’s ability to generate harmful content while leaving general capabilities intact.
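The suppression step can be sketched as subtracting one feature's contribution from an activation while leaving everything else in place. This is an illustrative toy, not the procedure Anthropic used: the directions are assumed to be unit length and orthogonal, the feature labels are invented, and `ablate_feature` is a hypothetical helper. In a real model feature directions overlap, so removing one can disturb others.

```python
# Toy feature ablation: remove one feature's contribution from an activation.
# Directions are hand-picked, unit-length, and orthogonal (an assumption);
# real feature directions overlap, so real ablations are less clean.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def feature_strength(x, direction, bias=0.0):
    """ReLU feature activation for one direction."""
    return max(0.0, dot(direction, x) + bias)

def ablate_feature(x, direction, bias=0.0):
    """Subtract the feature's contribution; the rest of x is untouched."""
    strength = feature_strength(x, direction, bias)
    return [v - strength * d for v, d in zip(x, direction)]

UNWANTED = [0.0, 1.0, 0.0]   # hypothetical "unwanted behavior" feature
GENERAL = [1.0, 0.0, 0.0]    # hypothetical unrelated feature

x = [3.0, 4.0, 1.0]          # stand-in for a model activation
edited = ablate_feature(x, UNWANTED)

before = feature_strength(x, GENERAL)
after = feature_strength(edited, GENERAL)
print(edited, before, after)
```

Here the unwanted feature drops to zero after the edit while the unrelated feature reads the same before and after, which is the toy analogue of removing one behavior and leaving general capabilities intact.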
Limitations
The approach currently works best on Claude 3-scale models (estimated at around 100B parameters). Scaling to GPT-5-scale models (rumored 1T+ parameters) remains an open challenge. The researchers also note that some features are “polysemantic,” firing for multiple distinct and sometimes unrelated concepts, which makes them harder to interpret.