
OpenAI researchers are experimenting with a new approach to designing neural networks that aims to make AI models easier to understand, debug, and control. Sparse models could give enterprises a clearer view of how these models make decisions.
Understanding how models arrive at their responses, a big selling point of reasoning models for enterprises, can give organizations confidence when they turn to AI models for insights.
Rather than analyzing and evaluating models only by their post-training performance, the method has OpenAI scientists and researchers build interpretability and understanding in from the start through sparse circuits.
OpenAI notes that much of the opacity of AI models stems from how most models are designed, so researchers must create workarounds to gain a better understanding of model behavior.
“Neural networks power today’s most capable AI systems, but they are difficult to understand,” OpenAI wrote in a blog post. “We don’t write these models with clear step-by-step instructions. Instead, they learn by adjusting billions of internal connections or weights until they master a task. We design the training rules, but not the specific behaviors that emerge, and the result is a dense web of connections that no human can easily understand.”
To make this tangle easier to interpret, OpenAI investigated an architecture that trains untangled neural networks, making them easier to understand. The team trained language models with an architecture similar to existing models like GPT-2, using the same training scheme.
The result: improved interpretability.
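To make the idea of a weight-sparse architecture concrete, here is a minimal sketch, not OpenAI's actual code: a linear layer in which a fixed binary mask forces most connections to zero, so each neuron keeps only a handful of links. The layer size and the density value are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    """Toy weight-sparse layer: most connections are fixed at zero."""
    def __init__(self, in_features: int, out_features: int, density: float = 0.05):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Fixed binary mask: keep only a `density` fraction of connections (assumed value).
        self.register_buffer("mask", (torch.rand(out_features, in_features) < density).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Masked-out weights contribute exactly zero, so the surviving
        # connections form a much smaller graph that is easier to trace.
        return x @ (self.weight * self.mask).t() + self.bias

layer = SparseLinear(768, 768)
print(f"active connections: {int(layer.mask.sum())} of {layer.mask.numel()}")
```

Because the mask never changes during training, only the surviving connections can carry signal, which is what makes the resulting circuits easier to follow.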
A path toward interpretability
OpenAI says that understanding how models work, and how they arrive at their determinations, is important because these models have real-world implications.
The company defines interpretability as “methods that help us understand why a model produced a given output.” There are several ways to achieve interpretability: chain-of-thought interpretability, which reasoning models often take advantage of, and mechanistic interpretability, which involves reverse-engineering a model’s mathematical structure.
OpenAI focused on improving mechanistic interpretability, which it said has “so far been less immediately useful, but in principle, it could provide a more complete explanation of the model’s behavior.”
According to OpenAI, “By trying to explain model behavior at the most detailed level, mechanistic interpretability can make fewer assumptions and give us more confidence. But the path from low-level description to explanation of complex behavior is much longer and more difficult.”
Better interpretability allows better oversight and provides early warning signals if the model’s behavior no longer aligns with policy.
OpenAI noted that improving mechanistic interpretability is “a very ambitious bet,” but its research on sparse networks has moved it forward.
How to untangle a model
To sort out the mess of connections a model makes, OpenAI first prunes most of them. Since transformer models like GPT-2 have thousands of connections, the team had to “zero out” most of these circuits, so that each neuron talks to only a select few others and the connections become more streamlined.
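One simple way to picture “zeroing out” is keeping only the few strongest incoming weights per neuron and setting the rest to exactly zero. The sketch below is a toy illustration of that idea, not OpenAI's procedure; the top-k-by-magnitude scheme and the value of k are assumptions.

```python
import torch

def zero_out_all_but_topk(weight: torch.Tensor, k: int = 8) -> torch.Tensor:
    # weight has shape (out_features, in_features); keep the k largest-magnitude
    # entries in each row (neuron) and set every other connection to zero.
    topk_idx = weight.abs().topk(k, dim=1).indices
    mask = torch.zeros_like(weight)
    mask.scatter_(1, topk_idx, 1.0)
    return weight * mask

w = torch.randn(4, 16)
w_sparse = zero_out_all_but_topk(w, k=3)
print((w_sparse != 0).sum(dim=1))  # each neuron now talks to only 3 others
```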
Next, the team ran “circuit tracing” on the tasks to create clusters of interpretable circuits. According to OpenAI, the final step involved pruning the models to “obtain the smallest circuit that achieves the target loss on the target distribution.” The team targeted a loss of 0.15 to isolate the exact nodes and weights responsible for the behavior.
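The pruning objective can be sketched as a greedy loop: try removing each edge, and keep the removal only if the loss on the task stays at or below the target. The code below is a toy stand-in for illustration; the tiny linear model, synthetic task, and edge-by-edge strategy are all assumptions, and only the 0.15 target loss comes from the article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the real model and target distribution: a tiny linear
# model on a synthetic regression task where only the first input matters.
torch.manual_seed(0)
x = torch.randn(64, 8)
true_w = torch.zeros(8)
true_w[0] = 1.0
y = x @ true_w

model = nn.Linear(8, 1, bias=False)
with torch.no_grad():
    model.weight.copy_(true_w.unsqueeze(0) + 0.01 * torch.randn(1, 8))

def eval_loss(m: nn.Linear) -> float:
    with torch.no_grad():
        return F.mse_loss(m(x).squeeze(-1), y).item()

def prune_to_target_loss(m: nn.Linear, target_loss: float = 0.15) -> nn.Linear:
    # Greedily zero out each weight (edge); keep the removal only if the
    # task loss stays at or below the target, otherwise restore the edge.
    flat = m.weight.data.view(-1)
    for i in range(flat.numel()):
        original = flat[i].item()
        flat[i] = 0.0
        if eval_loss(m) > target_loss:
            flat[i] = original
    return m

prune_to_target_loss(model)
print("surviving edges:", int((model.weight != 0).sum()))
```

In this toy case, only the single weight that actually drives the task survives, which mirrors the goal of isolating the exact nodes and weights responsible for a behavior.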
The report states, “We show that pruning our weight-sparse model yields approximately 16 times smaller circuits on our tasks than pruning a dense model of comparable pre-training loss. We are also able to construct arbitrarily precise circuits at the expense of more edges. This shows that circuits for simple behavior are significantly more decomposed and localizable in the weight-sparse model than in the dense model.”
Smaller models are easier to interpret
Although OpenAI managed to create sparse models that are easier to understand, these are significantly smaller than most foundation models used by enterprises. While enterprises are increasingly adopting smaller models, frontier models, such as OpenAI's flagship GPT-5.1, would still benefit from better interpretability.
Other model developers are also trying to understand how their AI models think. Anthropic, which has been researching interpretability for some time, recently revealed that it had “hacked” Claude’s brain, and Claude noticed. Meta is also working to explore how reasoning models make their decisions.
As more enterprises turn to AI models to help make consequential decisions for their businesses, and ultimately their customers, research into how models think will provide the clarity many organizations need to place greater trust in them.