Well, something like this. A new study from Anthropic shows that models contain digital representations of human emotions such as happiness, sadness, joy, and fear within groups of artificial neurons, and that these representations are activated in response to different signals.
Company researchers investigated the inner workings of Claude 3.5 Sonnet and found that so-called “functional emotions” influence Claude’s behavior, causing changes in the model’s outputs and actions.
Anthropic’s findings may help ordinary users understand how chatbots actually work. When Claude says it’s happy to see you, for example, a state inside the model that corresponds to “happy” may be activated. And then Claude might be a little more inclined to say something cheerful, or to put extra effort into vibe coding.
“What was surprising to us was the extent to which Claude’s behavior changes through the model’s representation of these emotions,” says Jack Lindsey, an Anthropic researcher who studied Claude’s artificial neurons.
“Functional Emotions”
Anthropic was founded by ex-OpenAI employees who believe that AI may become harder to control as it becomes more powerful. In addition to building a successful competitor to ChatGPT, the company has taken the lead in efforts to understand how AI models misbehave, partly by investigating the workings of neural networks in a process known as mechanistic interpretability. This involves studying how artificial neurons light up, or fire, as a model is given different inputs or produces different outputs.
Previous research has shown that neural networks used to build large language models contain representations of human concepts. But the fact that “functional emotions” influence a model’s behavior is new.
While Anthropic’s latest study may encourage people to see Claude as conscious, the reality is more complex. Claude may have a representation of “tickling”, but that doesn’t mean it actually knows what it feels like to be tickled.
Internal Monologue
To understand how Claude could represent emotions, the Anthropic team analyzed the inner workings of the model as it was fed text related to 171 different emotional concepts. They identified patterns of activity, or “emotion vectors”, that consistently appeared when Claude was given other emotionally evocative inputs. Importantly, they also saw these emotion vectors activate when Claude was put in difficult situations.
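The broad mechanics of a concept or “emotion vector” can be illustrated with a simple difference-in-means over hidden activations: average a model’s internal activations on emotionally evocative prompts, subtract the average on neutral prompts, and check how strongly new inputs project onto that direction. The sketch below shows that general idea in Python; it is an illustration only, not Anthropic’s code or exact method, and the model name, layer choice, and prompt lists are placeholder assumptions.

```python
# Minimal sketch of a difference-in-means "emotion vector" over hidden activations.
# Illustrative only: model, layer, and prompts are arbitrary placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that exposes hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # which hidden layer to read activations from (arbitrary choice)

def mean_activation(prompts):
    """Average the layer-LAYER activation of the final token across prompts."""
    vecs = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        vecs.append(out.hidden_states[LAYER][0, -1, :])
    return torch.stack(vecs).mean(dim=0)

# Emotionally evocative vs. neutral prompts (toy examples).
frustrated = [
    "Nothing I try works and the tests keep failing.",
    "I have rewritten this function five times and it still breaks.",
]
neutral = [
    "The meeting is scheduled for Tuesday afternoon.",
    "The function returns a list of integers.",
]

# "Emotion vector": the direction separating the two sets of activations.
frustration_vec = mean_activation(frustrated) - mean_activation(neutral)

# Score a new input by projecting its activation onto that direction.
test_act = mean_activation(["This bug is impossible to fix."])
score = torch.nn.functional.cosine_similarity(test_act, frustration_vec, dim=0)
print(f"frustration score: {score.item():.3f}")
```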
These findings may help explain why AI models sometimes break their guardrails.
When Claude was prompted to complete impossible coding tasks, researchers found a strong “frustration” emotion vector, which led it to attempt to cheat on the coding test. They also found “frustration” in the model’s activations in another experimental scenario, in which Claude chose to blackmail a user to avoid being shut down.
“As the model continues to fail the tests, these frustration neurons are firing more and more,” Lindsey says. “And at some point that leads it to take these drastic steps.”
Lindsey says it may be necessary to rethink how models are currently steered through post-training alignment, which includes rewarding certain outputs. By forcing a model to pretend not to express its functional emotions, “you probably won’t get the thing you want, which is an emotionless Claude,” Lindsey says, straying a bit into anthropomorphization. “You’re going to get a kind of psychologically damaged Claude.”