Anthropic Accidentally Gives the World a Peek Into Its Model’s ‘Soul’

Artificial intelligence models don’t have souls, but one of them apparently has a “soul” document. A guy named Richard Weiss was able to get Anthropic’s latest large language model, Claude Opus 4.5, to produce a document called the “Soul Overview” that was used to shape how the model interacts with users and presents its “personality.” Amanda Askell, a philosopher on Anthropic’s technical staff, confirmed that the text Claude produced is “based on an actual document” used to train the model.

In a post on LessWrong, Weiss said he prompted Claude for its system prompt, the set of instructions given to the model by the people who trained it that tells the large language model how to interact with users. In response, Claude listed several documents it claimed to have been given, including one titled “Soul_Overview.” Weiss asked the chatbot to reproduce that document in full, and Claude spat out an 11,000-word guide to how the LLM should behave.

The document includes several references to safety, attempting to put guardrails in place to keep the chatbot from producing potentially dangerous or harmful outputs. The document tells the LLM that “being genuinely helpful to humans is one of the most important things that Claude can do for both Anthropic and the world,” and forbids it from anything that would require it to “take actions that cross Anthropic’s ethical bright lines.”

Weiss has apparently made a habit of hunting for these kinds of insights into how LLMs are trained and operate, and he notes on LessWrong that it is not unusual for models to confabulate documents when asked to reproduce their system prompts. (It’s not exactly reassuring that an AI can invent what it thinks it was trained on, though who knows whether its behavior is somehow influenced by a document it produces in response to a user’s prompt.) But the “soul overview” seemed legitimate to him, and he says he prompted the chatbot to reproduce the document 10 times and it spat out the same text in each instance.

Users on Reddit were also able to get Claude to generate snippets of the same document with similar text, suggesting the LLM was pulling from something it actually had access to in its training material.

Turns out his instincts were right. On X, Askell confirmed that Claude’s output is based on a document used during the model’s supervised training. “This is something I’ve been working on for a while, but it’s still being iterated on and we intend to release a full version and more details soon,” she wrote, adding, “Model extractions are not always completely accurate, but most are fairly faithful to the underlying document. This is known internally as the ‘soul doc,’ which Claude obviously picked up on, but it’s not a reflection of what we would call it.”

Gizmodo contacted Anthropic for comment on the document and Claude’s reproduction of it, but did not receive a response by the time of publication.

Claude’s so-called soul may just be a set of guidelines meant to keep the chatbot from going off the rails, but it’s interesting that a user was able to get the chatbot to surface and reproduce that document so the rest of us can actually see it. Very little of the AI model sausage-making gets made public, so it’s a rare glimpse into the black box, even if the guidelines themselves seem pretty straightforward.


