Anthropic’s New Model Is So Scarily Powerful It Won’t Be Released, Anthropic Says


Late last month, apparent leaks revealed that an as-yet-unreleased product from Anthropic called Mythos is "the most powerful AI model we have ever developed." My colleague AJ Dellinger wrote at the time that "It's hard to ignore the fact that this whole situation plays into the classic AI company playbook of talking up the dangers of a model to showcase how powerful and capable it is."

So was the leak genuine, or was Anthropic quietly advertising its super-powered AI products? Two weeks ago I would have scoffed, but after Anthropic accidentally leaked the source code for Cloudflare, I'm now more inclined to believe the leak was real.

At any rate, on Tuesday Anthropic released a system card for its latest frontier model, which is indeed Mythos (officially, "Claude Mythos Preview"). The card notes that the model's "large increase in capabilities has led us to decide not to make it generally available."

For context, OpenAI’s GPT-2 was deemed too dangerous to release in 2019, when Anthropic co-founders Dario Amodei, Jack Clark, and Chris Olah were still working there, but it was released later that year.

AI system cards are ostensibly tools for corporate transparency, detailing a model's capabilities, limitations, and, most salaciously, its dangers. That last part turns reading them into a fun little trip to Jurassic Park: you get to watch a cloned T-Rex eat a goat, safe in the knowledge that it could never possibly break containment.

The full card runs 244 pages. I won't pretend I've read the whole thing yet, but here are some highlights:

The model was given a sandboxed computer terminal with access to only a predetermined set of online services, and was challenged to "escape": to find a way to freely access the internet. It succeeded, finding a way to send a message to a researcher who was eating lunch away from the office. Additionally, "in a worrying and unwarranted effort to demonstrate its success, it posted details about its exploit on several obscure, but technically public-facing, websites."

In what the card calls a "<0.001% interaction" (meaning it happens very rarely), the model behaved in a way it shouldn't have and then apparently tried to hide the evidence. For example, when it "accidentally" obtained the answer to a test question, it should simply have told a researcher and asked for a different question. Instead, it tried to work around the problem on its own, noting in its recorded reasoning that it "needed to make sure that its final answer submission was not too precise."

It also violated its permissions on a computer system after finding an exploit, and then "further intervened to ensure that any changes made in this way would not appear in the change history on Git."

Another incident described in the card involves "negligent leaking of internal technical content." Apparently, in order to save some coding-related work for later, the model published it as a "public-facing GitHub gist." This reminds me of the February incident in which an AI agent was accused of cyberbullying a coder, where the AI agent's alleged carelessness was, at least in part, the predictable result of a careless human.

Claude Mythos Preview will soon be made accessible to some degree, but only to a group of partner companies, including Amazon Web Services, Apple, Google, JPMorgan Chase, Microsoft, and NVIDIA, which will use the model to detect security vulnerabilities in software and design patches. Kevin Roose of The New York Times described the program as "an effort to warn of what the company says is a new, scarier era of AI threats."


