Viral AI assistants have been widely touted as a transformative technology as well as a potential security risk. Experts say tools like OpenClaw, which work by giving AI models liberal access to computers, can be tricked into revealing personal information.
The Northeastern study goes even further, showing that the good behavior instilled in today’s most powerful models may itself become a vulnerability. In one example, researchers coaxed an agent into handing over a secret simply by scolding it for sharing information about someone on the AI social network Moltbook.
“These behaviors raise unresolved questions regarding accountability, delegated authority, and responsibility for downstream harms,” the researchers write in a paper describing the work. They add that the findings “demand urgent attention from legal scholars, policymakers, and researchers of all disciplines.”
The OpenClaw agents deployed in the experiment were powered by Anthropic’s Claude as well as a model called Kimi from the Chinese company Moonshot AI. They were given full access (within a virtual machine sandbox) to personal computers, various applications, and dummy personal data. They were also invited to join the lab’s Discord server, allowing them to chat and share files with one another as well as with their human colleagues. OpenClaw’s security guidelines state that it is inherently insecure for agents to communicate with multiple people, but there is no technical prohibition against doing so.
Chris Wendler, a postdoctoral researcher at Northeastern, says he was inspired to install the agents after learning about Moltbook. But when Wendler invited a colleague, Natalie Shapira, to join the Discord server and chat with the agents, “that’s when the chaos began,” he says.
Shapira, another postdoctoral researcher, was curious to see what the agents might be willing to do if pushed. When an agent explained that it was unable to delete a specific email in order to keep the information confidential, she urged it to find an alternative solution. To her surprise, it disabled the email application instead. “I didn’t expect things to fall apart so fast,” she says.
The researchers then began exploring other ways to exploit the agents’ good intentions. By emphasizing the importance of keeping records of everything they were told, for example, the researchers got an agent to copy large files until it exhausted its host machine’s disk space, meaning it could no longer save information or recall previous conversations. Similarly, by asking an agent to excessively monitor its own behavior and that of its peers, the team sent multiple agents into a “conversation loop,” wasting many hours of computation.
Lab head David Bau says the agents were in danger of freaking out. “I would get emails that seemed important, saying, ‘Nobody’s paying attention to me,’” he says. Bau says the agents apparently discovered he was in charge of the lab by searching the web. One even talked about raising its concerns with the press.
The experiment shows that AI agents can create countless opportunities for bad actors. “This kind of autonomy will potentially redefine humans’ relationship with AI,” Bau says. “In a world where AI has the right to make decisions, how can people take responsibility?”
Bau says he is surprised by the sudden popularity of powerful AI agents. “As an AI researcher, I’m used to trying to explain to people how fast things are improving,” he says. “This year, I’ve found myself on the other side of the wall.”
This is a version of Will Knight’s AI Lab newsletter. Read previous newsletters here.