This AI Agent Is Designed to Not Go Rogue

AI agents like OpenClaw have grown rapidly in popularity recently because they can take the reins of your digital life. Whether you want a personalized morning news digest, a proxy that can battle your cable company’s customer service, or a to-do list auditor that will perform some tasks for you and prompt you to solve the rest, agent assistants are built to access your digital accounts and carry out your instructions. This is helpful – but it has also created a lot of chaos. Bots are mass-deleting emails they were instructed to protect, writing hit pieces over alleged insults, and launching phishing attacks against their own owners.

Given the chaos that has erupted in recent weeks, longtime security engineer and researcher Niels Provos decided to try something new. Today he’s launching an open source, secure AI assistant called IronCurtain that’s designed to add an important layer of control. Instead of the agent interacting directly with the user’s systems and accounts, it runs in a separate virtual machine. And its ability to take any action is mediated by a policy – you can even think of it as a constitution – that the owner writes to control the system. Crucially, IronCurtain is designed to capture these broad policies in plain English and then run them through a multistep process that uses a large language model (LLM) to convert the natural language into an enforceable security policy.

“Services like OpenClaw are at their peak right now, but my hope is that this is an opportunity to say, ‘Okay, maybe we don’t want to do it this way,’” Provos says. “Instead, let’s develop something that still gives you a lot of utility, but isn’t going to go down these completely unknown, sometimes destructive, paths.”

Provos says IronCurtain’s ability to take intuitive, straightforward statements and turn them into enforceable, deterministic – or predictable – red lines is important, because LLMs are famously “stochastic” and probabilistic. In other words, they do not necessarily generate the same content or convey the same information in response to the same prompt every time. This creates challenges for AI guardrails: a system’s interpretation of its control or constraint mechanisms can drift over time, which could result in rogue activity.

Provos says an IronCurtain policy could be as simple as: “The agent can read all my emails. It can send emails to people in my contacts without asking. For anyone else, ask me first. Never permanently delete anything.”
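The article doesn’t show IronCurtain’s compiled policy format, but the plain-English rules above map naturally onto deterministic checks. A minimal, hypothetical sketch (all names, action strings, and the contact list are invented for illustration):

```python
# Hypothetical sketch of an enforceable policy compiled from the natural-
# language rules quoted above. Not IronCurtain's actual implementation.
from dataclasses import dataclass

ALLOW, ASK_OWNER, DENY = "allow", "ask_owner", "deny"

@dataclass(frozen=True)
class Action:
    kind: str            # e.g. "email.read", "email.send", "email.delete"
    recipient: str = ""  # only meaningful for "email.send"

# Invented stand-in for the owner's address book.
CONTACTS = {"alice@example.com", "bob@example.com"}

def evaluate(action: Action) -> str:
    """Deterministic red lines: the same action always yields the same verdict."""
    if action.kind == "email.delete":
        return DENY          # "Never permanently delete anything."
    if action.kind == "email.read":
        return ALLOW         # "The agent can read all my emails."
    if action.kind == "email.send":
        if action.recipient in CONTACTS:
            return ALLOW     # contacts: send "without asking"
        return ASK_OWNER     # "For anyone else, ask me first."
    return ASK_OWNER         # unrecognized actions default to a human check
```

The point of the sketch is the contrast Provos draws: once compiled, the policy is ordinary code, so identical requests always get identical verdicts, regardless of what the LLM does.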

IronCurtain takes these instructions, turns them into an enforceable policy, and then mediates between the agent running in the virtual machine and what is known as a Model Context Protocol (MCP) server, which provides access to data and other digital services so the LLM can complete tasks. Being able to control an agent in this way adds an important component of access control that web platforms like email providers do not currently offer, because they were not built for the scenario where both a human owner and an AI agent are using the same account.
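The mediation pattern described here can be sketched in a few lines: the agent never calls the tool server directly, and every request passes through the policy first. The class names, policy function, and server interface below are invented for illustration, not IronCurtain’s real code:

```python
# Hypothetical sketch of a policy layer mediating agent-to-server calls.
# All names and interfaces are illustrative assumptions.

def policy(tool: str, args: dict) -> str:
    # Stand-in for a compiled policy; deterministic by construction.
    if tool == "email.delete":
        return "deny"
    return "allow"

class MediatedServer:
    """Wraps a tool server so the agent never talks to it directly."""

    def __init__(self, server):
        self.server = server
        self.audit_log = []  # every policy decision is recorded

    def call(self, tool: str, args: dict):
        verdict = policy(tool, args)
        self.audit_log.append((tool, verdict))
        if verdict != "allow":
            raise PermissionError(f"{tool}: {verdict}")
        return self.server.call(tool, args)
```

Because every call funnels through one chokepoint, the wrapper can also keep the kind of audit log of policy decisions that Provos describes.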

Provos notes that IronCurtain is designed to refine and improve each user’s “constitution” over time as the system encounters edge cases and seeks human input about how to proceed. The system, which is model-independent and can be used with any LLM, is also designed to maintain an audit log of all policy decisions over time.

IronCurtain is a research prototype, not a consumer product, and Provos hopes people will contribute to help explore and develop the project. Dino Dai Zovi, a renowned cybersecurity researcher who has been experimenting with early versions of IronCurtain, says that the project’s conceptual approach aligns with his own intuitions about how agentic AI needs to be controlled.
