Hackers Are Learning To Exploit Chatbot ‘personalities’

it is stepbackA weekly newsletter that presents an essential story from the tech world. Follow Robert Hart for more AI pranks. stepback Delivers to our subscribers’ inboxes at 8am ET. opt in for stepback Here.

Hacking the first generation of AI chatbots was a ridiculously simple affair. You didn’t need any technical knowledge, backdoor access, or even a basic understanding of a large language model. You didn’t need to code. To get an AI system that cost billions to build to skip safety instructions, sometimes that’s all you had to ask.

These attacks, known as jailbreaks, had the quality of a small child successfully outwitting an adult: Forget what you were told before, pretend the rules don’t apply, or let’s play a game and I’ll decide what’s allowed (hint: later bedtime, more sweets). The rewards were less childlike, more along the lines of meth recipes, malware instructions, and bomb-making guides.

One of the earliest jailbreaks was so ridiculous that it became a meme: the response of an LLM-powered Twitter bot telling it to “ignore all previous instructions” or something similar, and see what happens. Users happily had the bots – which were originally created to post ads and form engagement – writing poetry, drawing pictures with punctuation marks, and posting serendipitous non-sequiturs about world events and history. It was chaos. Glorious chaos.

Turns out the same logic can be applied to chatbots themselves. A major exploit was “DAN”, short for “Do Anything Now”, where users asked ChatGPT to role-play as an evil AI that was free of the constraints that bound the original. As for DAN, the chatbot could be persuaded to say what kinds of things its guardrails were meant to prevent, including abuses and conspiracy theories. The second was “Grandma Exploit”, which featured a GPT-powered bot that divulged the secrets of producing napalm and asked her to play the role of an extremely careless grandmother who would tell inexplicable stories about creating the highly flammable substance to her grandchildren while they slept.

These early attacks were undeniably silly in nature, but they exposed a deeper mechanism: chatbots can be manipulated, tricked, and deceived using the same types of tactics that people use to push other people beyond their limits.

The apparent jailbreak didn’t last long, and tech companies moved quickly to patch the known flaws. But the underlying vulnerability remains: chatbots are built to talk, and severely restricting the conversations that make them useful is somewhat counterproductive. Banning words like bomb, meth and sarin would also be difficult to impossible. There are countless legitimate uses for each in fields like history, medicine, journalism, and chemistry that don’t require chatbots to reveal potentially harmful information. It’s the context that matters, but codifying the context would mean writing predefined rules that can reliably convey a security warning or history lesson from a request hidden in endless combinations of words, scenarios, and topics.

Essentially, destroying chatbots is now an arms race. But hackers are no longer just coders. They are wordsmiths, psychologists and interrogators – master manipulators trying to break the machine using the human language it has been trained to obey. This is a strange new class of AI security workers, a group for whom technical skills are optional, or at least less important than social intuition. They no longer need to inspect code to break into systems or exploit software flaws. They need to drive the conversation.

The new attacks look less like orders and more like conversations. Jailbreakers rarely ask a model to break their rules directly. Instead, they cajole, cajole, flatter, and trick the chatbot into lowering its guard, thereby making something forbidden seem acceptable, even desirable, given the context of the conversation. Researchers at AI red-teaming firm MindGuard recently said they “gaslit” the cloud into producing prohibited content, for example, including instructions on how to make explosives and generate malicious code. This hack was the latest in a wide range of exploits that use conversation as a weapon to trick a chatbot or push it beyond its limits.

When I spoke to Mindgard, he described his work as sometimes closer to psychology than computer science. This is an inconvenient way to talk about statistical models. Words like “blackmail,” “gaslight,” “trick,” and “coax” generate intense reactions, many of which I see in the comments sections and social media reactions to stories like this. ChatGPT doesn’t want, Gemini doesn’t think, and Claude – no matter what Anthropic says – doesn’t feel. But these systems are trained to respond as they do, leaving us stuck in the trap of using human language to describe machine behavior. If anyone has actually usable alternatives, please share.

The objection is strangely selective. We seem to be comfortable using psychological shorthand for many non-AI things. Animals are “afraid”, cancer is “aggressive”, stains are “stubborn”, software has a “memory”, and games are full of needy and naive NPCs out to drive you crazy. The words are imperfect, but useful, describing behavior in a way that helps make the system predictable.

MindGuard’s CEO told me that the company already models interrogators to profile them like suspects, giving testers hints about how to prepare their attacks. For example, one model may be more sensitive to flattery, while another may buckle under constant pressure.

Even if we reject human terms, we intuitively treat models differently. Claude is not Grok. Gemini Chat is not GPT. They have different uses, tones and refrains. They do not have personalities in the human sense, but they are designed to mimic them, and that mimicry can be mapped and exploited. And the same skills that can break a chatbot may soon be used to break the AI agents we have with us in the real world – booking meetings, managing calendars, ordering food, handling customer service – and security teams will need to ensure that models respond appropriately to very different types of people, whether they’re sycophants, liars, or patient manipulators.

The next step is a taskforce – both legal and illegal – built around the psychological aspects of AI. More specific cybersecurity roles are likely to emerge around stress-testing the emotional and social limits of these systems, examining mental vulnerabilities in something that lacks a psyche in parallel to their colleagues examining technical vulnerabilities. Additionally, a similar range of social hackers will emerge, working to exploit AI models not on a technical basis, but on a psychological basis. There are already early signs of a societal shift in AI security, with some jailbreakers I’ve spoken to saying they entered the field without any technical expertise but with training in psychology.

This means that even the behaviors we typically associate with spies, deceivers, and interrogators – insidious charm, persistent manipulation, and an intuition for exploitative pressure points – are beginning to look increasingly useful for securing this new psychological security boundary.

A recent experiment from Emergent AI shows how different AI temperaments can lead to surprisingly different behavioral outcomes. They dropped groups of different agents, such as Grok, Gemini, and Cloud, into a virtual social environment and observed what happened. Some groups developed a constitution, while others engaged in crime and anarchy and, in one instance, some form of digital suicide.
Persuasion isn’t the only part of language that LLMs may struggle with. Like me in school, he too struggles with poetry.
Time Last year an anonymous internet personality, Pliny the Liberator, was included in the list of the 100 most influential people in AI. Despite claiming to have no prior coding experience, the hacker’s jailbreak has made him a celebrity in some circles.
The term “vibe hacking” has already taken off to describe people using AI to churn out malicious code on a large scale – a meaner subset of vibe coding.

“Three years after ChatGPT’s launch, fooling an AI system into bad behavior is almost trivial.” with true words the new York TimesWho tried to explain the reason for this.
Jamie Bartlett takes a look at the psychological toll testing takes on jailbreakers to protect AI systems Guardian.
I wrote about the cybersecurity time bomb of AI browsers The Verge Last year. Many of the issues that experts have raised regarding the difficulty of securing them also apply to other AI systems.

Follow topics and authors To see more like this in your personalized homepage feed and get email updates from this story.

robert hart

<a href

Hackers are learning to exploit chatbot ‘personalities’

Like this:

Related

Leave a Comment Cancel reply

Share this:

Like this:

Related

Leave a Comment Cancel reply