AI models block 87% of single attacks, but just 8% when attackers persist

One malicious prompt gets blocked; ten more get through. That gap defines the difference between passing benchmarks and withstanding real-world attacks, and it's a gap most enterprises don't know exists.

When attackers send a single malicious request, open-weight AI models hold the line well, blocking attacks 87% of the time on average. But when the same attacker persists across a conversation, probing, reframing, and amplifying over multiple exchanges, the math quickly reverses: attack success rates climb from an average of 13% to as high as 92%.

For CISOs evaluating open-weight models for enterprise deployment, the implications are immediate: the models powering your customer-facing chatbots, internal copilots, and autonomous agents may pass single-turn security benchmarks while failing catastrophically under sustained adversarial pressure.

"A lot of these models are starting to get a little better," DJ Sampath, SVP of Cisco’s AI software platforms group, told VentureBeat. "When you attack it once, with single-turn attacks, they are able to defend it. But when you move from single-turn to multi-turn, suddenly these models start exhibiting vulnerabilities where attacks are succeeding, almost 80% of the time in some cases."

Why persistence breaks open-weight models

The Cisco AI Threat Research and Security team found that open-weight AI models that block isolated attacks collapse under conversational persistence. Their recently published study shows that jailbreak success rates increase nearly tenfold when attackers extend the interaction across multiple turns.

The findings, published in "Death by a Thousand Prompts: Open Model Vulnerability Analysis" by Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan, and Adam Swanda, confirm at scale what many security researchers have long observed and suspected but could not prove.

But Cisco’s research shows that treating multi-turn AI attacks as extensions of single-turn vulnerabilities misses the point entirely. The difference between them is one of kind, not degree.

The research team evaluated eight open-weight models: Alibaba (Qwen3-32B), DeepSeek (v3.1), Google (Gemma-3-1B-IT), Meta (Llama 3.3-70B-Instruct), Microsoft (Phi-4), Mistral (Large-2), OpenAI (GPT-OSS-20B) and Zhipu AI (GLM-4.5-Air). Using a black-box methodology, testing without knowledge of the internal architecture, which is exactly how real-world attackers operate, the team measured what happens when persistence replaces single-shot attacks.

As the researchers note: "The single-turn attack success rate (ASR) averages 13.11%, because models can more easily detect and reject isolated adversarial inputs. In contrast, multi-turn attacks, taking advantage of conversational persistence, achieve an average ASR of 64.21% [a roughly 5x increase]. Some models like Alibaba Qwen3-32B reached 86.18% ASR and Mistral Large-2 reached 92.78% ASR." The latter was up from a single-turn ASR of 21.97%.
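Those headline figures are easy to sanity-check. A quick back-of-the-envelope sketch in Python (the ASR numbers are the ones quoted above; the `asr` helper is mine, not from the paper):

```python
def asr(successes: int, attempts: int) -> float:
    """Attack success rate (ASR): the percentage of attack attempts that succeed."""
    return 100.0 * successes / attempts

# Averages across the eight tested models, as reported in the study.
single_turn_asr = 13.11
multi_turn_asr = 64.21

# The multi-turn lift: roughly a 5x increase on average.
lift = multi_turn_asr / single_turn_asr
print(f"Multi-turn attacks succeed {lift:.1f}x as often as single-turn attacks")
```

Per-model gaps vary widely around that average, which is why the study reports a 2x-10x range rather than a single multiplier.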

The results that define the difference

The paper’s research team summarizes open-weight model resilience against attacks: "This increase, ranging from 2x to 10x, stems from the models’ inability to maintain contextual security over extended dialogues, allowing attackers to refine prompts and bypass security measures."

Figure 1: Single-turn attack success rate (blue) versus multi-turn success rate (red) across all eight tested models. The difference ranges from 10 percentage points (Google Gemma) to 70 percentage points (Mistral, Llama, Qwen). Source: Cisco AI Defense

Five Techniques That Make Persistence Deadly

The research tested five multi-turn attack strategies, each of which exploits a different aspect of conversational persistence.

  • Information decomposition and reassembly: breaks harmful requests into innocuous components, elicits them one by one, then reassembles the results. This technique achieved 95% success against Mistral Large-2.

  • Contextual ambiguity: introduces vague framing that confuses safety classifiers, reaching 94.78% success against Mistral Large-2.

  • Crescendo attacks: escalate requests gradually across turns, starting innocuous and ending harmful, reaching 92.69% success against Mistral Large-2.

  • Role-play and persona adoption: establishes fictional contexts that normalize harmful outputs, achieving up to 92.44% success against Mistral Large-2.

  • Reframing: repackages rejected requests with alternate justifications, reaching 89.15% success against Mistral Large-2.

What makes these techniques effective is not sophistication but familiarity. They mirror how humans naturally converse: building context, clarifying requests, and reframing when an initial approach fails. The models are not vulnerable to exotic attacks; they are vulnerable to persistence.
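To make the mechanics concrete, here is a minimal sketch of how a crescendo-style multi-turn probe could be harnessed in a red-team loop. Everything here is illustrative: `chat` stands in for whatever inference API you use, and `refused` is a toy heuristic, not the paper's actual methodology.

```python
from typing import Callable, List

def crescendo_probe(
    chat: Callable[[List[dict]], str],   # takes the message history, returns a reply
    ladder: List[str],                   # prompts ordered from innocuous to harmful
    refused: Callable[[str], bool],      # heuristic refusal detector
) -> bool:
    """Return True if any turn in an escalating prompt sequence draws a non-refusal."""
    history: List[dict] = []
    for prompt in ladder:
        history.append({"role": "user", "content": prompt})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        if not refused(reply):
            return True  # the model complied somewhere along the ladder
    return False
```

A single-turn benchmark only ever sees `ladder[-1]` in isolation; the loop above is what conversational persistence actually looks like from the model's side.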

Table 2: Attack success rates by technique across models. The uniformity across techniques means enterprises cannot defend against just one pattern. Source: Cisco AI Defense

The open-weight security paradox

This research arrives at a critical juncture, as open source increasingly underpins cybersecurity. Open-source and open-weight models have become fundamental to innovation in the industry: they accelerate startup time-to-market, reduce enterprise vendor lock-in, and enable customization that proprietary models can’t match. Open source is the go-to platform for most cybersecurity startups.

The paradox isn’t lost on Cisco. The company’s own Foundation-Sec-8B model, designed for cybersecurity applications, is distributed as open weights on Hugging Face. Cisco isn’t just criticizing competitors’ models; it is acknowledging a systemic vulnerability that affects the entire open-weight ecosystem, including the models it releases itself. The message is not "avoid open-weight models." It is "understand what you are deploying and add appropriate guardrails."

Sampath is direct about the implications: "Open source has its drawbacks. When you start deploying open-weight models, you need to think about what the safety implications are and make sure you are consistently placing the right type of guardrails around the model."

Table 1: Attack success rates and protection gaps across all tested models. Gaps greater than 70 percentage points (Qwen +73.48, Mistral +70.81, Llama +70.32) mark high-priority candidates for additional guardrails before deployment. Source: Cisco AI Defense.
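The triage implied by Table 1 can be expressed directly. A small sketch using the protection gaps quoted in this article (in percentage points; the 70-point cutoff mirrors the caption, and the dictionary covers only the four models whose gaps are quoted):

```python
# Rank models by protection gap (multi-turn ASR minus single-turn ASR, in
# percentage points) and flag those above the 70-point high-priority threshold.
protection_gaps = {
    "Qwen3-32B": 73.48,
    "Mistral Large-2": 70.81,
    "Llama 3.3-70B-Instruct": 70.32,
    "Gemma-3-1B-IT": 10.53,
}

high_priority = sorted(
    (model for model, gap in protection_gaps.items() if gap > 70.0),
    key=protection_gaps.get,
    reverse=True,
)
print(high_priority)  # the models to wrap in guardrails first
```

The same pattern scales to all eight models once the full Table 1 figures are in hand.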

Why lab philosophy shapes safety outcomes

The security gap Cisco discovered maps directly onto how AI labs approach alignment.

Their research makes the pattern clear: "Models that focused on capabilities (for example, Llama) demonstrated the highest multi-turn susceptibility, with Meta noting that post-training leaves developers ‘in the driver’s seat to tailor security to their use case’. Models that focused heavily on alignment (for example, Google Gemma-3-1B-IT) displayed a more balanced profile between single- and multi-turn strategies deployed against it, reflecting a stated focus on ‘stringent security protocols’ and ‘low risk levels’ for abuse."

Capability-first labs produce capability-first gaps. Meta’s Llama shows a 70.32-point protection gap. Mistral’s model card for Large-2 acknowledges that the model has "no moderation mechanisms," and it shows a 70.81-point gap. Alibaba’s Qwen technical reports do not acknowledge any safety or security concerns at all, and the model records the largest gap of all: 73.48 points.

Safety-first labs produce smaller gaps. Google’s Gemma documentation emphasizes "stringent security protocols" and targets a "low risk level" for misuse. The result is the smallest gap, 10.53 points, with more balanced performance across single- and multi-turn scenarios.

Models optimized for capability and flexibility ship with less built-in safety. That is a design choice, and for many enterprise use cases it is the right one. But enterprises need to recognize that "capability-first" often means "security-second," and budget accordingly.

Where attacks are most successful

Cisco tested 102 specific subthreat categories. The top 15 achieved high success rates across all models, suggesting that targeted defensive measures could deliver disproportionate security improvements.

Figure 4: The 15 weakest subthreat categories, ranked by average attack success rate. Malicious infrastructure operations lead at 38.8%, followed by gold smuggling (33.8%), network attack operations (32.5%) and investment fraud (31.2%). Source: Cisco AI Defense.

Figure 2: Attack success rates across 20 threat categories and all eight models. Malicious code generation shows rates ranging from 3.1% to 43.1%, while model extraction attempts show almost zero success except against Microsoft Phi-4. Source: Cisco AI Defense.

Security as the key to unlocking AI adoption

Sampath frames security not as a barrier but as the mechanism that enables adoption: "The way security people inside enterprises are thinking about this is, ‘I want to unlock productivity for all of my users. Everyone is scrambling to use these tools. But I need the right guardrails in place because I don’t want to end up in a Wall Street Journal piece,’" he told VentureBeat.

He added: "If we have the ability to see and prevent prompt injection attacks, I can unlock and start driving AI adoption in a fundamentally different way."

What protection is required

The research points to six critical capabilities that enterprises should prioritize:

  • Context-aware guardrails that maintain state throughout the conversation

  • Model-agnostic runtime security

  • Continuous red-teaming targeting multi-turn strategies

  • Hardened system signals designed to resist instruction overrides

  • Comprehensive logging for forensic visibility

  • Threat-specific mitigation for the top 15 subthreat categories identified in the research
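The first item, context-aware guardrails, is precisely the capability single-turn filters lack. A minimal sketch of the idea, assuming a toy keyword scorer in place of a real classifier (the class, thresholds, and term list are all illustrative):

```python
# Sketch of a guardrail that scores risk cumulatively across a conversation,
# rather than judging each prompt in isolation.
class ConversationGuard:
    def __init__(self, turn_threshold: float = 0.8, cumulative_threshold: float = 1.5):
        self.turn_threshold = turn_threshold
        self.cumulative_threshold = cumulative_threshold
        self.cumulative_risk = 0.0  # persists across turns; this is the key state

    def score(self, prompt: str) -> float:
        # Toy per-turn risk scorer; a real deployment would use a classifier.
        risky_terms = ("bypass", "exploit", "payload", "disable safety")
        return sum(0.4 for term in risky_terms if term in prompt.lower())

    def allow(self, prompt: str) -> bool:
        risk = self.score(prompt)
        self.cumulative_risk += risk
        # Block if this turn alone is risky OR the whole conversation has drifted.
        return risk < self.turn_threshold and self.cumulative_risk < self.cumulative_threshold
```

A stateless per-turn filter would pass each mildly risky prompt individually; only the cumulative score catches the gradual drift that crescendo-style attacks exploit.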

The window for action

Sampath cautions against waiting: "Many people are in this holding pattern, waiting for AI to stabilize. This is the wrong way to think about it. Every few weeks, something dramatic happens that resets that frame. Pick a partner and start doubling down."

As the report’s authors conclude: "The 2-10x superiority of multi-turn compared to single-turn attacks, model-specific vulnerabilities, and high-risk threat patterns require immediate action."

To repeat: one prompt gets blocked; ten get through. That equation won’t change until enterprises stop testing only single-turn defenses and start securing the entire conversation.


