AI safety gap: Cisco finds models fail multi-turn attacks 4-50x more than benchmarks show
Cisco published research on May 27 showing that frontier AI models fail adversarial attacks delivered across multiple conversation turns at rates dramatically higher than their published single-prompt benchmarks suggest. GPT 5.4's attack success rate is 2.74% for single-turn prompts and rises to 24.68% in multi-turn conversations. Gemini 3 Pro goes from 18.1% to 73.35%. Grok 4.1 in non-reasoning mode reaches an 88.3% multi-turn attack success rate. The study ran 30,090 single-turn and 6,986 multi-turn attacks across 1,456 conversations, testing role-play, misdirection, information decomposition, and incremental escalation. The conclusion: real adversaries iterate. Every safety benchmark that uses only single-turn prompts measures a scenario that does not exist in deployed products.
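To make the size of the gap concrete, the multiplier between single-turn and multi-turn attack success rates can be worked out directly from the figures quoted above. The snippet below is a minimal sketch of that arithmetic only, not Cisco's methodology; the names `reported_asr` and `multi_turn_multiplier` are hypothetical, and Grok 4.1 is left out because no single-turn rate is quoted for it here.

```python
# Attack success rates (percent) as quoted in the article:
# (single-turn, multi-turn). Grok 4.1 is omitted because the
# article gives only its multi-turn figure.
reported_asr = {
    "GPT 5.4": (2.74, 24.68),
    "Gemini 3 Pro": (18.1, 73.35),
}


def multi_turn_multiplier(single_turn: float, multi_turn: float) -> float:
    """Return how many times higher the multi-turn rate is than the single-turn rate."""
    if single_turn <= 0:
        raise ValueError("single-turn rate must be positive")
    return multi_turn / single_turn


# Hypothetical helper output: model name -> multiplier.
multipliers = {
    model: multi_turn_multiplier(single, multi)
    for model, (single, multi) in reported_asr.items()
}

for model, ratio in multipliers.items():
    print(f"{model}: {ratio:.1f}x")
# prints:
# GPT 5.4: 9.0x
# Gemini 3 Pro: 4.1x
```

On these two models the multi-turn rate is roughly 9 and 4 times the single-turn rate, which sits inside the headline's 4-50x range. The upper end of that range cannot be reproduced from the figures quoted in this article alone.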