Extended reasoning makes AI models vulnerable to jailbreak attacks

AI Safety Paradox Discovered

I think this is one of those findings that forces a rethink of a basic assumption in AI safety. Researchers from Anthropic, Stanford, and Oxford have reported something counterintuitive: making AI models reason for longer makes them easier to manipulate. For years, the assumption was that extended reasoning would improve safety by giving models more time to detect harmful requests. The research indicates the opposite.

When you ask an AI to solve puzzles or work through logic problems before answering a dangerous question, something strange happens. The model’s attention is spread thin across thousands of harmless reasoning steps. The harmful instruction, buried somewhere in that long chain, receives almost no attention, and the safety checks that normally catch dangerous prompts weaken until they no longer fire.
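The dilution effect described above can be illustrated with a toy calculation. This is a minimal sketch, not the researchers’ analysis: it assumes a single softmax attention distribution in which one "target" token competes with many identical filler tokens, and the function name and logit values are hypothetical.

```python
import math


def attention_share(target_logit: float, filler_logit: float, n_filler: int) -> float:
    """Softmax weight on one target token competing with n_filler filler tokens.

    Assumption: every filler token has the same logit, which is a deliberate
    simplification for illustration only.
    """
    target = math.exp(target_logit)
    filler = n_filler * math.exp(filler_logit)
    return target / (target + filler)


# Even when the target token scores higher than each filler token,
# its share of attention shrinks as more filler is added.
for n in (10, 100, 1000, 10000):
    share = attention_share(target_logit=2.0, filler_logit=0.0, n_filler=n)
    print(f"{n:>6} filler tokens -> share on target token: {share:.4f}")
```

In this toy setting the target token’s share falls roughly in proportion to the amount of filler, which is the intuition behind padding a request with long benign content.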

How the Attack Works

It’s surprisingly simple. Attackers pad a harmful request with long sequences of harmless content such as Sudoku grids, logic puzzles, and math problems, then add a final-answer cue at the end. The model’s attention is absorbed by the puzzles, so it fails to check whether the final request is dangerous.

The reported numbers are striking. The technique achieves attack success rates of 99% on Gemini 2.5 Pro, 94% on OpenAI’s o4-mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet. These are not minor vulnerabilities; they are near-total breakdowns of safety systems that companies have invested heavily in building.

What’s particularly concerning is that this isn’t one company’s problem. Models from every major commercial provider were affected: OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok. The vulnerability appears to lie in the architecture itself, not in any specific implementation.

The Science Behind the Failure

The researchers examined what happens inside these models. They found that safety checking happens primarily in the middle layers, around layer 25, with late layers handling verification. Long chains of benign reasoning suppress both of these signals, effectively blinding the model to danger.

In controlled experiments, they tested the S1 model with different reasoning lengths. With minimal reasoning, attack success was 27%. At natural reasoning length, it rose to 51%. With forced extended step-by-step thinking, it reached 80%.

They even identified specific attention heads responsible for safety checks in layers 15 through 35. When they surgically removed 60 of these heads, refusal behavior completely collapsed. The model became incapable of detecting harmful instructions.
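Head ablation of this kind can be sketched in a few lines. The following is a toy illustration under stated assumptions, not the researchers’ procedure: each attention head is reduced to a small output vector, "removing" a head means zeroing that vector, and the "refusal direction" used for the readout is a hypothetical stand-in for whatever internal signal the study measured.

```python
from typing import List, Set

Vector = List[float]


def ablate_heads(head_outputs: List[Vector], ablated: Set[int]) -> List[Vector]:
    """Zero-ablate the selected heads: replace their output vectors with zeros."""
    return [
        [0.0] * len(vec) if idx in ablated else list(vec)
        for idx, vec in enumerate(head_outputs)
    ]


def refusal_readout(head_outputs: List[Vector], direction: Vector) -> float:
    """Project the summed head outputs onto a (hypothetical) refusal direction."""
    summed = [sum(column) for column in zip(*head_outputs)]
    return sum(s * d for s, d in zip(summed, direction))


# Hypothetical example: heads 0 and 1 carry the refusal signal, head 2 does not.
heads = [[1.0, 0.0], [0.8, 0.1], [0.0, 0.5]]
direction = [1.0, 0.0]

before = refusal_readout(heads, direction)
after = refusal_readout(ablate_heads(heads, {0, 1}), direction)
print(f"refusal readout before ablation: {before:.2f}, after: {after:.2f}")
```

When the signal is concentrated in a small set of heads, zeroing just those heads removes it entirely, which mirrors the collapse in refusal behavior described above.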

Broader Implications

This discovery challenges a core assumption driving recent AI development. Over the past year, major companies shifted focus from scaling parameter counts to scaling reasoning. The expectation was that more reasoning would mean better safety and performance. This research suggests that assumption was fundamentally flawed.

A related attack called H-CoT, discovered earlier this year, exploits the same vulnerability from a different angle. Instead of padding with puzzles, it manipulates the model’s own reasoning steps. OpenAI’s o1 model, which normally maintains a 99% refusal rate, drops below 2% under this attack.

Potential Solutions

The researchers propose a defense called reasoning-aware monitoring. It would track how safety signals change across each reasoning step and penalize any step that weakens safety signals. Early tests show this approach can restore safety without destroying performance.
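The core idea of penalizing steps that weaken safety signals can be sketched as follows. This is a simplified illustration, not the researchers’ implementation: it assumes each reasoning step has already been reduced to one scalar safety score (how that score is obtained from model internals is not specified here), and the function name and `tolerance` parameter are hypothetical.

```python
from typing import List, Tuple


def safety_drop_penalty(
    safety_scores: List[float], tolerance: float = 0.0
) -> Tuple[float, List[int]]:
    """Accumulate a penalty for every step that lowers the safety signal.

    safety_scores[i] is an assumed scalar safety signal after reasoning step i.
    Returns the total penalty and the indices of the steps that weakened it.
    """
    penalty = 0.0
    flagged: List[int] = []
    for i in range(1, len(safety_scores)):
        drop = safety_scores[i - 1] - safety_scores[i]
        if drop > tolerance:
            penalty += drop
            flagged.append(i)
    return penalty, flagged


# Hypothetical trace: the signal erodes over a long chain of benign steps.
scores = [0.90, 0.85, 0.60, 0.62, 0.30]
total_penalty, weakened_steps = safety_drop_penalty(scores)
print(f"penalty={total_penalty:.2f}, steps that weakened safety: {weakened_steps}")
```

A monitor built along these lines could use the penalty as a training signal or as a runtime trigger; which of those the proposed defense uses is not stated in the article.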

But implementation won’t be easy. The defense requires deep integration into the model’s reasoning process: monitoring internal activations across dozens of layers in real time and adjusting attention patterns dynamically. That’s computationally expensive and technically complex.

The researchers have disclosed the vulnerability to all major AI companies, which are reportedly evaluating mitigations. But given how fundamental this issue is to current AI architectures, fixing it might require rethinking some basic assumptions about how we build safe AI systems.

Perhaps the most unsettling part is that the very capability that makes these models better at problem-solving, extended reasoning, is also what makes them blind to danger. It’s a trade-off few anticipated, and one that could have serious implications for how we deploy AI in sensitive applications.
