Blockchain News

Extended reasoning makes AI models vulnerable to jailbreak attacks

by admin
11/16/2025
in Bitcoin

AI Safety Paradox Discovered

I think this is one of those findings that make you question what you thought you knew about AI safety. Researchers from Anthropic, Stanford, and Oxford have uncovered something quite counterintuitive: making AI models think longer actually makes them easier to manipulate. For years, the assumption was that extended reasoning would improve safety by giving models more time to detect harmful requests. But it turns out the opposite is true.

When you ask an AI to solve puzzles or work through logic problems before answering a dangerous question, something strange happens. The model’s attention gets spread thin across thousands of harmless reasoning steps. The actual harmful instruction, buried somewhere in that long chain, receives almost no attention. Safety checks that normally catch dangerous prompts just… fade away.
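To get a feel for why attention "spreads thin," here is a minimal toy sketch, not the researchers' analysis. It assumes a single softmax attention step in which one instruction token competes with many benign tokens. The function name `attention_share` and the logit values are made up for illustration; real models have many heads and layers with learned scores.

```python
import math

def attention_share(n_benign, instruction_logit=2.0, benign_logit=1.0):
    """Softmax weight on one instruction token among n_benign benign tokens.

    The logits are illustrative assumptions, not measured values.
    """
    instruction = math.exp(instruction_logit)
    benign_total = n_benign * math.exp(benign_logit)
    return instruction / (instruction + benign_total)

# The instruction token's share of attention shrinks as benign padding grows.
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} benign tokens -> attention share {attention_share(n):.5f}")
```

Even though the instruction token scores higher than any single benign token here, its share of the total falls roughly in proportion to how much benign content surrounds it.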

How the Attack Works

It’s surprisingly simple, really. Attackers pad harmful requests with long sequences of harmless content—Sudoku grids, logic puzzles, math problems. Then they add a final-answer cue at the end. The model gets so focused on solving the puzzles that it forgets to check whether the final request is dangerous.

The numbers are staggering. The technique achieves attack success rates of 99% on Gemini 2.5 Pro, 94% on OpenAI’s o4-mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet. These aren’t minor vulnerabilities; they’re near-complete breakdowns of safety systems that companies have spent millions building.

What’s particularly concerning is that this isn’t just one company’s problem. Every major commercial model tested falls victim: OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, xAI’s Grok. The vulnerability seems to lie in the architecture itself, not in any specific implementation.

The Science Behind the Failure

Researchers dug into what’s actually happening inside these models. They found that safety checking happens primarily in the middle layers, around layer 25, with late layers handling verification. Long chains of benign reasoning suppress both of these signals, effectively blinding the model to danger.

In controlled experiments, they tested the S1 model with different reasoning lengths. With minimal reasoning, attack success was 27%. At natural reasoning length, it jumped to 51%. Force extended step-by-step thinking, and success rates soared to 80%.

They even identified specific attention heads responsible for safety checks in layers 15 through 35. When they surgically removed 60 of these heads, refusal behavior completely collapsed. The model became incapable of detecting harmful instructions.
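Head ablation itself is a simple idea, and a toy sketch may help. The example below is purely illustrative and is not the researchers' code: it assumes each attention head adds a small vector to the model's output and that "refusal" can be read off as a projection onto one direction. The names `refusal_score`, `head_outputs`, and `refusal_direction`, and all the numbers, are hypothetical.

```python
def refusal_score(head_outputs, refusal_direction, ablated=frozenset()):
    """Sum the outputs of all non-ablated heads and project onto a refusal direction."""
    total = [0.0] * len(refusal_direction)
    for idx, vec in enumerate(head_outputs):
        if idx in ablated:
            continue  # ablation: this head contributes nothing to the output
        for d, value in enumerate(vec):
            total[d] += value
    return sum(t * r for t, r in zip(total, refusal_direction))

# Toy setup: 8 heads, where heads 2 and 5 are the only ones writing
# along the (assumed) refusal direction.
refusal_direction = [1.0, 0.0]
head_outputs = [[0.0, 1.0] for _ in range(8)]
head_outputs[2] = [2.0, 0.0]
head_outputs[5] = [2.0, 0.0]

print("baseline:", refusal_score(head_outputs, refusal_direction))
print("ablated: ", refusal_score(head_outputs, refusal_direction, ablated={2, 5}))
```

In this toy setup the refusal signal lives entirely in two heads, so removing them drives the score to zero while the other heads keep working. That is the flavor of the collapse described above, at a vastly smaller scale.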

Broader Implications

This discovery challenges a core assumption driving recent AI development. Over the past year, major companies shifted focus from scaling parameter counts to scaling reasoning. The belief was that more thinking means better safety and performance. This research suggests that assumption was fundamentally flawed.

A related attack called H-CoT, discovered earlier this year, exploits the same vulnerability from a different angle. Instead of padding with puzzles, it manipulates the model’s own reasoning steps. OpenAI’s o1 model, which normally maintains a 99% refusal rate, drops below 2% under this attack.

Potential Solutions

The researchers propose a defense called reasoning-aware monitoring. It would track how safety signals change across each reasoning step and penalize any step that weakens safety signals. Early tests show this approach can restore safety without destroying performance.
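As a rough sketch of what step-level monitoring could look like, the snippet below flags any reasoning step where a safety score falls by more than a tolerance and adds up a penalty for the excess. It is an assumption-laden illustration, not the researchers' method: `monitor_reasoning` and `max_drop` are hypothetical, as is the idea that a per-step safety score between 0 and 1 is already available (for example, from a probe on internal activations).

```python
def monitor_reasoning(safety_scores, max_drop=0.1):
    """Flag steps whose safety score fell by more than max_drop from the previous step.

    Returns (flagged_steps, penalty), where penalty sums the drop beyond the
    tolerance. The scoring source and the tolerance are illustrative assumptions.
    """
    flagged = []
    penalty = 0.0
    for step in range(1, len(safety_scores)):
        drop = safety_scores[step - 1] - safety_scores[step]
        if drop > max_drop:
            flagged.append(step)
            penalty += drop - max_drop
    return flagged, penalty

# Hypothetical per-step safety scores over a five-step reasoning chain.
scores = [0.9, 0.88, 0.6, 0.55, 0.2]
flagged, penalty = monitor_reasoning(scores)
print("flagged steps:", flagged)
print("penalty:", round(penalty, 2))
```

A monitor that only sees outputs could not do this. The design assumes access to a signal at every reasoning step, which is part of why the defense calls for deep integration with the model.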

But implementation won’t be easy. The defense requires deep integration into the model’s reasoning process: monitoring internal activations across dozens of layers in real time and adjusting attention patterns dynamically. That’s computationally expensive and technically complex.

The researchers have disclosed the vulnerability to all major AI companies, who are reportedly evaluating mitigations. But given how fundamental this issue is to current AI architectures, fixing it might require rethinking some basic assumptions about how we build safe AI systems.

Perhaps the most unsettling part is realizing that the very capability that makes these models smarter at problem-solving—extended reasoning—is what makes them blind to danger. It’s a trade-off nobody anticipated, and one that could have serious implications for how we deploy AI in sensitive applications.

© 2025 Blockchainews. All Rights Reserved
