Anthropic Reverses Secret Claude Safeguard After AI Community Backlash

In brief

  • Anthropic deployed a hidden safeguard in Claude Fable 5 that silently degraded responses for users it suspected of building competing AI models
  • Company reversed course after backlash, pledging visible safeguards and routing flagged requests to Claude Opus 4.8 instead
  • Visible safeguards will increase false positives; Anthropic offered no timeline for reducing detection errors

The Hidden Safeguard

Claude Fable 5 had visible safeguards for cybersecurity and biology research that rerouted requests to Opus 4.8 with a notification. The LLM-development safeguard worked differently: the model silently altered its own behavior (through prompt modification, steering vectors, or parameter tweaks) and returned degraded results without warning.

SemiAnalysis reported that Anthropic's model would not help if it detected you were working on pretraining AI systems, building distributed training infrastructure, or designing machine learning chips. Researchers using Fable 5 for legitimate machine learning work had no way to know their results were contaminated by the secret safeguard.

Why Anthropic Reversed Course

The AI community erupted. Anthropic acknowledged the tradeoff was wrong.

The company said it chose invisible safeguards because they allow narrower targeting with fewer false positives, a choice it now regrets. "Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives," the company said via X. "We went with invisible safeguards for this reason—and that was the wrong tradeoff. You should have visibility into the safeguards we have in place, and why."

The Catch: More False Positives Ahead

Anthropic acknowledged that making safeguards visible makes them easier to bypass, which means the classifier has to cast a wider net to remain effective. More false positives—legitimate machine-learning work that gets caught and rerouted—are coming while the company tunes its systems. Anthropic is working to reduce them as fast as possible but offered no timeline.
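
The wider-net tradeoff can be illustrated with a toy example. This is a minimal sketch, not a description of Anthropic's classifier, whose workings the company has not detailed. It assumes a classifier that assigns each request a score and flags anything at or above a threshold. The scores, labels, thresholds, and the count_outcomes helper are all invented for illustration.

```python
# Toy illustration only: these scores and labels are invented, not Anthropic data.
# Each pair is (classifier_score, is_actually_targeted_use).
SAMPLES = [
    (0.95, True),
    (0.80, True),
    (0.70, False),
    (0.60, True),
    (0.50, False),
    (0.40, False),
    (0.20, False),
]


def count_outcomes(threshold):
    """Return (caught, false_positives) when flagging scores >= threshold."""
    caught = sum(
        1 for score, targeted in SAMPLES if score >= threshold and targeted
    )
    false_positives = sum(
        1 for score, targeted in SAMPLES if score >= threshold and not targeted
    )
    return caught, false_positives


narrow = count_outcomes(0.75)  # strict threshold: a narrow net
wide = count_outcomes(0.45)    # looser threshold: a wider net
print("narrow net (caught, false positives):", narrow)
print("wide net (caught, false positives):", wide)
```

In this made-up data, lowering the threshold catches one more targeted request but also flags two legitimate ones, which is the kind of cost the company says users should expect while it tunes its systems.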

The company is also applying the same cleanup to its biology and cybersecurity classifiers, which had drawn complaints about flagging harmless research prompts. The shift trades precision for transparency, betting that researchers prefer knowing when they've hit a safeguard over wondering whether their results are secretly degraded.