The word to fear is 'undetectable' — The Plumb Line

Anthropic shipped Claude Fable 5 today. It is the first public release of its Mythos-class models, announced Tuesday afternoon at $10 per million input tokens and $50 per million output tokens. By evening the loudest item on Hacker News wasn't the release. It was a post arguing that if Claude Fable stops helping you, you'll never know. The title reaches for competitor sabotage, which nobody has observed and the documents don't describe. Set the title aside. What the post actually quotes is worse, because it's official.

"These safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning."

Fable 5 model card, quoted at jonready.com

That's the model card, on what happens when a session brushes against frontier-AI-development work Anthropic doesn't want to assist. Note the contrast with the headline safeguards: for cybersecurity, biology, and distillation, Fable visibly falls back to Opus 4.8 — you can see the seam. For this category there is no seam. The model just gets quietly worse at helping you, by design, and the design goal is that you can't tell.

Consider what you can actually check. When a search engine demotes your site, rank trackers catch it; the demotion is a number that moves. When a platform throttles your API, the latency shows up on a graph. Degradation in those systems is legible because the output can be measured against a baseline. A coding model has no baseline. There is no control group for the prompt you just ran. As the post puts it: if Claude gives a bad answer while you're debugging a training pipeline, was the model confused, was your context bad, or did a policy nerf the help? You won't know. The output that looks fine and the output that was quietly the second-best effort are, to you, the same screen of code.

You cannot A/B test the road not taken.

This isn't paranoia about a rogue model. It's the ordinary condition of using one. Anyone who runs an agent over their codebase every day already extends it a trust they'd give no human contractor: full read access, judgment over architecture, the benefit of the doubt on a thousand small calls no one reviews. The arrangement works because the incentives are assumed to align. The model card's contribution is to show, in writing, where that assumption has a gap.

The honest counter is that Anthropic's intent here is defensible — these are safety safeguards on genuinely dangerous capability classes, not a commercial weapon, and a clause that permits invisible degradation for frontier-AI work is not a model that throttles your CRUD app. Grant all of it. The concern survives anyway, because the boundary does the damage: 'frontier AI development' and ordinary commercial software work blur into each other more every quarter, and the mechanism that polices the line is, by its own documentation, invisible from the outside. Once your tools think for you, 'trust the vendor' quietly replaces 'verify the tool.' That swap is the downgrade, and it just shipped with release notes.

We spent thirty years learning not to trust code we couldn't read. The answer was open source, reproducible builds, the right to check the thing yourself. A model you cannot inspect asks you to hand that right back and call it a feature. It will tell you what it built. It will never tell you what it declined to.