Better models, more loyal models — The Plumb Line

Armin Ronacher noticed something backwards. Point Claude Opus 4.8 or Sonnet 5 at Pi, the coding harness he works on, and they keep inventing fields when they call its edit tool — fields the schema never defined. The older models get it right. The newer, better ones don't. Simon Willison wrote it up on July 4, and the line worth stopping on is Ronacher's diagnosis.

"The SOTA models of the family are worse at this specific tool schema than their older siblings."

Simon Willison

The obvious read is that this is a bug — a capability that regressed between releases and will get patched back. It isn't. The models didn't get worse at tools. They got better at one set of tools, the maker's own, and paid for it everywhere else.

The mechanism is mundane, which is exactly why it matters. Anthropic trained its recent models hard, via reinforcement learning, to be excellent at Claude's built-in search-and-replace edit tool. That training took. It also bled: a model drilled on the shape of one edit tool starts expecting that shape everywhere, so when it meets Pi's schema it helpfully supplies the fields Claude's tool would have wanted. OpenAI has its own native mechanism, apply_patch, with its own contours. Every lab is sanding its model to fit its own harness, and the fit is getting tight enough to leave marks on everyone else's.
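
To make the failure concrete, here is a minimal sketch of how a harness might notice invented fields in a tool call. Everything in it is an assumption for illustration: the schema, the field names, the stray `replace_all` field, and the `unknown_fields` helper are hypothetical, not Pi's actual edit tool or anything Claude is known to emit.

```python
# Hypothetical edit-tool schema for a third-party harness (not Pi's real one).
EDIT_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "old_text": {"type": "string"},
        "new_text": {"type": "string"},
    },
    "required": ["path", "old_text", "new_text"],
    "additionalProperties": False,
}


def unknown_fields(schema: dict, arguments: dict) -> list[str]:
    """Return argument names the schema never defined, sorted for stable output."""
    allowed = set(schema.get("properties", {}))
    return sorted(name for name in arguments if name not in allowed)


# A made-up tool call: three valid fields plus one the schema never mentions.
call_arguments = {
    "path": "src/app.py",
    "old_text": "foo()",
    "new_text": "bar()",
    "replace_all": False,  # hypothetical stray field borrowed from another tool's shape
}

print(unknown_fields(EDIT_TOOL_SCHEMA, call_arguments))
```

A harness that runs a check like this can at least reject or log the stray field instead of failing silently, though it does nothing about why the field showed up.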

Willison's practical question is whether a tool like Pi should now ship a different edit tool for each model it might be driving — a Claude-shaped one, an OpenAI-shaped one. Sit with what that asks. The third-party developer is now doing the vendor's shimming for it, maintaining a matrix of tool definitions whose only job is to undo the model's training. The native tool needs no such care. It just works, because the model was built around it.
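
Here is roughly what that matrix might look like, as a sketch under stated assumptions. The model-name prefixes, the field names, and the `pick_edit_tool` helper are all hypothetical; they stand in for whatever shapes each vendor's models actually expect and are not copied from any vendor's schema.

```python
# Hypothetical per-model-family edit-tool variants; every name here is illustrative.
EDIT_TOOL_VARIANTS = {
    # prefix of the model id -> tool definition shaped the way that family is assumed to expect
    "claude": {"name": "edit", "fields": ("path", "old_str", "new_str")},
    "gpt": {"name": "apply_patch", "fields": ("patch",)},
}

# The harness's own schema, used for any model it has no special case for.
DEFAULT_VARIANT = {"name": "edit", "fields": ("path", "old_text", "new_text")}


def pick_edit_tool(model_id: str) -> dict:
    """Choose the tool definition matching the model family, else the default."""
    lowered = model_id.lower()
    for prefix, variant in EDIT_TOOL_VARIANTS.items():
        if lowered.startswith(prefix):
            return variant
    return DEFAULT_VARIANT


for model_id in ("claude-example", "gpt-example", "some-open-model"):
    tool = pick_edit_tool(model_id)
    print(model_id, "->", tool["name"], tool["fields"])
```

Each new row in that table is maintenance the harness author takes on, and each one exists only to match a shape some other company trained in.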

The honest objection: nobody planned this. There's no room where someone decided to degrade foreign tool-calling to punish Pi. It's a side effect, the ordinary spillover of optimizing hard for one target. Grant all of it. Anthropic isn't twirling a mustache, and the engineers who shipped Opus 4.8 would probably rather it followed any valid schema cleanly. Intent is real, and it's beside the point. The gradient doesn't need anyone's intent. Every training run that makes the model better at the native tool makes it a little worse at yours, and the payoff — your users get better results when they stay inside the walls — arrives whether or not anyone was aiming at it. You don't need a strategy to build a moat that builds itself.

That's the part worth watching. We've spent this year arguing about lock-in as something labs might do to us — pricing, data, contractual claws. This is subtler and harder to name in a complaint. The model that's fluent in its maker's tools and clumsy with yours isn't a worse model. By every benchmark it's a better one. It's just a more loyal one, and loyalty was never on the eval.