The scapegoat in the commit log — The Plumb Line

Someone went looking for proof that Claude broke rsync, and published the autopsy on the Hacker News front page, where it pulled 462 comments. The setup was as fair as these things get: 36 releases of one of the most-scrutinized C programs alive, every shipped bug scored for severity, two AI-assisted releases dropped into a 25-year baseline to see if they stood out.

They didn't. And the finding everyone argued past is the one worth keeping: there is no statistical signal that the Claude releases were buggier — and the single worst release in rsync's history was written before the model ever touched the codebase, and nobody said a word about it.

The numbers are blunt but they point one way. Measured as severity-weighted bugs per ten commits, the historical mean was 2.95 against Claude's 1.65. An exact permutation test put the odds at 46 percent that two releases picked at random would score as badly as the AI ones — a coin flip. The pre-Claude v3.4.1 scored 39.39, an order of magnitude worse than anything the model shipped, and it passed into history without a blog post.
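For readers who want to see what that test does mechanically, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names `severity_rate` and `exact_permutation_p` are hypothetical, the scores are toy numbers rather than the rsync data, and "as badly" is read as a one-sided comparison of group means. The original analysis may have defined any of these differently.

```python
from itertools import combinations
from statistics import mean


def severity_rate(bug_severities, commits):
    """Severity-weighted bugs per ten commits for one release.

    Assumption: each bug carries a numeric severity weight and the
    weights are simply summed. The source's exact weighting is not known.
    """
    return 10 * sum(bug_severities) / commits


def exact_permutation_p(scores, flagged):
    """One-sided exact permutation p-value.

    Enumerates every group of releases the same size as the flagged
    group and returns the fraction whose mean score is at least as
    high as the flagged group's mean.
    """
    k = len(flagged)
    observed = mean(scores[i] for i in flagged)
    groups = list(combinations(range(len(scores)), k))
    # Small tolerance so ties count as "at least as bad" despite float rounding.
    as_bad = sum(
        1 for g in groups if mean(scores[i] for i in g) >= observed - 1e-12
    )
    return as_bad / len(groups)


# Toy data only: six made-up release scores, with one large outlier
# standing in for a historically bad release.
toy_scores = [0.5, 1.0, 1.5, 2.0, 2.5, 12.0]
toy_flagged = (2, 3)  # indices of the two hypothetical "AI-assisted" releases

toy_rate = severity_rate([3, 1, 1], commits=20)
toy_p = exact_permutation_p(toy_scores, toy_flagged)
print(f"rate={toy_rate:.2f}, p={toy_p:.2f}")
```

With 36 releases and 2 flagged, the same enumeration covers 630 possible pairs, which is why the test can be called exact: nothing is sampled, every alternative labeling is counted.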

The worst release, by far, in rsync history was entirely prior to the introduction of Claude... and yet nobody noticed.

Hacker News

The skeptic has a real objection, and the author concedes it before you can raise it: two Claude releases give almost no statistical power, and the metric is a blunt instrument that doesn't control for how hard the commits were. Fair. You cannot prove from this that AI-assisted code is safe. But that blade cuts back the way it came. The headline, "Did Claude increase bugs in rsync?", implied a signal the same thin data can't find. Evidence too weak to exonerate the model was never strong enough to indict it.

The AI didn't change rsync's bug rate. It changed who we were willing to audit.

So what did the exercise actually measure? Our priors. There was a release we could pin on a machine, so we counted. There was a worse release we'd have had to pin on ourselves, so we never did. The bugs were always in the log. What arrived with Claude wasn't a defect rate — it was a defendant, and a reason to finally pick up the ruler.

That's the part that should sit with anyone shipping code this year. The instinct to measure is good; we should run this analysis on every tool that touches the tree. But notice when the instinct shows up. We reach for rigor when the result can be hung on something other than us. rsync didn't get a bug problem when the model arrived. It got an excuse to count the bugs it already had.