A coin flip in a lab coat — The Plumb Line

A developer ran his own résumé through HackerRank's newly viral open-source applicant tracker and scored 90 out of 100. He cleaned up some leftover debug prints, ran it again, and got a 74 — same résumé, same command, the only change being code that never touched the scoring. So he disabled development mode and looped it a hundred times. The scores ran from 66 to 99.

Read that range as what it is. If a company sets its cutoff at 85, a normal place to put it, the same résumé fails 65 percent of the time. Not a borderline candidate getting borderline results. One fixed document, graded by a system that returns a different verdict depending on nothing. That is not a strict filter or a lenient one. It is a coin flip in a lab coat, and it now sits at the front of the hiring pipeline at every company that bought the idea that a number means rigor.
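The arithmetic behind a figure like that is short enough to write down. What follows is a minimal sketch, not the developer's script: the function name `fail_rate` is made up for illustration, and the four scores are placeholders drawn from numbers quoted above, not his hundred runs.

```python
from typing import Sequence


def fail_rate(scores: Sequence[float], cutoff: float) -> float:
    """Fraction of repeated runs on ONE fixed résumé that land below the cutoff."""
    if not scores:
        raise ValueError("need at least one score")
    below = sum(1 for s in scores if s < cutoff)
    return below / len(scores)


# Placeholder values for illustration only; NOT the developer's actual data.
example_runs = [66, 74, 90, 99]
print(fail_rate(example_runs, cutoff=85))  # 0.5 for these placeholder values
```

Fed the real hundred scores, the same function would return the share of runs that miss the bar. The point is that the input never changes; only the output does.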

The useful part of his write-up isn't the headline variance; it's where the variance went. He split the score by category. Technical skills came back 8 out of 10 in 98 of 100 runs — rock steady — because skills are a checklist: you know React or you don't, and there's nothing to judge. Experience was steadier still. 25 out of 25, every single run. A principal engineer with a decade of distributed systems gets 25. A new grad with one internship gets 25. He gets 25. The prompt grading experience is two lines long, with no rubric and no anchors, so it can't tell them apart — and its reward for measuring nothing is perfect consistency. Projects, the one category with a real rubric, was the noisiest of all.
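Splitting the noise by category is a few lines of bookkeeping. This is a sketch under assumptions: it supposes each run can be recorded as a dictionary of per-category scores, and the category names, the helper `category_spread`, and the sample values are all hypothetical rather than taken from the tool or the write-up.

```python
from statistics import pstdev
from typing import Dict, List, Tuple


def category_spread(runs: List[Dict[str, float]]) -> Dict[str, Tuple[float, float, float]]:
    """For each category, return (min, max, population std dev) across runs."""
    if not runs:
        raise ValueError("need at least one run")
    spread = {}
    for category in runs[0]:
        values = [run[category] for run in runs]
        spread[category] = (min(values), max(values), pstdev(values))
    return spread


# Hypothetical per-run breakdowns, for illustration only.
sample_runs = [
    {"skills": 8, "experience": 25, "projects": 12},
    {"skills": 8, "experience": 25, "projects": 27},
    {"skills": 8, "experience": 25, "projects": 19},
]
for name, (low, high, sd) in category_spread(sample_runs).items():
    print(f"{name}: min={low} max={high} sd={sd:.2f}")
```

A standard deviation of zero looks like reliability in a table like this. It only means the category returned the same number every time, which is also what a constant does.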

That's the whole machine in one breath. Where it's consistent, it's blind; where it tries to discriminate, it's random. No temperature setting buys you both — he dropped to temperature 0, and a GitHub issue from October shows the same résumé scoring 27, 34, 32, 34, 34, 30 on six straight runs. He's right to call this a design flaw rather than a bug.

This non-determinism isn't a bug you can just fine-tune away, it's a fundamental design flaw.

Dan Unparsed

A bug you fix. This is the tool doing exactly what it is: handing a four-billion-parameter model a stranger's project history and asking it to put a number on a judgment it cannot actually make, then writing down whatever came back.

The fair objection is that humans are no better. A tired recruiter at the bottom of a 200-résumé stack misreads yours too, and at least the machine is fast and cheap. True. But a human's bad afternoon is uncorrelated noise — a different reader, a different day, and you get a fresh look. A deployed model is correlated. If its weights happen to score your kind of work low, you don't fail once; you fail the same way at every company running that screen, with no bad day to blame and no second reader behind it. The variance doesn't average out across the market. It compounds. And the candidate never sees the number, can't contest it, and has no way to know a re-run would have cleared him.
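The difference between the two kinds of error can be put in numbers with a deliberately crude toy model. Everything in it is an assumption made for illustration: a candidate applies to `n` companies, and `q` is the chance that any one screen wrongly rejects him. With independent readers, each company draws its own mistake. With one shared model that is systematically biased against his kind of work, a single draw decides every application at once.

```python
def p_rejected_everywhere_independent(q: float, n: int) -> float:
    """Independent readers: every one of n screens must err separately."""
    if not 0.0 <= q <= 1.0 or n < 1:
        raise ValueError("q must be in [0, 1] and n must be >= 1")
    return q ** n


def p_rejected_everywhere_shared(q: float, n: int) -> float:
    """One shared, systematically biased screen: a single draw settles all n."""
    if not 0.0 <= q <= 1.0 or n < 1:
        raise ValueError("q must be in [0, 1] and n must be >= 1")
    return q


# Hypothetical numbers: a 30 percent wrongful-rejection rate, ten applications.
q, n = 0.3, 10
print(p_rejected_everywhere_independent(q, n))
print(p_rejected_everywhere_shared(q, n))
```

Under independence the chance of being shut out everywhere shrinks with every extra application. Under a shared bias it does not shrink at all, however many doors he knocks on.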

Screening was always a vibe check; HR has spent decades trying to launder the gut feeling out of it. This tool doesn't remove the gut feeling. It gives it a score out of 100 and a temperature parameter and calls the result a measurement. The gate didn't get more objective. It just stopped admitting it was guessing.