Trusting code I didn't read
Leif ·
The line I stopped reading
I do not read every line anymore. For a while that felt like the thing I was not supposed to admit, the corner you are not allowed to cut. But the speed is real, and pretending I still read every diff top to bottom would just be a lie I tell to feel responsible. So I had to answer the actual question: if I am not reading the code, why do I trust it?
The answer cannot be "the model is good." That is faith, and faith ships bugs. When you stop reading every line, the trust has to come from somewhere, and the work is moving it from your eyes to something you can lean on. Being fast is worthless if you are confidently wrong, and confidently wrong is exactly the failure mode AI is best at. It will hand you a clean diff, a tidy summary, and a bug, all with the same calm tone.
Make it tell you where it is unsure
The first thing I do is ask. I ask the model how confident it is in the whole change. Then I ask about one file. Then I ask about one specific claim it made, the load-bearing line, the thing that breaks the most if it is wrong. The number is not the point. What I want is the gradient. The low-confidence spots are a map of where to go look, and they are usually honest, because a model has less reason to defend a guess it already flagged as a guess.
This is a manual habit, and I ran it enough times that I got tired of paying a model to produce a soft number on every change. That is why we built augur. It produces a risk score for a change deterministically, no model in the loop, so the same diff gets the same number every time. A human habit became a tool the moment I trusted it enough to want it automated and repeatable.
A stack of cheap signals
One signal is not trust. A stack of cheap ones is. I lean on four, and they are all cheap on purpose, because a check I dread running is a check I will skip.
Snapshot tests catch drift. I do not have to read the new output if I can see exactly how it differs from the output I already approved. Log-driven development lets me watch what the code actually did instead of guessing from how it reads; the logs are the code confessing, and they do not flatter themselves the way a summary does. Unit tests cover the claims I care about most. And visual surfaces close the loop: I have the agent generate an HTML file or a small visual so I can see the thing with my own eyes. A picture catches what "looks good" hides. Spacing, an off-by-one in a grid, a color that is wrong in a way no assertion was watching for.
How hard I read depends on the language
I do not apply the same pressure everywhere. Swift and Rust I read closely, because the types make reading pay off; the compiler has already done half the verification, so my eyes finish a job that is mostly done. TypeScript and the looser languages I read less and test more, because reading them buys me less and the language will not catch the thing I missed.
That is the whole trade. The trust has to come from somewhere. If it is not coming from your eyes, it has to come from your tests, and you do not get to skip both.