On Reviewing Code in the Age of LLMs
15 Mar 2026

How do you maintain quality when reviewing high volumes of LLM-generated code? I’d argue we don’t have to look far for the answer.
Everything that is true about reviewing LLM-generated code is also true of reviewing human-generated code, only more so. In other words, the difference is one of degree, not of kind. Humans are fully capable of misunderstanding concepts, creating the wrong abstractions, overlooking interactions, and so on. Adding LLMs into the mix magnifies these risks: it’s tough for a human to erroneously update dependencies in thousands of packages manually, but it’s well within the realm of possibility for an LLM-powered tool to do so.
A couple of interesting conclusions follow from this perspective. The first is that drawing a line between LLM-generated and human-generated code is a distraction. You may be tempted to institute a standard like subjecting all LLM-generated changes to additional scrutiny, but at what point does that kick in? When the code is fully LLM-generated? Coauthored with an agentic LLM? Written with the help of LLM-driven IDE autocomplete? The line gets even blurrier with models capable of higher-level reasoning: if a human fully wrote the code based on an architecture plan generated by an LLM, does that deserve extra scrutiny?
The second, and more actionable, conclusion is that whatever guardrails are useful for reviewing LLM-generated code are also likely applicable to human-generated code. For example, how do you ensure reviewers’ time isn’t wasted? You review the change yourself first to catch obvious issues, the same thing you (should) do even when an LLM isn’t involved. How do you mitigate hallucinated library usage? Robust automated tests, the same ones that protect us from erroneous human-generated code. How do you ensure consistent style when an LLM is writing most of the code? Automated linting rules. How do you protect against misguided tool-driven refactoring? Move slowly, deploying carefully reviewed changes in small batches until you gain enough confidence to scale up. These questions aren’t unique to code authored with or by LLMs, but they do become more important.
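To make the "robust automated tests" guardrail concrete, here is a minimal sketch of how even a trivial unit test catches a hallucinated API: because the test actually imports and exercises the code, a call to a nonexistent library function fails loudly in CI rather than in production. The `slugify` helper and its test are hypothetical examples of mine, not from the post.

```python
import re
import unittest


def slugify(title: str) -> str:
    """Lowercase a title and collapse runs of non-alphanumerics into hyphens."""
    # A hallucinating model might instead emit something like
    # re.replace(title, ...) -- a function that does not exist in the
    # re module -- and any test that calls slugify would fail immediately
    # with an AttributeError.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


class TestSlugify(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(slugify("Reviewing LLM Code!"), "reviewing-llm-code")


if __name__ == "__main__":
    unittest.main()
```

The point isn’t the sophistication of the test; it’s that exercising the code at all is enough to surface an invented API, regardless of whether a human or a model wrote the call.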
My recommendation for anyone worried about protecting against the increased risk of LLM-generated code is to lean more heavily on the tools we already have for ensuring code correctness and quality. This attitude doesn’t come from a flippant dismissal of the risks involved; rather, it’s an appeal to pragmatism. Alternatives tend to boil down to good intentions (e.g. requiring authors to disclose the extent of LLM usage in a PR description) and increased vigilance (e.g. reminding reviewers to look extra hard for particular kinds of problems). But neither of these has the makings of a reliable long-term solution.