AI changed what review has to be.
As agents write more code, the bottleneck moves from generation to verification.
Coding agents do not just write faster. They open more diffs in a day than a senior engineer can read with judgment. The math of "review every change carefully" stops working when one human is meant to grade ten or twenty PRs from tools that never get tired. Review has to become a system, not a person.
CI says the build passed. Code review says someone looked. Observability tells you after production hurts. Spinal connects the signals from before and after a change ships.
AI-generated code is also persuasive in a way that bad human code rarely is. It compiles, passes lint, and often passes CI. What it does not reliably do is fit: fit the unspoken architecture, the boundaries that exist for reasons not written in the file, the pattern this codebase has chosen and the ten patterns it has rejected. A diff can read fine line by line and be wrong at the system level.
Catching that is not a style check. It is a system check. It requires reading architecture, production behavior, prior incidents, and the history of the touched paths, and weighing them against the change.
It starts with production-aware PR review.
PR review is the natural starting point because the change has a shape. There is a diff. There is an intent. There are touched services, owners, tests, and production paths. The harness can ask concrete questions, grounded in what the system actually does in production.
Is this abstraction carrying its weight, or hiding what the change actually does? Does the implementation fit the local architecture, or fight it? Is this duplication a necessary fork, or a copy-paste the harness should flag? Did it touch a path with recent incidents, elevated error rates, or missing traces?
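The last of those questions is the easiest to make mechanical. What follows is a minimal sketch of such a check, not Spinal's implementation: `PathSignals`, `flag_touched_paths`, and the `elevation_factor` threshold are all hypothetical names chosen for illustration, and the signals are assumed to have already been gathered from whatever incident and observability tooling a team uses.

```python
from dataclasses import dataclass


@dataclass
class PathSignals:
    """Hypothetical production signals for one code path (illustrative only)."""
    recent_incidents: int       # incidents attributed to this path in a recent window
    error_rate: float           # current error rate, 0.0 to 1.0
    baseline_error_rate: float  # typical error rate for this path
    has_traces: bool            # whether tracing covers this path


def flag_touched_paths(touched, signals, elevation_factor=2.0):
    """Return (path, reason) pairs for touched paths that deserve a closer look.

    `touched` is a list of paths from the diff; `signals` maps path -> PathSignals.
    `elevation_factor` is an assumed threshold: an error rate above
    elevation_factor * baseline counts as elevated.
    """
    findings = []
    for path in touched:
        s = signals.get(path)
        if s is None:
            # Nothing known about this path in production; say so rather than guess.
            findings.append((path, "no production signals found"))
            continue
        if s.recent_incidents > 0:
            findings.append((path, f"{s.recent_incidents} recent incident(s)"))
        if s.error_rate > elevation_factor * s.baseline_error_rate:
            findings.append((path, "elevated error rate"))
        if not s.has_traces:
            findings.append((path, "missing traces"))
    return findings
```

The other questions (abstraction, architectural fit, duplication) need judgment about the code itself and do not reduce to a threshold; this sketch covers only the production-signal part.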
A generic AI reviewer comments on a diff. Spinal builds a case. It connects code, architecture, production signals, specific tests, and final patch review into a confidence report that a developer can act on before merge and before production.
For developers, that means fewer vague comments and more actionable evidence: the exact risk, the missing test, and the reason it matters.
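One way to picture a confidence report is as a list of evidence items, each carrying the risk, the missing test, and the reason it matters, rolled up into a single verdict. The sketch below is an assumption about shape only, not Spinal's actual report format: `Evidence`, `ConfidenceReport`, the three severity levels, and the verdict wording are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumed severity scale; a real system may use something richer.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Evidence:
    """One finding in the report (hypothetical structure)."""
    source: str             # e.g. "architecture", "production", "tests", "patch"
    risk: str               # the exact risk
    why: str                # the reason it matters
    severity: str = "low"   # "low", "medium", or "high"
    missing_test: str = ""  # the test that would cover the risk, if any


@dataclass
class ConfidenceReport:
    """Collects evidence for one change and summarizes it."""
    change: str
    evidence: list = field(default_factory=list)

    def add(self, item):
        if item.severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {item.severity!r}")
        self.evidence.append(item)

    def verdict(self):
        """The worst single finding decides the verdict."""
        if not self.evidence:
            return "no findings"
        worst = max(SEVERITY_ORDER[e.severity] for e in self.evidence)
        return ["likely safe", "needs a closer look", "hold before merge"][worst]

    def render(self):
        """Plain-text report, most severe findings first."""
        lines = [f"{self.change}: {self.verdict()}"]
        ordered = sorted(self.evidence, key=lambda e: -SEVERITY_ORDER[e.severity])
        for e in ordered:
            line = f"- [{e.severity}] {e.source}: {e.risk} ({e.why})"
            if e.missing_test:
                line += f"; missing test: {e.missing_test}"
            lines.append(line)
        return "\n".join(lines)
```

Letting the worst finding decide the verdict is a deliberately conservative choice for the sketch; a weighted score would be an equally plausible design.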
PR review is one checkpoint. The harness expands from there: PR review → releases → runtime → RCA → next review.