Why not just use an LLM to review your pull requests?

It is a fair question, and we hear it often. You can open Claude or another model today, paste in a diff, and get a genuinely useful review. The models are good and getting better. So why would a team pay for a product that sits on top of them?

Because the model is the easy part. Turning a good answer on one diff into something a team can rely on — across every repository, every day, from every author — is a different problem. And most of that problem is the harness around the model, not the model itself.

An LLM reviews a diff. A team needs a review system.

A model can review one diff. That part is solved.

Let us concede the strongest version of the question first. Paste a diff into a capable model with a decent prompt, and you will often get sharp, specific feedback. If your need is a second opinion on a change you already have open in your terminal, you may not need anything more than that.

The gap is not the quality of the answer. It is everything that has to be true for that answer to show up reliably, on the changes that matter, for a whole team — without someone babysitting it.

A review you have to remember to run isn't a system.

The value of review is that it happens on every pull request — every repo, every author, automatically — including the changes from the people who would never think to ask a model. A reviewer that only runs when a conscientious engineer pastes a diff covers exactly the changes that least need covering.

Making review fire the moment a PR opens, server-side, for everyone, is infrastructure: webhooks, queuing, retries, and a service that is always on. That is not a prompt. It is the part that actually delivers coverage.
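To make "webhooks, queuing, retries" concrete, here is a minimal sketch of the intake side. It is an illustration under stated assumptions, not a description of any particular product: the payload shape (`action`, `pull_request.number`) and the `sha256=<hex>` signature format follow the convention GitHub uses for webhooks, and every function and variable name is hypothetical.

```python
import hashlib
import hmac
import json
import queue
import time

# Hypothetical in-process queue; a real service would use a durable one.
review_queue = queue.Queue()


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check an HMAC-SHA256 signature of the assumed form 'sha256=<hex>'."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), header.encode())


def handle_webhook(secret: bytes, body: bytes, header: str) -> int:
    """Return an HTTP-style status code; enqueue a job when a PR opens."""
    if not signature_is_valid(secret, body, header):
        return 401
    event = json.loads(body)
    if event.get("action") == "opened" and "pull_request" in event:
        review_queue.put({"pr": event["pull_request"]["number"]})
    # Acknowledge quickly; the review itself runs later, off the queue.
    return 202


def run_with_retries(job, review, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call review(job), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return review(job)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
```

Even this toy version leaves out most of what an always-on service needs: persistence across restarts, deduplication of redelivered events, and alerting when retries are exhausted.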

An opinion is not evidence.

"This looks risky" is where a model stops. Getting to "this is a bug" means writing a test for the specific risk, standing up the runtime and services the test needs, running it, and reading the result. That is an execution system — and it is where most do-it-yourself attempts quietly stop, because it is real engineering, not prompting.

Without it, you are back to a confident comment that a developer still has to verify by hand.
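The "run it and read the result" step can be sketched in a few lines, which also shows how much it leaves out. The example below is a simplified illustration with hypothetical names: it assumes the generated test is a standalone Python script that signals the bug with a failed `assert`, and it runs the script in a plain subprocess with no sandboxing, dependencies, or services.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_probe(test_source: str, timeout: float = 30.0) -> str:
    """Run a generated test in a separate process and classify the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "probe_test.py"
        path.write_text(test_source)
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return "inconclusive"
    if result.returncode == 0:
        return "not reproduced"  # the suspected failure did not occur
    if "AssertionError" in result.stderr:
        return "confirmed"  # the test failed on the behavior it targets
    # Any other crash (bad import, syntax error) says nothing about the bug.
    return "inconclusive"
```

The three-way result matters: a test that crashes for unrelated reasons should not be reported as a confirmed bug. Writing a test that targets the right risk, and standing up the runtime and services it needs, is the larger part of the work and is not shown here.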

The setup has to belong to the team, not a person.

A useful reviewer carries standing instructions, repo knowledge, the ability to call your internal tools, a way to validate in your CI, and rules about what it must never see. Those have to be shared and persistent — owned by the repo, applied to every review.

If each engineer wires their own tools and prompts in their own terminal, you get a different reviewer per person and nothing the team actually owns. The configuration is the product as much as the model is.
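One way to picture team-owned configuration is as a single validated object loaded from a file checked into the repo. The sketch below is hypothetical in every detail: the field names, the idea of glob patterns for files the reviewer must never read, and the loader are assumptions made for illustration.

```python
import fnmatch
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReviewConfig:
    """Hypothetical shared settings applied to every review in a repo."""
    instructions: list[str] = field(default_factory=list)  # standing guidance
    tools: list[str] = field(default_factory=list)         # internal tools to call
    ci_command: str = ""                                   # how to validate in CI
    never_read: list[str] = field(default_factory=list)    # glob patterns

    def may_read(self, path: str) -> bool:
        """Return False if the path matches any never-read pattern."""
        return not any(fnmatch.fnmatch(path, p) for p in self.never_read)


def load_config(raw: dict) -> ReviewConfig:
    """Build a config from parsed file contents, rejecting unknown keys."""
    unknown = set(raw) - set(ReviewConfig.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return ReviewConfig(**raw)
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled `never_read` that is silently ignored would be worse than a loud failure.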

And then the unglamorous 90%.

Redaction so sensitive data is stripped before the model ever sees it. An audit trail of every review. SSO and access control. Data residency for teams that need it. Memory, so the system gets better across reviews instead of starting from zero each time. Plus retries, rate limits, cost control, and keeping up as models change underneath you.

None of it is hard in isolation. All of it has to keep working, every day, or the reviewer quietly stops being trustworthy.
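Redaction is a good example of "not hard in isolation". A minimal sketch, assuming regex-based detection and two illustrative patterns (email addresses and strings shaped like AWS access key IDs), fits in a dozen lines. The pattern set, labels, and function name are hypothetical; a production redactor would need far broader coverage.

```python
import re

# Illustrative patterns only; real coverage would be much wider.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "AWS_KEY_ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def redact(diff: str) -> tuple[str, dict[str, int]]:
    """Replace sensitive matches with placeholders before any model call.

    Returns the redacted text plus per-pattern counts, which can be
    written to an audit log without storing the sensitive values.
    """
    counts = {}
    for label, pattern in PATTERNS.items():
        diff, n = pattern.subn(f"[REDACTED:{label}]", diff)
        counts[label] = n
    return diff, counts
```

The ongoing cost is in what the sketch omits: keeping patterns current, measuring false negatives, and proving after the fact that redaction ran on every review.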

Build it if you want — just price the whole thing.

Plenty of teams build an internal version, and that is a legitimate choice. The mistake is pricing it as "a prompt and an API key" when it is a system with an owner, a backlog, and an on-call rotation. The model improves on its own; the harness only improves if someone keeps building it.

Spinal is that harness, already built and maintained — so your engineers can use a review system instead of becoming the team that maintains one.


Use the system, don't maintain it.

Connect a repo and see what a full review harness does on a real pull request — automatic, validated, configured for that repo. 15 days free, no credit card.

