Field Notes

AI Code Review Needs Disagreement

2026-05-25 · 5 min read · AI, Coding, Quality

As AI review moves from comments into suggested fixes and agentic repair, the important question is whether teams still have enough friction to notice what quality actually requires.

Code review is becoming less like a meeting and more like a conveyor belt with opinions.

That sounds efficient, which is why it deserves suspicion.

GitHub now treats Copilot code review as a normal part of the pull request surface. It can be requested by a person, triggered automatically through repository rules, leave comments on changed lines, and help apply feedback through Copilot-assisted changes. OpenAI's Codex work points in the same direction from another angle: coding agents are not only generating patches, but operating inside sandboxes, asking for approvals, leaving logs, and moving toward reviewable work packets.

The center of gravity is shifting.

AI is no longer just helping write the code that enters review. It is starting to participate in the review itself, and then in the repair of what review finds.

This is useful in the plainest possible way. A machine that catches a missing null check, a suspicious migration, a flaky test, or a common security mistake before a human has to spend attention on it is doing real work. Many pull requests contain problems that are boring to find and expensive to miss. Review queues are overloaded. Senior engineers are scarce. Teams ship under pressure. Anything that raises the floor of ordinary review is worth taking seriously.

But the floor is not the whole room.

The deeper purpose of code review has never been only defect detection. It is also where a team argues with itself about standards. Review is where private taste becomes shared taste, where architecture is defended or corrected, where naming gets less careless, where a shortcut is judged against its future maintenance cost, where a junior engineer learns why the obvious solution is not always the responsible one.

Review is not just a quality gate.

It is a cultural instrument.

That makes AI review stranger than the demos imply. If a model suggests the change, another model reviews the change, and an agent helps apply the comment, the workflow can begin to look satisfyingly closed. The loop has movement. It has artifacts. It has comments. It has green checks. It has the emotional shape of diligence.

The danger is that the loop may contain less disagreement than it appears to.

Good review depends on a particular kind of friction. Not bureaucracy, not performative nitpicking, not the slow cruelty of making someone defend every line. The useful friction is the moment where someone says: this technically works, but it makes the system harder to understand. This patch passes tests, but it teaches the wrong abstraction. This feature ships faster, but it moves complexity into the place where the next person will least expect it.

AI can assist that conversation. It cannot replace the need for it.

This matters because software teams are very good at mistaking throughput for health. A review system with faster comments, faster fixes, and fewer idle pull requests will look better on many dashboards. Cycle time improves. Open review counts drop. More code lands. The organization feels less blocked.

None of those numbers can tell you whether the team is becoming more thoughtful.

They may even hide the opposite.

A team can become faster at accepting plausible work. It can normalize machine-shaped comments that catch local issues but miss the broader tension. It can train people to resolve review as a checklist rather than a conversation. It can preserve the ritual of review while weakening the practice that made the ritual matter.

That is the uncanny part. The interface still says review. The pull request still has comments. Someone still clicks approve. But the human act inside the review may shrink into routing: accept, reject, regenerate, apply, rerun.

There is a version of this future that is genuinely better. AI review handles the mechanical pass. It finds common mistakes early. It explains suspicious code paths. It gives reviewers a cleaner starting point. It notices inconsistency across a diff. It helps authors address small fixes before asking a teammate for scarce attention.

In that version, the human reviewer gets more room for judgment.

But that outcome is not automatic. It has to be designed.

Teams will need to decide what machine review is allowed to settle and what it is only allowed to surface. They will need language for the difference between a fixable comment and a disagreement about direction. They will need review norms that make it acceptable to pause an AI-smoothed pull request because something feels wrong at the system level. They will need to preserve comments that explain taste, not only comments that produce patches.
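One way to make the line between settling and surfacing concrete is to write it down as a policy that defaults to surfacing. The sketch below is illustrative only: the comment categories, the `MACHINE_MAY_SETTLE` allow-list, and the `max_files` threshold are all assumptions invented for this example, not features of any real review tool.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    # Hypothetical categories; a real team would choose its own.
    NULL_CHECK = "null_check"
    STYLE = "style"
    FLAKY_TEST = "flaky_test"
    SECURITY = "security"
    ABSTRACTION = "abstraction"
    ARCHITECTURE = "architecture"


class Route(Enum):
    SETTLE = "settle"    # the machine may propose and apply the fix
    SURFACE = "surface"  # the machine may raise it; a human decides


# Assumed allow-list: only these kinds may ever be settled by the machine.
MACHINE_MAY_SETTLE = {Kind.NULL_CHECK, Kind.STYLE, Kind.FLAKY_TEST}


@dataclass(frozen=True)
class Comment:
    kind: Kind
    files_touched: int
    text: str


def route(comment: Comment, max_files: int = 1) -> Route:
    """Default to SURFACE; settle only narrow, allow-listed comments."""
    if comment.kind in MACHINE_MAY_SETTLE and comment.files_touched <= max_files:
        return Route.SETTLE
    return Route.SURFACE
```

The design choice that matters here is the default. Anything the team has not explicitly agreed to delegate, including a routine-looking fix that sprawls across several files, goes back to a person.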

They will also need to watch for the quiet transfer of responsibility.

When an AI reviewer misses a bug, the bug still belongs to the team. When an agent applies a suggested change, the change still belongs to the author. When a repository rule requests a review automatically, someone still chose the rule, trusted the surface, and decided how much weight its comments should carry.

Automation does not dissolve ownership.

It often makes ownership easier to avoid.

The best AI review tools will probably be the ones that make disagreement more legible rather than less. They will show evidence, uncertainty, and scope. They will distinguish local correctness from architectural concern. They will help reviewers ask better questions instead of only producing more comments. They will make it clear when a suggestion is routine and when a human should slow down.
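As a rough illustration of what "legible" could mean in practice, a review suggestion might carry its evidence, confidence, and scope as explicit fields rather than burying them in prose. This is a minimal sketch under stated assumptions: the `Suggestion` type, its fields, and the 0.8 threshold are hypothetical and do not describe any existing product's schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Scope(Enum):
    LOCAL = "local"                  # correctness of the changed lines
    ARCHITECTURAL = "architectural"  # direction, abstraction, system shape


@dataclass
class Suggestion:
    summary: str
    scope: Scope
    confidence: float  # 0.0 to 1.0, assumed to be reported by the tool
    evidence: list[str] = field(default_factory=list)

    def needs_human_pause(self, min_confidence: float = 0.8) -> bool:
        """Routine only if local, confident, and backed by evidence."""
        routine = (
            self.scope is Scope.LOCAL
            and self.confidence >= min_confidence
            and bool(self.evidence)
        )
        return not routine
```

A confident suggestion with no evidence, or any suggestion that touches architecture, asks the reviewer to slow down instead of inviting a one-click apply.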

That is a less glamorous product promise than "review faster."

It is also closer to what serious software needs.

The future of code review should not be a world where every pull request glides through a machine-polished approval lane. It should be a world where low-value friction disappears and high-value friction becomes easier to protect.

Because the health of a codebase is not measured only by how quickly changes enter it.

It is measured by whether the people responsible for it still know how to disagree on behalf of the future.