Field Notes
Second Opinions Need Standards
As AI tools start asking other AI tools to critique their plans, tests, and output, the humane question is whether the second opinion is defending a real standard or just making uncertainty look managed.
The new shape of AI assistance has developed an office ritual. Before the agent continues, it asks another agent what it thinks.
GitHub's latest Copilot CLI updates make this literal. A built-in rubber duck agent can review the main agent's plan, design, implementation, or tests, then return concrete criticism before the work moves on. The Copilot app is adding the same behavior into a broader agent-native desktop, where plans, diffs, canvases, browser sessions, terminals, and pull requests all become things a person can inspect. OpenAI describes Codex telemetry being used with an AI-powered security triage agent before humans decide whether a strange event is expected behavior, a benign mistake, or something worse. Anthropic's agent eval guidance keeps circling the same hard lesson from another direction: if your test cases and grading criteria are vague, the machine will optimize toward a fog bank and report back with numbers.
It is tempting to read all of this as a simple maturation story. The first agents wrote code. The next agents reviewed the code. The more responsible agents reviewed the reviewers. We are, apparently, inventing a little committee of software made from tokens, logs, and optimism. Anyone who has ever worked in an office should feel at least a small chill.
Still, the instinct is not wrong. Second opinions matter. A tired developer can miss the obvious edge case because it arrived at 5:47 p.m. wearing the same clothes as the previous six edge cases. A product manager can get attached to the first plan that sounds coherent after a week of scattered meetings. A security analyst can stare at a log line too long and begin to see either danger everywhere or nowhere. Another pass can help. Sometimes the best thing in the room is a critic with no calendar fatigue and no social obligation to pretend the draft is closer than it is.
But a second opinion is only valuable when it is attached to a standard. Without one, it becomes a very polished form of fidgeting.
This is the risk hiding inside AI self-review. The interface can make critique look like quality even when nothing meaningful has been settled. The agent says it consulted a critic. The critic found issues. The main agent addressed them. The transcript has the calm, documented texture of process. It feels safer than a single model charging ahead with the confidence of a junior consultant and the memory of a filing cabinet that has been struck by lightning.
What may be missing is the boring human material that makes criticism useful: the team's actual taste, the failure modes that have hurt before, the customer promises that cannot be bent, the accessibility bar nobody gets to skip, and the regulatory line that is not negotiable. A critic agent can spot contradictions, incomplete tests, suspicious assumptions, and sloppy reasoning. It cannot magically know what an organization is unwilling to become.
That distinction matters because AI products are making review feel cheaper. Not free, exactly. The economics are already complicated by tokens, Actions minutes, cloud sessions, and usage tiers. But psychologically cheaper. It is easier to request another pass when the reviewer is always there, does not sigh, and returns notes in tidy prose. It can also teach teams to substitute more review activity for clearer review principles.
You can see the human version of this failure in any workplace with too many approval steps and not enough judgment. The deck has been reviewed by seven people and still says nothing. The incident report has comments from every director and still avoids the cause. The product spec has passed through legal, compliance, design, engineering, and sales, and somehow the customer problem is still hiding under a cheerful section called "experience considerations." Process multiplied. Standards did not.
AI can accelerate that pattern. It can generate the first draft, critique the draft, revise the critique, summarize the revision, and produce a table explaining how all concerns were addressed. The danger is not that the machine refuses to review itself. The danger is that self-review becomes a substitute for someone saying, plainly, "This is not good enough because here is what good means here."
Good second-opinion design should make standards more visible, not less. If a rubber duck agent is reviewing a plan, it should know whether the team cares most about reversibility, latency, accessibility, cost, data handling, or maintainability. If a security triage agent is explaining strange behavior, it should preserve the chain from user intent to tool action to policy decision rather than flattening the whole thing into "low risk." If an eval suite is grading an agent, the task should be clear enough that two competent humans would mostly agree on success. This is rubric work. It is example work. It is the slow labor of making quality portable without making it stupid.
There is a humane side to this too. Better standards protect people from becoming the final landfill for machine uncertainty. When a second-opinion agent does real work, the human reviewer should receive sharper evidence, not a larger pile. The interface should say: here is the assumption that changed, here is the test that failed, here is what remains unresolved, here is the decision that still belongs to you. That is different from handing someone a transcript the length of a minor Victorian illness and calling it transparency.
The best version of these systems might feel less like an autonomous committee and more like a well-run workshop. The apprentice brings the draft. The critic points to the weak joint. The senior person does not have to reread every shaving on the floor, but can still see why the piece may wobble. Everybody knows what the object is supposed to become.
That last part is the real test. AI second opinions will probably become normal because the need is real. Agents will work across more files, sessions, tools, and time than a person can comfortably hold in their head. Review will need help. But if the help only gives us fluent doubt, synthetic confidence, or ceremonial critique, then we have just automated the meeting before the meeting.
The wider frame is not whether machines can check each other. Of course they can, and often they should. The deeper question is whether we are building organizations, interfaces, and habits where a check means something. A second opinion should not be a decorative pause in the march toward output. It should be a place where standards become legible, uncertainty is named cleanly, and the human still has enough context to exercise judgment rather than bless a conversation between machines.