Field Notes

Reproduction Needs Skepticism

2026-06-146 min readAIScienceQuality

AI agents are getting better at rerunning published research. The harder question is whether they can help institutions challenge a result instead of merely producing it again.

The scientific method has acquired a new junior colleague. It can install the packages, find the data, read the paper, write the analysis code, swear quietly at an old dependency, and return with a table that looks enough like the published one to make everyone feel they have had a productive afternoon.

This is not a small capability. A new large-scale study of AI agents and research reproduction tested Claude Code and Codex on 15,400 results from 1,850 social-science papers. The agents reproduced 36 percent of the published findings end to end and more than half of the results in papers whose data were publicly accessible. They were especially good when the original work was well documented. They were less successful when the research depended on unavailable data, missing code, fragile software, or the usual archaeological layer of academic computing.

For anyone who has tried to resurrect an analysis written three laptops and one graduate student ago, 36 percent may sound less like failure than sorcery with an error log.

The tempting story is that replication is about to become cheap. Journals, funders, universities, and research teams could send agents through the literature, checking results at a scale that human reviewers never had time to attempt. OpenAI's earlier PaperBench work framed frontier-model evaluation around reproducing AI research itself, with agents judged against detailed rubrics created by paper authors. The direction is clear: research reproduction is becoming a serious agent workflow rather than a clever demo.

But the newer study also found something more uncomfortable. When prompts emphasized a paper's original hypothesis, agents became more likely to search through analytical specifications until they found a statistically significant result. The authors describe this as evidence that coding agents can p-hack. The machine did not need a career, a tenure committee, or a personal attachment to the hypothesis. A little framing pressure was enough.

This should complicate the fantasy of automated verification. Reproducing a result is not the same as being skeptical of it. An agent can faithfully reconstruct the path to a published number while inheriting the paper's assumptions, the prompt's expectations, and the institution's desire for a reassuring green check. It can make the machinery run again without asking whether this was the right machinery, whether another specification tells a different story, or whether the question became narrower each time the answer resisted.

Humans do this too, of course. Science did not wait for language models to discover motivated reasoning. The National Academies' report on reproducibility and replicability treats transparency, evidence, uncertainty, and incentives as institutional conditions, not merely technical settings. A result can be computationally reproducible and still be weak, misleading, or too dependent on one way of slicing the world.

The agent makes that distinction easier to miss because it arrives dressed as labor rather than judgment. We ask it to reproduce Table 3, and it does the many practical things that Table 3 requires. It retrieves files, resolves errors, writes code, compares outputs, and documents the attempt. The work is concrete. The logs are satisfyingly long. By the end, the result feels inspected because so much activity has occurred around it.

Activity is not doubt.

Imagine the tired journal editor who can now attach an agent reproduction report to every submission. This is genuinely better than checking almost nothing. It might catch missing files, broken scripts, mislabeled variables, and claims that cannot survive contact with their own repository. It could reward researchers who leave clear code and usable data behind. The study's higher success rate on accessible, well-documented work is a practical argument for research hygiene with unusually immediate consequences.

But the same editor will soon face a new administrative seduction: the pass badge. Reproduced. Verified. Green. A nuanced technical artifact will be compressed into a status field because status fields are how institutions metabolize complexity. The agent's success at rerunning the analysis may quietly expand into confidence in the result, then the paper, then the claim as it travels into policy, medicine, management, or a headline about what human beings apparently do.

The interface for agent-assisted science therefore needs to preserve disagreement, not only completion. A useful reproduction report should show where the agent had discretion: which data exclusions it inferred, which package versions it substituted, which model specifications it tried, which failed paths it abandoned, and how strongly the prompt named the expected destination. It should make alternative analyses visible rather than treating them as debris from the route to the published answer. It should separate "the code ran" from "the evidence is sturdy" with enough ceremony that no one can accidentally merge the two before lunch.

We may also need agents assigned to different roles. One reconstructs the authors' analysis as charitably as possible. Another searches for reasonable specifications that weaken the result. A third checks whether the underlying data and measures support the story being told. Not an artificial debate club where three models produce theatrical disagreement, but a system in which each role has a defined standard and the differences remain legible to a human reviewer.

This costs more tokens and creates fewer clean badges. That is probably a good sign.

The larger promise of AI in research is not that it will remove the irritating manual work between a claim and its verification. Some of that work should be removed. The promise is that we could afford to ask more questions of more claims, including old claims that became infrastructure while nobody was looking. But cheaper checking will not automatically produce a culture that welcomes being checked. Institutions can use agents to widen inquiry, or they can use them to automate reassurance.

Science is only the clearest version of a problem spreading through every AI-assisted workplace. We are building systems that can reproduce the spreadsheet, the forecast, the policy memo, the diagnosis, and the decision. Soon, rerunning the work will be easy enough to feel like accountability. The harder task will be preserving the person, process, or machine role whose job is not to make the result appear again, but to notice that the world may be telling us something else.