Field Notes

Science Agents Need Strange Questions

2026-07-02 · 5 min read · AI, Science, Work

As AI research workbenches start running analyses, querying databases, and producing reproducible artifacts, the humane scientific question is whether they preserve exploration or quietly narrow it.

There is a particular expression a researcher gets when the data has become rude. The plot is wrong in a way that might be interesting. The sample sheet has one column that looks innocent and is absolutely not. A graduate student says, "This is probably nothing," which is how many laboratories announce the arrival of a small haunting.

AI is arriving for that moment with a better workbench. Anthropic's new Claude Science beta can run analyses, search scientific databases, manage compute, render proteins and genomic tracks, draft manuscripts beside the analysis, and attach provenance to figures, tables, notebooks, code, environment, and conversation. Google's AI co-scientist uses specialized agents for hypothesis generation, reflection, ranking, evolution, and meta-review. Google DeepMind's AlphaEvolve works in a more constrained setting, using Gemini models and automated evaluators to evolve algorithms that can be run, scored, and improved.

This is meaningful progress. Anyone who has watched a scientist spend an afternoon persuading a dependency stack to remember 2019 can appreciate an assistant that manages environments without treating the researcher like a junior sysadmin in a borrowed hoodie. If a lab can reproduce an analysis months later without interrogating a postdoc who has since moved cities and changed file naming conventions out of self-defense, good. That is civilization, lightly nerdy and blessedly versioned.

The subtler danger is that scientific AI may become very good at making the nearest plausible next question easier to ask.

A recent paper, "AI Research Agents Narrow Scientific Exploration," puts a name to this worry. The authors tested research-agent frameworks across large numbers of generated AI and machine-learning ideas and found that AI-generated ideas were more concentrated than human-authored papers in the same areas, stayed closer to seed literature, and tended to differ by recombining existing methods rather than opening fundamentally new questions. That does not make the tools useless. Local elaboration is a huge part of science.

Still, local elaboration has a social shape. It rewards the question that is adjacent to the literature, compatible with available data, convenient for the evaluator, easy to describe in the vocabulary the system already understands, and likely to receive a high score from whatever quiet little tournament is happening inside the machine. Grant panels, tenure committees, journal incentives, and lab budgets already tug researchers toward legible ambition. Agents may add another layer of legibility so fluent that it feels like imagination.

You can picture the ordinary lab version. A principal investigator has one hour before a meeting and asks the system for possible experiments. The agent returns ten sensible ideas with citations, protocol sketches, caveats, and a figure plan. Nobody in the room has to ask, "What about the strange question we cannot yet phrase cleanly?" because the agenda is full and the assistant has been so helpful that objecting feels like deciding to churn your own butter.

This is where the interface matters. A science agent should not only produce ranked hypotheses and tidy paths. It should preserve the wilderness around them: ideas rejected because they lacked obvious evidence, proposals that are merely familiar methods wearing new hats, questions that are under-specified, interdisciplinary, annoying, or currently unfundable.

Good scientific workbenches will need something closer to a lab notebook than a productivity dashboard. The notebook does not only record the victorious figure. It records the odd observation, the failed reagent, the abandoned path, and the weird aside that becomes important six months later. Claude Science's emphasis on artifact history and background review points in the right direction. But provenance should not stop at reproducibility. It should also preserve exploration.

That means making uncertainty durable: which questions remain open, which negative results were informative, which generated ideas clustered too tightly, and which paths were not pursued because the model, the metric, or the lab's incentives preferred the smoother road.
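One way to picture durable uncertainty is as a record that sits beside the figures and notebooks. The sketch below is purely illustrative: `ExplorationEntry`, `ExplorationLog`, `Status`, and every field name are hypothetical, not part of any existing workbench. It assumes only that a path set aside should carry a stated reason and a note about who or what set it aside.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Hypothetical categories, chosen to mirror the paragraph above.
    OPEN = "open"                # question still unanswered
    PURSUED = "pursued"          # someone actually ran with it
    NOT_PURSUED = "not_pursued"  # set aside; the reason must be recorded
    NEGATIVE = "negative"        # tried, failed, and the failure was informative


@dataclass
class ExplorationEntry:
    question: str
    status: Status
    reason: str = ""             # why the path was dropped, in plain words
    decided_by: str = "unknown"  # e.g. "model", "metric", "lab", or a name


@dataclass
class ExplorationLog:
    entries: list = field(default_factory=list)

    def record(self, entry):
        # Refuse to let a path vanish without saying why it was dropped.
        if entry.status is Status.NOT_PURSUED and not entry.reason:
            raise ValueError("a path not pursued needs a recorded reason")
        self.entries.append(entry)

    def unresolved(self):
        # Everything a later reader might want to reopen.
        keep = (Status.OPEN, Status.NOT_PURSUED)
        return [e for e in self.entries if e.status in keep]


log = ExplorationLog()
log.record(ExplorationEntry("Why does one sample-sheet column drift?", Status.OPEN))
log.record(ExplorationEntry(
    "Is the odd plot a batch effect?",
    Status.NOT_PURSUED,
    reason="ranked low by the evaluator",
    decided_by="metric",
))
log.record(ExplorationEntry("Does reagent lot matter?", Status.NEGATIVE))
```

The design choice worth noticing is the refusal in `record`: the log will accept a failure or an open question as is, but it will not accept a silent abandonment.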

The expensive part of science is not only running the experiment. It is learning what kind of question the experiment deserved. Access without exploratory range can make science faster in the way a hallway is fast: efficient, direct, and not especially good at discovering rooms no one remembered were there.

We should also be careful with the phrase "co-scientist." It flatters the collaboration and hides the asymmetry. The system can generate, rank, and refine. It does not have a body in the lab, a career shaped by a failed line of inquiry, a patient waiting on a therapy, or a memory of the sample that smelled wrong.

The better future is science tooling that treats acceleration as incomplete unless it also protects divergence. Faster literature review, reproducible figures, managed compute, and database connections are excellent. They should be paired with interfaces that notice when a field is becoming too neat, when the idea space is collapsing inward, when the obvious next step has started to feel like the only adult option.
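As a rough illustration of what "noticing" could mean, here is a minimal sketch that flags a batch of generated ideas whose wording overlaps heavily. It is an assumption-laden toy, not the method used in the paper mentioned earlier and not a feature of any named product: the function names, the word-overlap (Jaccard) measure, and the 0.5 threshold are all chosen here for illustration, and a real tool would likely compare embeddings or citations rather than raw words.

```python
from itertools import combinations


def tokens(text):
    # Crude bag of lowercase words; a stand-in for a real representation.
    return set(text.lower().split())


def jaccard_distance(a, b):
    # 0.0 means identical word sets, 1.0 means no words in common.
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(ideas):
    # Average distance over every pair of ideas in the batch.
    pairs = list(combinations([tokens(i) for i in ideas], 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def looks_too_neat(ideas, threshold=0.5):
    # Hypothetical threshold: below it, the batch is flagged as clustered.
    if len(ideas) < 2:
        return False  # nothing to compare against
    return mean_pairwise_distance(ideas) < threshold


clustered = ["tune learning rate schedule", "tune learning rate warmup"]
scattered = ["tune learning rate schedule", "why does sample smell wrong"]
```

With these inputs, `looks_too_neat(clustered)` is true and `looks_too_neat(scattered)` is false. A flag like this decides nothing on its own; its job is to prompt someone in the room to ask what the batch left out.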

Scientific progress has always needed rigor. It has also needed stubbornness, taste, boredom, accident, defiance, and people willing to follow an ugly question past the point where it looks professionally efficient. If agents become part of the laboratory, the design challenge is not simply to make them careful enough to trust. It is to make them strange enough, or at least hospitable enough to strangeness, that they do not sand the best questions down before anyone realizes what has been lost.