Field Notes

Agent Work Needs Stop Conditions

2026-06-16 · 7 min read · AI, Engineering, Work

Coding agents do not only answer prompts. They inherit a workspace, a tool belt, a loop, and sometimes a very loose idea of when the work is over.

The strange thing about the bill was that it did not look like one mistake.

There was no catastrophic prompt, no cinematic moment where a developer asked an agent to understand the entire repo and the agent dutifully set a small pile of money on fire. The usage looked more like weather. Many requests. Large requests. Cached context reused again and again. Tool calls. Edits. Builds. Tests. Benchmark loops.

Then came the detail that made the room colder: some requests appeared to have happened in the middle of the night, when the developer was not actively using the tool.

This was a Cursor session, but making the story about Cursor would be too small and, frankly, too comforting. The broader pattern is arriving across the whole category. GitHub describes Copilot cloud agent as something that can research a repository, create a plan, make changes, run tests, and work in the background. OpenAI's write-up on running Codex safely talks about sandboxing, approvals, network policy, credentials, and agent-native telemetry because coding agents are now expected to run commands and interact with development tools.

That is the threshold worth noticing. The coding assistant has become a little runtime.

A chat assistant has costs, but the shape is familiar. You ask, it answers. The risky unit is mostly the prompt, the answer, and whatever you decide to do with it. An agent changes the unit of risk. The risky thing is the workspace it can see, the tools it can run, the files it can index, the subagents it can dispatch, the context it can carry forward, and the work it believes still belongs to it.

The developer may stop typing. The agent may not understand that as a stop condition.

This is not mystical. It is just automation doing what automation does when the boundary is blurry. A performance investigation turns into repeated Storybook benchmark runs. A failing test becomes an edit-test-edit loop. A broad repo context gets attached because the agent is trying to be helpful. Generated files, trace archives, coverage output, build logs, caches, bundle stats, and scratchpad JSON sit locally in the workspace like harmless dust. Some of them are ignored by Git, which is very different from being ignored by an AI tool.

Source control ignores are a collaboration contract for commits. AI context ignores are an attention and risk contract for machines. A .gitignore says, in effect, do not put this in the repository history. It does not necessarily say, do not read this, summarize this, index this, cache this, or send pieces of this into a model context. Many tools now have their own exclusion systems, such as Cursor's ignore file, because the repo has become both codebase and knowledge base.
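One way to make the gap between the two contracts visible is to list what Git ignores but the AI ignore file never mentions. The sketch below is only illustrative: it assumes the AI exclusions live in a file named .cursorignore at the repo root, and it compares pattern lines as plain text rather than evaluating real ignore semantics. The function names are hypothetical.

```python
from pathlib import Path


def read_patterns(text: str) -> set[str]:
    """Return the non-empty, non-comment lines of an ignore file."""
    patterns = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.add(line)
    return patterns


def missing_from_ai_ignore(git_text: str, ai_text: str) -> list[str]:
    """Patterns that Git ignores but the AI ignore file does not list verbatim."""
    return sorted(read_patterns(git_text) - read_patterns(ai_text))


def check_repo(root: str = ".") -> list[str]:
    """Compare .gitignore and .cursorignore in a repo root (file names assumed)."""
    base = Path(root)
    git_file = base / ".gitignore"
    ai_file = base / ".cursorignore"
    git_text = git_file.read_text() if git_file.exists() else ""
    ai_text = ai_file.read_text() if ai_file.exists() else ""
    return missing_from_ai_ignore(git_text, ai_text)


if __name__ == "__main__":
    for pattern in check_repo():
        print(f"ignored by Git, not listed for the AI tool: {pattern}")
```

A non-empty result is not automatically a problem, since some tools may already honor Git's ignores by default. It is a prompt to decide on purpose rather than by accident.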

This distinction matters more as context gets cheaper to reuse but not free to ignore. Prompt caching is a genuinely useful infrastructure feature. Anthropic's docs describe caching as a way to reduce processing time and costs for repetitive tasks and long multi-turn conversations, while also showing that cached content still has pricing, lifetimes, and structure. OpenAI's prompt caching docs describe automatic caching for long prompts and usage fields that expose cached token counts. The meter has become more complicated than "how long was my question?"

For agents, that complexity is normal operating terrain. A session can repeatedly carry instructions, tool definitions, repository context, prior turns, test output, and file content. Caching may make each reuse less expensive than starting fresh, but the loop can still be large. Many cheaper repetitions can become one expensive night. More importantly, the bill does not explain the work. It tells you that tokens moved. It does not tell you whether the agent was stuck, ambitious, confused, dutiful, retrying a command, waiting on a benchmark, or continuing a plan whose social permission had expired hours earlier.
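The arithmetic behind "many cheaper repetitions" is easy to sketch. The prices and token counts below are invented for illustration and are not any vendor's real rates; the only assumption that matters is that cached input is cheaper than fresh input but not free.

```python
# Hypothetical prices in dollars per million tokens. Not real vendor rates.
PRICES = {"fresh_input": 3.00, "cached_input": 0.30, "output": 15.00}


def request_cost(cached_tokens: int, fresh_tokens: int, output_tokens: int,
                 prices: dict = PRICES) -> float:
    """Cost of one model request, in dollars, under the assumed prices."""
    total = (cached_tokens * prices["cached_input"]
             + fresh_tokens * prices["fresh_input"]
             + output_tokens * prices["output"])
    return total / 1_000_000


def loop_cost(iterations: int, cached_tokens: int, fresh_tokens: int,
              output_tokens: int, prices: dict = PRICES) -> float:
    """Cost of repeating the same-shaped request `iterations` times."""
    return iterations * request_cost(cached_tokens, fresh_tokens,
                                     output_tokens, prices)


if __name__ == "__main__":
    # An edit-test-edit loop carrying 150k cached tokens of context per turn.
    with_cache = loop_cost(400, cached_tokens=150_000, fresh_tokens=5_000,
                           output_tokens=1_000)
    without_cache = loop_cost(400, cached_tokens=0, fresh_tokens=155_000,
                              output_tokens=1_000)
    print(f"with caching:    ${with_cache:.2f}")
    print(f"without caching: ${without_cache:.2f}")
```

Under these made-up numbers each cached request costs about 7.5 cents, which feels negligible, yet 400 turns come to about 30 dollars. Caching cut the bill from roughly 192 dollars, but it did not make the loop small.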

This is where the older instincts of engineering should return.

We already know that unbounded automation is not mature just because it is useful. We put budgets on CI minutes. We set timeouts on jobs. We cap retries. We restrict credentials. We separate read-only investigation from write access. We add circuit breakers where a loop can become expensive or dangerous. Nobody considers this hostility toward automation. It is how automation becomes something adults can depend on.

Agentic development needs the same operational manners in local work.

An agent session should have a budget, and not only a dollar budget. It should have a tool-call budget, a wall-clock budget, a model budget, a context budget, and a scope budget. "Investigate why this component is slow" should not silently become "run full Storybook benchmarks indefinitely, rewrite the surrounding implementation, pursue the next phase, and keep going while the person sleeps." Capability is not permission.
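As a rough sketch of what a multi-dimensional budget could look like, the snippet below models the limits and the running usage as two small records and reports which limits have been crossed. Every field name and threshold here is hypothetical; no particular agent tool exposes this interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Limits agreed before the session starts (all names are illustrative)."""
    max_dollars: float
    max_tool_calls: int
    max_wall_seconds: float
    max_context_tokens: int
    allowed_models: frozenset
    allowed_scopes: frozenset


@dataclass
class Usage:
    """What the session has consumed so far, plus what it is doing now."""
    dollars: float = 0.0
    tool_calls: int = 0
    wall_seconds: float = 0.0
    context_tokens: int = 0
    model: str = ""
    scope: str = ""


def violations(budget: Budget, usage: Usage) -> list[str]:
    """Return the names of every budget dimension the session has exceeded."""
    found = []
    if usage.dollars > budget.max_dollars:
        found.append("dollars")
    if usage.tool_calls > budget.max_tool_calls:
        found.append("tool_calls")
    if usage.wall_seconds > budget.max_wall_seconds:
        found.append("wall_seconds")
    if usage.context_tokens > budget.max_context_tokens:
        found.append("context_tokens")
    if usage.model not in budget.allowed_models:
        found.append("model")
    if usage.scope not in budget.allowed_scopes:
        found.append("scope")
    return found
```

The scope check is the one that encodes "capability is not permission": a session approved for "investigate" that reports its scope as "remediate" is over budget even if it has spent nothing.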

There should also be explicit stop conditions. Stop after repeated failures. Stop before running a full benchmark suite. Stop before switching from diagnosis to remediation. Stop before starting the next PR. Stop before launching parallel subagents. Stop when a command is still waiting after a reasonable interval. Stop when the evidence says the original path is not working. Stop, especially, when the human has left the room in every meaningful sense except the technical one.
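Several of these conditions are simple enough to state as a check that runs before each step of the loop. This is a minimal sketch under assumed names and thresholds (three failures, five minutes of waiting, thirty minutes without human input); a real tool would need its own signals for each.

```python
from dataclasses import dataclass
from typing import Optional

# Actions that should pause for a human instead of proceeding (illustrative names).
CONSENT_REQUIRED = frozenset({
    "run_full_benchmark",
    "start_remediation",
    "start_next_pr",
    "spawn_subagents",
})


@dataclass
class LoopState:
    consecutive_failures: int
    next_action: str
    command_wait_seconds: float
    seconds_since_human_input: float


def stop_reason(state: LoopState,
                max_failures: int = 3,
                max_wait_seconds: float = 300.0,
                max_idle_seconds: float = 1800.0) -> Optional[str]:
    """Return why the agent should stop, or None if it may take the next step."""
    if state.consecutive_failures >= max_failures:
        return "repeated failures"
    if state.next_action in CONSENT_REQUIRED:
        return f"'{state.next_action}' needs explicit consent"
    if state.command_wait_seconds > max_wait_seconds:
        return "command still waiting"
    if state.seconds_since_human_input > max_idle_seconds:
        return "human no longer present"
    return None
```

Returning a reason rather than a bare boolean is deliberate: the reason is what ends up in the session record, so a later reader can see why the work ended where it did.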

The hardest part is that these boundaries have to be visible afterward.

Agents need receipts.

Not a vibe. Not a monthly chart with a mysterious spike at 2:13 a.m. A receipt. Which session ran? Under which instruction? What files were read or changed? Which terminal commands executed, with what duration and exit code? What context sources were attached? Which artifacts were indexed or excluded? How many model requests were made? Which ones reused cached context? Were any subagents started? What did it cost? Why did the agent believe it should continue?
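Those questions map fairly directly onto a record. The sketch below shows one possible shape for such a receipt, serialized as JSON so it can be stored next to the session. The schema and field names are assumptions made for illustration, not a format any current tool emits.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CommandRecord:
    command: str
    duration_seconds: float
    exit_code: int


@dataclass
class SessionReceipt:
    """One record per agent session (hypothetical schema)."""
    session_id: str
    instruction: str
    files_read: list = field(default_factory=list)
    files_changed: list = field(default_factory=list)
    commands: list = field(default_factory=list)
    context_sources: list = field(default_factory=list)
    excluded_artifacts: list = field(default_factory=list)
    model_requests: int = 0
    cached_requests: int = 0
    subagents_started: int = 0
    cost_dollars: float = 0.0
    continuation_reasons: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the receipt, including nested command records."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    receipt = SessionReceipt(
        session_id="example-session",
        instruction="Investigate why this component is slow",
    )
    receipt.commands.append(CommandRecord("npm test", 42.5, 1))
    receipt.continuation_reasons.append("test failed; retrying after edit")
    print(receipt.to_json())
```

The last field is the unusual one. Costs and commands can be reconstructed from logs, but the agent's stated reason for continuing has to be captured at the moment it decides.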

This is partly a billing issue, but only in the way a black box after an incident is partly about insurance. The deeper issue is reconstructability. A team cannot improve a workflow it cannot see. It cannot tune budgets if the only artifact is surprise. It cannot distinguish a helpful long-running investigation from an uncontrolled loop if both leave behind the same vague invoice and a pile of local files.

The practical checklist is not exotic. Keep generated artifacts out of AI context as deliberately as you keep them out of Git. Treat .cursorignore, content exclusions, and agent configuration as part of repo hygiene. Put expensive commands behind explicit consent. Prefer narrow sessions with clean endings over immortal chats that accumulate half the team's working memory. Separate investigation, implementation, and validation into phases the agent has to name. Log enough that a future reviewer can understand the path without becoming a forensic accountant.

None of this means teams should stop using coding agents. That would be a very odd lesson to take from tools that are already useful. The lesson is that useful automation deserves design. A tool that can work through a repo, run commands, edit files, and continue across time is no longer merely a clever text box beside the code. It is part of the development system.

The unsettling overnight request is useful because it punctures the old mental model. The developer experience still feels conversational: I ask, it helps, I stop. The underlying system is becoming operational: sessions, tools, contexts, caches, sandboxes, permissions, schedules, histories, and costs. Those two stories can coexist for a while, but only if the operational story is made visible enough for people to manage.

Agentic development will not become trustworthy because agents get smarter. It will become trustworthy when their work has edges: what they can see, what they can spend, what they can run, when they must stop, and what they leave behind for the person who has to understand the morning after. The future of coding may involve more delegation. It should also involve better receipts.