Field Notes

Agentic Development Needs Load-Bearing Pauses

2026-07-018 min readAIWorkInfrastructure

As coding agents keep working after people stop, the surprise infrastructure bill is not only a cost problem. It is a sign that our software delivery systems were designed around human pauses.

The first reaction to a suddenly larger infrastructure bill is usually moral panic in spreadsheet clothing. Somebody opens the cloud dashboard, sees the line bend upward, and starts asking why the machines have become so needy. Should we throttle the agents? Should we batch their commits? Should we keep them from running at night? Should we take all this new helpfulness and squeeze it back into a shape finance recognizes?

The instinct is understandable. A tripled bill has a way of making everyone in the room rediscover fiscal discipline. But the more interesting question is what the bill is measuring.

Agentic development is beginning to expose a strange old assumption buried inside the software delivery stack: humans stop. They sleep. They eat dinner. They disappear into weekends, school pickups, dentist appointments, airport delays, holidays, and the ordinary mercy of being embodied. The modern engineering system was never exactly designed with tenderness for those limits, but it quietly benefited from them. CI capacity, deployment calendars, pull request review, on-call rotations, cost forecasts, and monthly budgets all learned to live around a human activity curve.

Nights and weekends were doing more work than we admitted. They were load-bearing pauses.

That becomes easier to see once the worker no longer has a body. OpenAI describes Codex as a cloud-based software engineering agent that can run many tasks in parallel, each inside its own isolated environment, reading and editing files, running tests, and returning logs and test outputs for human review. GitHub's Copilot cloud agent docs describe a similar pattern: the agent can research a repository, plan, change code on a branch, and optionally open a pull request. The work happens in an ephemeral development environment powered by GitHub Actions.

That last detail is small only if you read it too quickly. The agent is not just typing beside a person. It is entering the operational substrate. It is invoking runners, tests, linters, logs, branches, storage, pull request workflows, review queues, and every policy that used to be triggered mostly by people during the hours when people tended to work. GitHub is admirably plain about the money part too: Copilot cloud agent uses GitHub Actions minutes and AI credits, and Copilot code review consumes Actions minutes on private repositories.

So when the activity triples and the bill triples, something has gone wrong only if the organization thought it was buying magic. The cleaner explanation is less glamorous: the cost curve started telling the truth. More work ran. More tests executed. More environments spun up. More agent sessions consumed tokens. The old forecast was not a neutral baseline; it was a portrait of human pacing disguised as infrastructure economics.

That does not make the higher bill good. It makes it legible.

The danger is responding to legibility as if it were failure. A team can absolutely make the dashboard prettier by forcing agent work back into human patterns. Batch the small commits into enormous pull requests. Limit agent runs to business hours. Make the agent wait until Tuesday because the budget owner likes Tuesdays. Those choices may be necessary in some environments, especially when cost controls are immature or when review capacity is thin. But they should be named honestly. They are not pure efficiency improvements. They are attempts to reintroduce the human clock as a cost-control device.

Sometimes that clock was useful. A pause can protect quality. A merge freeze can protect customers. A weekend can save a tired reviewer from approving the tenth plausible patch of the day with the slack-jawed optimism of someone who has been staring at YAML since lunch. Human limits gave systems a crude rhythm of recovery, and crude rhythms are still rhythms.

But crude is the important word. The biological calendar was doing accidental governance. It gave review queues time to drain, deployment failures time to surface before the next wave, and infrastructure budgets a predictable quiet zone. It also hid work that might have been better surfaced earlier. It rewarded teams whose human schedules happened to fit the cost model. It made "normal usage" look like a technical property when it was partly a sleep schedule.

Agentic development removes that natural taper. The agent does not get foggy at 9:40 p.m. It does not decide to save the refactor for Monday because Sunday night carries a particular spiritual texture. It can keep producing plausible, reviewable, test-triggering work while the humans have gone home and the dashboards remain professionally awake. The system now has to decide, deliberately, which pauses are still valuable and which were merely inherited from biology.

That decision will be harder than buying bigger runners.

Review infrastructure is the first pressure point. Pull requests were built around a social expectation that a human would eventually arrive with context, taste, and enough freshness to say something useful. Continuous agent output can turn that expectation into theater. If agents create more small changes than reviewers can meaningfully inspect, the team either slows down at the review gate or starts approving on vibes plus green checks. If the team responds by bundling agent work into larger pull requests, it may save some workflow overhead while making review worse. The reviewer gets fewer doors to open, but each one has a room behind it.

Deployment is next. Many organizations still rely on quiet periods, even if they would never describe the architecture that way. Release risk is often managed by timing: do not ship right before the weekend, do not roll out during the holiday, do not merge the risky thing while the person who understands it is on a plane. Continuous agent production does not erase those rules. It makes them explicit. Teams need deployment windows, rollback paths, blast-radius controls, and incident staffing that match a world where code can be prepared around the clock even if humans cannot safely absorb its consequences around the clock.

Budgeting may be the most culturally revealing part. A usage spike feels like waste when the budget was built from last year's human-paced baseline. But historical spend is a poor compass when the labor model changes. A better budget asks different questions. Which agent work is worth paying for? Which tests should always run, which should be staged, and which are ornamental confidence rituals? Which repositories deserve 24/7 automation, and which should have an explicit shelf? Where should the system stop itself because the next unit of work is no longer improving quality?

There is a version of agentic development that treats every limit as a failure of ambition. That version will produce impressive charts and exhausted organizations. There is another version that treats pauses as part of the interface. The pause before a risky deploy. The pause that lets a reviewer handle five small changes with care instead of fifty with performance anxiety. The pause that stops an agent after three failed test loops and asks whether the task is underspecified. The pause that says a budget alert is not a scolding parent, but a signal that output and cost need to be designed together.

Microsoft's WorkLab piece on agents as software users argues that enterprise software was built around the assumption that the primary user was human, and that agents now change the user class inside the stack. That is true at the interface layer, but it is also true underneath it. The new user is not only clicking fewer menus. It is consuming compute differently, creating review artifacts differently, and pressing on operational rhythms that used to be buffered by lunch.

The humane answer is not to pretend agents are people with better stamina. It is to stop outsourcing infrastructure discipline to human fatigue. If pauses matter, design them. If costs should rise with output, model them. If review quality is scarce, protect it with smaller reviewable units, better evidence, and real queue limits instead of heroic attention. If deployment risk has no natural overnight quiet zone anymore, create explicit quiet zones and make them visible.

The old system had a bell curve because people did. Agentic development flattens that curve, and the first invoices are simply the sound of the floor being moved. The work ahead is not to put the old clock back on the wall and call it governance. It is to decide, with more honesty than we needed before, which rhythms belong to humans, which belong to machines, and which belong to the shared system that has to keep both from burning through the budget, the backlog, and everyone who still has to review the diff.