Field Notes

AI Workflows Have Scar Tissue

2026-06-125 min readAIWorkSystems

A prompt stack is often a record of old failures, anxious fixes, and institutional memory. Smarter models do not erase that history; they make us decide what still deserves to survive.

Somewhere inside a production AI workflow, there is probably an instruction that says not to begin every email with "I hope this message finds you well." Nearby, another instruction explains that the customer is already angry, the legal disclaimer must remain intact, the source links cannot be invented, and the word "delighted" should under no circumstances appear in a message about a denied insurance claim.

None of these rules arrived as a grand design. They accumulated. A model behaved strangely, somebody noticed, and a sentence was added. Then another. The prompt became a small municipal code written after a series of highly specific accidents. Do not park here. Do not mention the competitor. Do not summarize the table before reading the footnotes. Please stop using semicolons like you have just discovered them.

This is why the latest model upgrades are more interesting than their benchmark charts. OpenAI's guidance for GPT-5.5 explicitly says to treat it as a new model family rather than a drop-in replacement, begin with a fresh baseline, and avoid carrying every instruction from an older prompt stack. The new model may need less scaffolding, different tool descriptions, clearer success criteria, and a new balance of reasoning, cost, warmth, and brevity.

In other words, the smarter model has arrived, and one of its first requests is that we clean out the garage.

That sounds technical. It is also organizational archaeology. A mature prompt is rarely just a prompt. It contains product decisions, support policies, quality standards, compliance language, tone anxieties, edge cases, and the memory of a Wednesday afternoon when the bot confidently refunded something no one had purchased. Some instructions are essential. Some are obsolete. Some were always superstition but became policy because the output improved once and everyone was too busy to investigate.

The prompt stack has scar tissue.

Scar tissue is not the same as damage. It is evidence that a system adapted. The trouble begins when adaptation becomes indistinguishable from intent. A team inherits a forty-page instruction file and assumes every line expresses a deliberate standard. Often it expresses a former model's habit. "Always think step by step" may be compensating for weak reasoning. A rigid response template may be holding together a model that wandered. Six warnings about tool use may have been written before the tool schema improved. The workflow still runs, so nobody wants to touch the strange sentence supporting the ceiling.

Model churn makes that hesitation expensive. Anthropic's current deprecation schedule includes June and August 2026 retirement dates and advises teams to test replacements thoroughly before the old models disappear. Google's Gemini schedule shows preview models with shutdown dates only months after release. Microsoft documents cases where hosted deployments can be automatically upgraded to a new model, while provisioned customers must migrate manually. OpenAI's own deprecation page now covers not only models but reusable prompts, eval infrastructure, and agent-building tools.

The practical lesson is that an AI product is not sitting on a foundation. It is standing on a moving walkway while someone replaces the handrail.

Teams usually understand migration as an engineering task: change the model name, run the tests, watch latency, compare cost, deploy gradually. All necessary. But the more consequential work is deciding which learned constraints still belong to the product. That decision cannot be delegated entirely to the people closest to the API. The support lead knows why a particular apology matters. The lawyer knows which phrase is mandatory and which was merely preferred by someone who left eighteen months ago. The designer knows that the old model needed a blunt command to sound human, while the new one becomes syrupy if asked for warmth twice.

This is maintenance of institutional judgment. It needs owners, examples, and time.

A humane migration starts by separating the contract from the coping mechanism. The contract is what must remain true for the person on the other side: the answer is accurate, the refusal is understandable, the recommendation names uncertainty, the action does not exceed permission, the tone fits the moment. The coping mechanism is how an earlier model was persuaded to honor that contract. When the model changes, the contract should stay visible while the mechanism earns its place again.

This is where representative examples matter more than prompt folklore. Put the new system in front of the awkward cases: the grieving customer, the ambiguous spreadsheet, the codebase with two conventions and one exhausted maintainer, the medical question that sounds casual but is not. Ask whether the work still feels careful. A migration can improve an aggregate score and make one important human experience noticeably worse. The dashboard may call that acceptable variance. The person receiving the cold, efficient answer will use a different phrase.

There is a health question here too. Constant model change can turn teams into permanent interpreters of machine temperament. Every few months the voice shifts, defaults move, tools behave differently, and the people maintaining the system must rediscover what "good" means. Without a migration budget, that work becomes invisible overtime performed by whoever is most bothered by quality. Usually this is the same person who notices that the new assistant is technically correct and somehow less kind.

The industry likes to describe model progress as capability arriving. In practice, capability also rearranges responsibility. Better models can remove pages of defensive instruction, but only if somebody is allowed to examine why those pages existed. They can make workflows simpler, but simplicity is not automatic. It is the result of careful forgetting.

That may be the real discipline required by rapidly improving AI: not merely learning what the new system can do, but deciding what the old system taught us about ourselves. The durable asset was never the incantation. It was the standard hiding inside it, waiting to be named clearly enough that it could survive the next intelligence.