The agent starts sharp. It grasps the goal, the codebase, the constraints, and moves. An hour in, something shifts. It reopens closed questions. It cites instructions you superseded twenty messages ago. It makes a local fix that quietly breaks the architecture. It hesitates over things it was fluent on an hour earlier, and turns wordy where it used to be decisive.
The model didn't get worse. The session rotted.
Context rot is the degradation of an AI system's effective intelligence as its working context fills up with noise, contradiction and stale reasoning. It is the most important practical limit on agent productivity right now, and almost no one is pricing it correctly.
Rot is not the context window
The hard limit is a separate problem. When you hit it, the system truncates and you notice. Rot happens earlier, and quietly. Output stays fluent; the signal-to-noise ratio collapses underneath it.
A mature session typically carries superseded instructions, outdated tool results, failed experimental branches treated as live, contradictory plans, lossy file summaries, abandoned reasoning traces, and architectural decisions buried under three rounds of debugging. The model attends to all of it. It does not reliably judge which tokens are authoritative or fresh. More context becomes more distraction. The agent feels forgetful, inconsistent, strangely unreliable — a personality shift produced entirely by sediment.
The research points in one direction
The "Lost in the Middle" work showed models reliably use information at the start and end of long contexts and lose the middle. RULER and follow-on benchmarks demonstrated that real-world tasks — multi-hop reasoning, aggregation, structured retrieval — degrade sharply with length and complexity well inside advertised windows. More recent studies show performance dropping on long inputs even with perfect retrieval.
The plain reading: advertised context length is not effective context length. And the worst distractors aren't random noise but plausibly related material — the rejected bug theory, the near-miss requirement, the version of the plan you abandoned. Rot poisons most efficiently with content that looks like signal.
Agents make it worse because agents act
In a chat interface, rot produces mediocre answers. In an agent, rot produces compounding damage, because every wrong inference becomes a tool call, a file edit, a committed test. This is the same dynamic we covered in our piece on dark code: software that runs but no human fully understands, partly because the agent that wrote it was working from a context that had already drifted.
Modern coding agents work largely because they were handed tools — filesystem, terminal, tests, search, version control — so they can gather context instead of depending on whatever the user pastes. But every tool call deposits sediment. A short run is clean. A long run is an archaeological dig through its own history, and the agent is the archaeologist who can't tell which layer is current.
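To make the mechanics concrete, here is a deliberately naive agent loop in Python. It is a sketch under assumed names (`call_model`, `run_tool` and `Reply` stand in for whatever model API and tool runner a real harness uses), not any vendor's implementation. The structural point is that the transcript only ever grows:

```python
from dataclasses import dataclass

@dataclass
class Reply:
    text: str
    tool_call: str | None  # name of a tool to run, or None when done

def call_model(transcript: list[dict]) -> Reply:
    # Stand-in for a real model API. Here it asks for exactly one tool
    # call and then stops, so the loop below terminates.
    ran_tool = any(m["role"] == "tool" for m in transcript)
    return Reply("done", None) if ran_tool else Reply("checking", "run_tests")

def run_tool(name: str) -> str:
    return f"[output of {name}]"  # stand-in for filesystem/terminal/tests

def agent_loop(task: str, max_steps: int = 50) -> str:
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(transcript)  # the model sees ALL prior sediment
        transcript.append({"role": "assistant", "content": reply.text})
        if reply.tool_call is None:
            return reply.text
        transcript.append({"role": "tool",
                           "content": run_tool(reply.tool_call)})
        # Note what is missing: no pruning, no supersession, no notion of
        # which entries are still authoritative. Every iteration only adds.
    return transcript[-1]["content"]

print(agent_loop("fix the failing import"))
```

Everything a production harness layers on top (summarisation, pruning, supersession rules) exists to fight exactly this monotonic growth.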
The harness is the product
Nobody uses a raw model. Everyone uses a harness — the layer that controls reasoning effort, history retention, summarisation, tool policies and prompt stability. (We dug into the public design choices behind one such harness in our piece on the Claude Code architecture paper.)
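To give that layer a concrete shape, here is a minimal sketch of the knobs involved. The schema is hypothetical, invented for illustration; no shipping harness exposes exactly these fields:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical shape of the knobs a harness controls. The point is
    that every field changes what users experience as 'the model'
    without touching a single weight."""
    system_prompt_version: str = "2026-04-01"  # prompt stability
    reasoning_effort: str = "high"             # effort default
    max_transcript_tokens: int = 120_000       # history retention
    summarise_after_tokens: int = 80_000       # when to compact history
    tool_allowlist: tuple[str, ...] = ("read", "edit", "test", "search")
    cache_prompt_prefix: bool = True           # the kind of caching change
                                               # that can silently regress
```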
Anthropic's April 2026 Claude Code postmortem made this visible. Three separate product-layer changes, shipped between March and April, had degraded Claude Code performance for a significant portion of users; none of them touched the underlying model weights. The raw API was unaffected throughout; the damage was confined to Claude Code, the Agent SDK and Cowork. A reasoning-effort default, a caching change, a prompt edit — three knobs in the harness, and perceived intelligence dropped for six weeks.
For serious work, the harness is the product. A strong model in a sloppy harness routinely loses to a weaker model in a clean, inspectable one. This is uncomfortable for buyers who want to compare models on a leaderboard. The leaderboard is measuring something real, but not the thing that determines whether the agent ships your project.
Context as a means of production
The economics follow. As agents move into the centre of knowledge work, control over high-quality context becomes a moat.
The tiers already exist. Free and short-context users get casual help. Paid users get longer runs and better defaults. Teams with API access and engineering discipline build orchestration, retrieval and custom harnesses around the model — often with MCP servers wired into their own data sources. The gap between these tiers is widening faster than the gap between the underlying models.
The real divide is not AI users versus non-users. It is between those who consume AI output and those who direct clean, high-signal agent workflows. Tokens buy attempts. Context discipline determines how many of those attempts turn into useful work, and the teams that own the context layer capture the gains.
Small teams can win this
Small AI-native teams often outpace larger organisations because the domain expert drives the agent directly. The video editor who knows what makes a thumbnail convert, the accountant who knows where reconciliation actually breaks — they translate judgment into workflow without waiting for central IT to specify it.
The advantage is real but conditional. Without context hygiene, every automation becomes another swamp, and the small team's speed turns into a faster route to incoherence.
The same dynamic plays out at scale, in reverse. Large organisations have human context rot already — stale docs, contradictory policies, abandoned tickets, three competing sources of truth. Wiring an LLM into that knowledge base raises confidence without raising accuracy. The agent now articulates the contradictions fluently.
Symptoms
You are watching context rot when the agent repeats plans it already executed, forgets explicit constraints, follows superseded instructions, makes local fixes that break global architecture, expresses high confidence on files it only partially read, treats failed attempts as evidence for new ones, grows verbose without growing clearer, and cannot explain why it just made the decision it made.
Any one of these is noise. Three at once means restart.
Fighting it
Fighting context rot is not exotic. Five practices matter most.
- Scope smaller. Give the agent one job, explicit constraints and a definition of done.
- Restart earlier. A distilled brief beats a bloated session.
- Keep ground truth outside chat. Project, architecture and decision files should carry durable memory.
- Make tests carry constraints. Executable checks do not suffer attention decay; see the sketch after this list.
- Own the architecture. Let the agent implement inside a system shape the human still understands.
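The fourth practice is the easiest to make concrete. The sketch below encodes one architectural rule as a test: code under `app/` must not import the `db` package directly (both names are hypothetical; adapt them to your layout). An agent can forget an instruction buried two hundred messages back, but it cannot merge past a failing check:

```python
# test_architecture.py -- an architectural constraint as an executable
# check, runnable under pytest. Module names (app/, db) are illustrative.
import ast
import pathlib

FORBIDDEN = "db"  # app code must go through the service layer, not db

def test_app_layer_never_imports_db_directly():
    offenders = []
    for path in pathlib.Path("app").rglob("*.py"):
        tree = ast.parse(path.read_text())
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            else:
                continue
            if any(n == FORBIDDEN or n.startswith(FORBIDDEN + ".")
                   for n in names):
                offenders.append(f"{path}:{node.lineno}")
    assert not offenders, f"app/ imports db directly: {offenders}"
```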
Beyond that, serious teams need to treat the harness itself as infrastructure: prompt stability, context policy, reasoning budget, observability and rollback.
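One low-effort version of that observability, sketched here as a suggestion rather than an established practice: fingerprint the effective system prompt and config on every run, so a quiet harness change shows up as a changed hash in the logs instead of weeks of vague degradation.

```python
# Fingerprint the harness, not just the model. All names illustrative.
import hashlib
import json

def harness_fingerprint(system_prompt: str, config: dict) -> str:
    blob = system_prompt + json.dumps(config, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

config = {"reasoning_effort": "high", "max_transcript_tokens": 120_000}
print("harness", harness_fingerprint("You are a coding agent...", config))
# Log this with every agent run; when quality regresses, diff the
# fingerprints first, then roll back to the last known-good one.
```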
The new literacy
Context rot exposes something easy to forget about AI. The intelligence isn't summoned out of thin air — it depends on the workspace you prepare for it.
The highest-leverage skill is no longer prompting and not exactly coding. It is agent literacy: directing semi-autonomous systems through clean, inspectable, high-signal processes. Knowing when to restart, when to retrieve, when to delete, and when to stop.
Agents don't replace workers wholesale. They amplify skilled operators, expose weak processes and punish poor context discipline. The frontier is cleaner context, not bigger models.
Pay the context tax consciously and AI becomes leverage. Ignore it and even the best model eventually feels erratic, expensive and a little untrustworthy — which is exactly the experience many users now report.