Context Rot: Why Long AI Sessions Get Worse, Not Better
Long ChatGPT and Claude Code sessions quietly degrade as the conversation grows - here is why, what real engineering teams do, and your fix.

If you have ever felt that your ChatGPT chat or your Claude Code (Anthropic's terminal-based coding assistant) session has gotten dumber after the first hour, you are not imagining it. There is a name for what is happening, peer-reviewed research that documents it, and a small set of practical habits that fix most of it.
This guide explains context rot in plain English, walks through the analogy that actually helps people remember it, summarises what Slack's engineering team published on how they keep long-running agents on the rails, and ends with the half-dozen habits you can adopt today.
What is context rot, really?
Modern AI chat tools are powered by large language models or LLMs (the prediction engines behind ChatGPT, Claude, Gemini, and the rest). Every LLM has a fixed context window - the maximum number of tokens it can process in a single prediction. Claude Opus, Anthropic's flagship Claude model designed for complex reasoning tasks, 4.7 (Anthropic's largest model) and GPT-4.1 sit at around one million tokens. Gemini 2.5 goes to two million. Numbers that, taken at face value, suggest you could hand the model the entire Lord of the Rings trilogy and still have room.
The numbers lie. Or rather, they describe a theoretical capacity that the model does not actually use evenly. Stanford's 2023 paper "Lost in the Middle", later published in MIT Press's TACL journal, was the first widely-cited demonstration: when you put a critical fact at the very start or the very end of a long context, models retrieve it reliably. Bury the same fact in the middle, and accuracy drops sharply. A U-shaped accuracy curve, with the bottom of the U exactly where you do not want it.
Chroma's 2025 "Context Rot" research then tested 18 frontier models (GPT-4.1, Claude Opus 4, Gemini 2.5, Qwen3, and the rest) and found every single one of them got worse as input length grew. Their headline finding: a model with a 200,000-token advertised window can show significant degradation already at 50,000 tokens. That is the same model. Same prompt. Same task. The only thing that changed is how much you stuffed in beforehand.
Why does this happen at all?
The least-hand-wavy mechanism people point to is zero-sum attention. Each token the model generates has to spread a finite "attention budget" across every preceding token. The more tokens in front of it, the smaller the slice of attention each gets. Imagine you have one minute to brief a colleague on everything you know about a project. If the project is small, you give every detail real airtime. If it has been running for three years, the briefing is still one minute, but each detail gets a sliver.
Same model. Same intelligence. Less attention per item. That is most of what you are feeling when a long Claude Code session starts losing the plot.
How is this different from a chatbot getting confused?
Short ChatGPT chats rarely run into context rot - the budget per token is generous because the input is small. Where this bites hardest is long-running agents: an LLM driving multi-step work over time, like Claude Code editing a codebase, Cursor refactoring across files, or any autonomous research / customer-support setup that keeps adding tool outputs to the running context.
The analogy that sticks: imagine an assistant taking notes during a two-hour meeting. After the first ten minutes, the notepad is small and the relevant page is on top. After two hours, the assistant is flipping through 30 pages of paper trying to remember what was decided at minute 47. They have the same brain. The information is technically all there. But finding the right bit is now a different problem.
That is exactly what an agent looks like at hour two of a session - more context, less reliable retrieval, more confident-sounding mistakes.
How does Slack keep agents on track?
In an April 2026 engineering post, Slack's Dominic Marks described the architecture his team built for their long-running investigation agents. It is one of the cleaner real-world write-ups of how to deliberately fight context rot, and the pattern transfers to almost any agentic system.
Three things stand out:
1. They separate roles deliberately. A Director agent orchestrates the work. Expert agents do specialist sub-tasks. A Critic agent, running on a stronger model, reviews the experts' output for credibility. Each agent only sees what it needs - none of them carry the whole conversation.
2. They share state through structured channels, not chat history. Three artefacts pass between agents: a Director's Journal (six entry types - decision, observation, finding, question, action, hypothesis), a Critic's Review (every finding scored 0.0 to 1.0 on a 5-level credibility scale), and a Critic's Timeline (a chronological narrative that removes duplicates and flags gaps). Slack's most load-bearing architectural commitment from the post: "Besides these resources, we do not pass any message history forward between agent invocations." No transcripts. No rolling chat log. Just the three structured outputs that survive each round.
3. They run the system at real scale. Over 170,000 critic-graded findings: 37.7% trustworthy, 25.4% highly plausible, 15.4% misguided. The remaining ~21% were filtered as low-credibility before they could pollute the next round. The credibility scoring is the gate that stops bad findings from accumulating in the context the next agent sees.
The takeaway for anyone outside Slack: structured working memory plus periodic reflection beats a giant rolling chat log. Every time.
What does this mean for Claude Code or Cursor sessions?
Most of us are not building multi-agent investigation systems. But the same pressure applies to a single long Claude Code session that drags through dozens of tool calls. Your session gets slower, gets confused about earlier decisions, asks you to re-clarify things you already explained, or starts hallucinating function names. That is context rot in its everyday form.
The structural lessons translate:
- Start fresh sessions sooner than feels necessary. A new chat is the cheapest reset there is. If a task is genuinely new, start it new. Don't keep the previous hour's history around "in case it's useful".
- Summarise explicitly at the seams. When you move from one phase of work to another (research to design to implementation), ask the model to summarise the decisions so far. Then start the next phase with that summary as a fresh prompt. You have just done what Slack's Critic does.
- Tighten what the model can see. Most agentic tools let you scope which files are in context. Use it. A 50-file project is rarely the right context window for a single-file change.
- Pin the question, not the conversation. If you find yourself re-stating the goal three times because the agent drifted, that is the signal to stop, copy the goal, and start fresh with that goal in the system prompt.
Is the answer just "bigger context windows"?
That is what most marketing departments want you to believe. The research says otherwise. A bigger window gives you more capacity, not better retrieval. The U-curve stays U-shaped at any window size; you have just stretched out the middle where things go wrong.
The 2026 industry shift is away from chasing context-window numbers and toward better context management - the kind of structured-state, summarisation, reflection-pass approaches Slack and others are demonstrating. By one analyst estimate, about 65% of enterprise AI failures in 2025 were attributed to context drift or memory loss during multi-step reasoning. The fix is architectural, not bigger numbers.
What can I do about it today?
Six habits, in rough order of impact:
- Reset early. When a session feels off, don't try to nurse it back. Start fresh and paste in the goal.
- Summarise before context bloats. Every 30-60 minutes of a long session, ask the model to summarise what has been decided. Save the summary. Start the next phase with that summary as the prompt.
- Put the important thing at the very top or the very bottom. The U-curve is real. If a fact must be remembered, lead with it or end with it - never bury it.
- Trim tool output. If your agent's tools return huge dumps (file trees, search results), pre-filter to the rows you actually need. Each unused row eats attention.
- Use a smarter critic on shorter inputs. If you have access to a frontier model and a fast cheap model, run the fast cheap model for bulk work and the frontier model only for the review pass. Slack does this with their Critic.
- Stop measuring agent quality at minute one. Test your agentic setups at the point they start to hurt - hour two, ten tool calls deep. That is where the real degradation lives, and it is the only place worth optimising for.
None of this requires building a new tool. It is mostly a habit shift, and most people get a noticeable improvement in agent reliability within their first long session of trying.
Frequently asked questions
Q01Does context rot affect all LLMs equally?
Q02If a model claims a 1 million token context window, can I trust it for 1 million tokens?
Q03Is starting a new chat 'losing' my previous work?
Q04Does this mean RAG and long-context are dead?
Q05How will I know context rot is hitting my session?
ChatGPT vs Claude vs Gemini: Which Should You Use?
When to Trust AI Answers