What Actually Happens Inside Claude Code or Cursor
How AI coding agents actually work under the hood - explaining the agent loop, prompt caching and context compaction with practical takeaways.

If you've been using Claude Code, Cursor, or any other AI coding assistant and wondered what's actually happening between you typing a request and the agent making three edits to your codebase, the answer is more mechanical than the demos suggest. OpenAI published the architecture of their Codex agent in early 2026, and the same pattern (with small variations) appears to be what most other agents use.
Understanding it explains a lot of the weirdness: why your session gets slower after an hour, why restarting helps, why some tools work better than others.
What is the agent loop?
The thing in the middle of every AI agent is a six-step loop:
1. Take everything in the conversation so far (your messages, the agent's replies, tool outputs) and stitch it into one big prompt.
2. Send the whole prompt to the model.
3. The model replies, either with text for you, or with a request to use one of its tools (read a file, run a command, search the web).
4. If it's a tool request, the agent runs the tool and captures the output.
5. Append the tool output to the conversation.
6. Go back to step 1.
Most chat tools you've used hide this loop completely. But the loop is what's running, and once you see it, several things make more sense.
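The loop is small enough to sketch in a few lines of Python. This is an illustration only, not Codex's or Claude Code's actual code: `fake_model`, `TOOLS`, `run_turn` and the message format are all made up for the example, and the "model" is a stub that asks for one tool call and then answers.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking one string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"<contents of {path}>",
}


def fake_model(prompt: str) -> dict:
    """Stand-in for a real model call: requests one tool, then replies with text."""
    if "TOOL_OUTPUT" not in prompt:
        return {"type": "tool_call", "tool": "read_file", "arg": "main.py"}
    return {"type": "text", "text": "Done: I read main.py."}


def run_turn(history: list[str], user_message: str,
             model: Callable[[str], dict] = fake_model,
             max_steps: int = 10) -> str:
    """Run the loop until the model answers with text instead of a tool request."""
    history.append(f"USER: {user_message}")
    for _ in range(max_steps):
        prompt = "\n".join(history)      # stitch the whole conversation into one prompt
        reply = model(prompt)            # send it and read the model's reply
        if reply["type"] == "text":      # plain text ends the loop
            history.append(f"AGENT: {reply['text']}")
            return reply["text"]
        output = TOOLS[reply["tool"]](reply["arg"])   # run the requested tool
        history.append(f"TOOL_CALL: {reply['tool']}({reply['arg']})")
        history.append(f"TOOL_OUTPUT: {output}")      # append output, then loop again
    raise RuntimeError("agent did not finish within max_steps")
```

Note that `history` only ever grows, and all of it is joined into the prompt on every pass. That is the detail the rest of this post turns on.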
Why do long sessions get slow?
Because of the loop's mechanics, the cost of each turn is roughly proportional to the size of the whole conversation so far. After 10 turns, the agent is re-sending 10 turns' worth of text. After 50 turns, 50 turns' worth. Added up over the lifetime of a session, the total text processed grows roughly as O(n²) - quadratic in the number of turns.
That isn't a bug. It follows from how these models are called. The model keeps no memory between calls, so the whole conversation history is part of the input on every call, and inside the transformer (the architecture underneath ChatGPT, Claude, and the rest) every token has to attend to the tokens that came before it.
The practical consequence: a 50-turn session doesn't just cost 5x what a 10-turn session does. Each late turn carries roughly 5x as much input as turn 10, and the session as a whole has processed something like 25x as much text. The model is also less reliable at retrieving information from earlier in the conversation as it goes (the context-rot problem covered in a separate post).
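A back-of-the-envelope check of that arithmetic, under two simplifying assumptions that are mine rather than anything measured: every turn adds the same amount of text (500 tokens is an arbitrary figure), and nothing is cached or compacted.

```python
def tokens_sent_on_turn(turn: int, tokens_per_turn: int = 500) -> int:
    """Input size on a given turn: the whole history so far is re-sent."""
    return turn * tokens_per_turn


def total_tokens_processed(turns: int, tokens_per_turn: int = 500) -> int:
    """Cumulative input over a session: 1 + 2 + ... + n turns' worth of text."""
    return sum(tokens_sent_on_turn(t, tokens_per_turn) for t in range(1, turns + 1))


per_turn_ratio = tokens_sent_on_turn(50) / tokens_sent_on_turn(10)
cumulative_ratio = total_tokens_processed(50) / total_tokens_processed(10)
print(f"input on turn 50 vs turn 10: {per_turn_ratio:.1f}x")
print(f"whole session, 50 turns vs 10: {cumulative_ratio:.1f}x")
```

Under these assumptions the input on turn 50 is 5x the input on turn 10, and the 50-turn session has processed about 23x as much text in total. The sum 1 + 2 + ... + n is n(n+1)/2, which is where the quadratic growth comes from.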
How do agents try to soften this?
Two main tricks, both visible in the published Codex architecture:
1. Prompt caching. Sending the same opening 5,000 tokens of system prompt and tool definitions over and over is expensive. The model providers (OpenAI, Anthropic, Google) cache the work they did processing those tokens the first time, so on every subsequent call they just look up the cached state and start processing from where the conversation diverges. The catch: the cache only hits when the prefix of the prompt is exactly the same byte-for-byte. A single re-ordering - say, the tool definitions getting listed in a different order on call #2 vs call #1 - breaks the cache and forces a full reprocess.
OpenAI specifically called out fixing non-deterministic tool ordering as a major optimisation when building Codex. The lesson for anyone wiring up MCP servers or custom tools: keep the order stable. It will materially affect speed and cost.
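To see why ordering matters, here is a small sketch that compares how much of a prompt prefix survives when the tool list arrives in a different order. It is an illustration under assumptions: real provider caches work on tokens and have their own matching rules, and `build_prefix`, `shared_prefix_len` and the tool definitions are made up for the example. Sorting by name is just one way of getting a stable order.

```python
import json


def build_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialise the static part of the prompt in a deterministic order."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return system_prompt + "\n" + json.dumps(ordered, sort_keys=True)


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# The same two hypothetical tools, discovered in a different order on each call.
tools_call_1 = [
    {"name": "read_file", "description": "Read a file"},
    {"name": "run_command", "description": "Run a shell command"},
]
tools_call_2 = list(reversed(tools_call_1))

naive_1, naive_2 = json.dumps(tools_call_1), json.dumps(tools_call_2)
stable_1 = build_prefix("You are a coding agent.", tools_call_1)
stable_2 = build_prefix("You are a coding agent.", tools_call_2)

print("naive serialisation shares", shared_prefix_len(naive_1, naive_2), "of", len(naive_1), "characters")
print("stable serialisation identical:", stable_1 == stable_2)
```

With the naive serialisation the two prompts diverge within the first tool's name, so almost nothing after that point could be reused. With a fixed order the prefix is identical on both calls.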
2. Context compaction. When the conversation gets long enough to start crowding the model's context window, the agent triggers a compaction step - the model is asked to summarise its own state into a compact representation. OpenAI's version, callable via /responses/compact, produces an opaque encrypted blob that's smaller than a text summary and (usefully) doesn't have to be human-readable to be useful to the next call. Anthropic's Claude Code does something similar.
The trade-off: the compaction is lossy. Whatever decision-making the agent did in turn 7 is now represented by a compressed sketch of turn 7, not the full transcript. You lose detail. But you stay under the context window, and the agent doesn't immediately fall over.
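A toy version of the idea, to make the trade-off concrete. This is not how /responses/compact or Claude Code works internally. The 4-characters-per-token estimate, the `summarise` stub (which keeps only the first 40 characters of each message) and the `keep_recent` parameter are all assumptions chosen for illustration; a real agent would ask the model to write the summary.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def summarise(messages: list[str]) -> str:
    """Stand-in for a model-written summary: keeps the first 40 characters of each message."""
    return "SUMMARY: " + " | ".join(m[:40] for m in messages)


def compact(history: list[str], budget: int, keep_recent: int = 4) -> list[str]:
    """If history exceeds the token budget, replace older messages with one summary."""
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history                      # still fits: leave everything verbatim
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarise(older)] + recent      # lossy: older detail is gone for good


history = [f"turn {i}: " + "x" * 200 for i in range(20)]
compacted = compact(history, budget=500)
print(len(history), "messages before,", len(compacted), "after")
```

The most recent messages survive word for word, and everything older is reduced to whatever the summary happened to keep. Anything the summary dropped is no longer available to the next call.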
What is the 'agent harness' bit?
Around the agent loop is a layer of plumbing that handles the bits the loop itself doesn't care about: how to run a tool safely (sandboxing), how to stream output back to the user, how to persist a session, how to talk to many different client UIs at once. That layer is the 'harness'.
OpenAI's Codex harness has three primitives worth knowing:
- Item - the atomic unit of input or output. Starts, streams data, completes. Every message and tool call is an item.
- Turn - one full agent loop, from your input to the model's reply.
- Thread - the durable container that holds turns. You can leave it and resume later; the thread persists.
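One way to picture how the three primitives nest is as plain data classes. The field names, status values and methods below are hypothetical, picked to mirror the descriptions above; they are not the Codex harness's real types.

```python
from dataclasses import dataclass, field
from enum import Enum


class ItemStatus(Enum):
    STARTED = "started"
    STREAMING = "streaming"
    COMPLETED = "completed"


@dataclass
class Item:
    """Atomic unit of input or output: starts, streams data, completes."""
    kind: str                       # e.g. "user_message", "tool_call", "agent_message"
    content: str = ""
    status: ItemStatus = ItemStatus.STARTED

    def stream(self, chunk: str) -> None:
        self.status = ItemStatus.STREAMING
        self.content += chunk

    def complete(self) -> None:
        self.status = ItemStatus.COMPLETED


@dataclass
class Turn:
    """One full pass from user input to the model's reply: a list of items."""
    items: list[Item] = field(default_factory=list)


@dataclass
class Thread:
    """Durable container of turns that can be left and resumed later."""
    thread_id: str
    turns: list[Turn] = field(default_factory=list)


reply = Item(kind="agent_message")
reply.stream("Hel")
reply.stream("lo")
reply.complete()
thread = Thread(thread_id="t1", turns=[Turn(items=[reply])])
```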
What's clever in OpenAI's design - and the same is broadly true for Claude Code - is that the harness is one shared library written in Rust, and the different clients (CLI, VS Code extension, web, macOS app, JetBrains, Xcode) all talk to it through the same protocol (JSON-RPC over stdio, called the App Server). It means every client has identical behaviour without code duplication, which is the difference between Codex feeling consistent across surfaces and the older generation of agents feeling slightly different in each tool.
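For a feel of what "JSON-RPC over stdio" means in practice, here is a generic sketch of how a client might frame a request and read a reply. It follows the general JSON-RPC 2.0 message shape; the method name `example/startThread`, its parameters, the `threadId` field and the one-message-per-line framing are assumptions for illustration, not the App Server's documented protocol.

```python
import json


def encode_request(request_id: int, method: str, params: dict) -> str:
    """Build one JSON-RPC 2.0 request as a single line, ready to write to the server's stdin."""
    msg = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return json.dumps(msg) + "\n"


def decode_message(line: str) -> dict:
    """Parse one line read from the server's stdout, raising if it carries an error."""
    msg = json.loads(line)
    if "error" in msg:
        raise RuntimeError(f"server error: {msg['error']}")
    return msg


# Hypothetical method name and parameters, for illustration only.
request_line = encode_request(1, "example/startThread", {"cwd": "/tmp/project"})
reply = decode_message('{"jsonrpc": "2.0", "id": 1, "result": {"threadId": "t1"}}')
```

Because every client speaks the same small message format to the same process, a new surface only needs to write and read these messages rather than reimplement the loop.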
What can I do with this knowledge?
Three practical takeaways for anyone using these tools day-to-day:
- Restart sessions sooner than feels necessary. The quadratic cost of long sessions is real. If you've been working with the same chat for an hour and noticing slow responses or odd mistakes, start a new session and paste in only the relevant context. You'll get faster answers and more accurate ones.
- Keep your custom tool / MCP setup stable. If you've added MCP servers to Claude Code or Cursor and you're getting unexpectedly slow responses, check that your tools are loading in the same order every time. Non-deterministic tool ordering is the silent killer of prompt-cache hit rates.
- Trust the compaction less than the full transcript. When an agent compacts your session and you continue, the model is now reasoning from a summarised version of what you discussed. For high-stakes decisions, paste the actual constraint or fact back in rather than relying on the agent to have retained it through compaction.
Is this all going to change?
The high-level loop won't. The model providers keep optimising the constants - faster prompt caching, smarter compaction, larger context windows - but the underlying O(n²) growth comes from feeding the whole history back through a transformer on every call, and engineering polish softens that cost rather than removing it. What might shift is how aggressively the harness manages context behind the scenes; the Codex /responses/compact endpoint is one early example, and Anthropic's published 'memory' features are another.
For now: same loop, same trade-offs, same practical advice. Restart sooner, keep configs stable, don't trust compaction with the important details.
Frequently asked questions
Q01. Is Claude Code's architecture the same as Codex's?
Q02. Why does my long session sometimes contradict things it told me earlier?
Q03. Does prompt caching apply to my paid ChatGPT or Claude subscription too?
Q04. Are agents going to get cheaper as caching improves?
Q05. Is there a way to see how many tokens my session has used?