Context Engineering for AI Agents: Practical Patterns

How to engineer an AI agent's context: instructions, tools, examples, memory and retrieval - and how the rules change from 1k to 1M tokens.

By Rob · 18 June 2026 · 10 min read

Most people who are disappointed by an AI agent blame the model. The real problem is usually upstream: the agent was handed the wrong information, too much of it, or in the wrong order. Fixing that is a discipline of its own, and it has a name - context engineering. This guide covers the patterns that actually move the needle, and how they shift as your context window grows from a thousand tokens to a million.

What is context engineering?

Context engineering is the practice of curating everything that goes into a model's context window on each turn of an agent's loop. The context window is the fixed token budget a model can read at once; a token is a chunk of text, very roughly three-quarters of an English word. What goes in includes the system instructions, the tool definitions, any worked examples, the running conversation, retrieved documents and the agent's own memory.

It is a broader idea than prompt engineering, which is about wording a single request well. Prompt engineering asks "how do I phrase this question?" Context engineering asks "what should the model be able to see when it answers, and what should it not?" For a one-shot chatbot reply the two overlap. For an agent that runs dozens of steps, calls tools and accumulates history, context engineering is the larger lever - and it is why a well-fed average model often outperforms a frontier model working from a cluttered prompt. I have made the strategic case for that separately in "why context engineering beats picking a smarter model"; this piece is the hands-on companion.

Why does the context window fill up so fast?

Every step an agent takes adds tokens. A single tool call appends the request, the raw result (often verbose JSON), and the model's reasoning about it. Because the whole history is re-read on every step, a twenty-step run can burn six figures of tokens on plumbing before the model has done anything useful.

Two problems follow. The first is cost and latency: you pay for every token read, on every step. The second is subtler and more dangerous - models attend less reliably to information buried in the middle of a long context, a failure mode documented in the research literature as the lost-in-the-middle effect and one I have written about as context rot. A bigger window does not make this go away. It just lets you make the mistake at greater scale.
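To make the cost point concrete, here is a rough back-of-the-envelope sketch. The figures (2,000 tokens of instructions and tools, 1,500 tokens appended per step) are illustrative assumptions, not measurements; the point is only that re-reading a growing context makes the total grow much faster than the context itself.

```python
def total_tokens_read(base_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Sum the context size the model reads across every step of a run."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context            # the model re-reads the whole context on this step
        context += tokens_per_step  # the tool call and reasoning are appended afterwards
    return total


if __name__ == "__main__":
    # Assumed figures for illustration only.
    print(total_tokens_read(base_tokens=2_000, tokens_per_step=1_500, steps=20))  # 325000
```

Under these assumptions the context itself ends at 32,000 tokens, yet the run reads 325,000 tokens in total.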

The five things you are actually engineering

It helps to break the context into the five components you control. Get each one right and most agent failures disappear.

01. Instructions

The system prompt: the agent's role, constraints, output format and rules of engagement. Keep it declarative and specific. Vague instructions ('be helpful') waste tokens; concrete ones ('always return valid JSON matching this schema; never invent an order ID') change behaviour.

02. Tools

The functions the agent can call, and crucially their descriptions. A tool the model cannot understand is a tool it will misuse. Treat each tool description as a mini spec, not an afterthought.

03. Examples

Worked examples (few-shot prompting) that show the model the shape of a good answer or a correct tool call. Two or three sharp examples often beat a paragraph of abstract instruction.

04. Memory

What the agent carries forward between steps and sessions: decisions made, facts learned, the user's preferences. Without a memory strategy, a long-running agent forgets what it already did and repeats itself.

05. Retrieval

Pulling in external knowledge at the moment it is needed rather than pre-loading everything. This is where retrieval-augmented generation (RAG - fetching relevant documents at query time and adding them to the prompt) earns its keep.
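One way to keep the five components separate in code is to hold each in its own field and assemble the context only at the last moment. This is a minimal sketch; the class name, field names and section labels are hypothetical, and a real agent would pass tools and messages through its provider's own API rather than a single string.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    """Holds the five components separately so each can be curated on its own."""
    instructions: str
    tool_descriptions: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)
    retrieved: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Join the non-empty components under labelled sections."""
        sections = [
            ("INSTRUCTIONS", [self.instructions]),
            ("TOOLS", self.tool_descriptions),
            ("EXAMPLES", self.examples),
            ("MEMORY", self.memory),
            ("RETRIEVED", self.retrieved),
        ]
        parts = []
        for title, items in sections:
            kept = [item for item in items if item.strip()]
            if kept:  # skip empty sections so they cost no tokens
                parts.append(f"[{title}]\n" + "\n".join(kept))
        return "\n\n".join(parts)


if __name__ == "__main__":
    ctx = AgentContext(
        instructions="Always return valid JSON. Never invent an order ID.",
        memory=["User prefers metric units."],
    )
    print(ctx.render())
```

Keeping the components apart makes it easy to trim or swap one of them, for example dropping stale memory, without touching the others.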

How do you write a tool description an agent can use?

Tool descriptions are the most under-rated surface in agent design. The model decides whether and how to call a tool almost entirely from its name and description, so write them for the model, not for a human skim-reader.

Three rules carry most of the weight. State when to use the tool, not just what it does - 'use this to look up a live order status by ID; do not use it for historical orders older than 90 days'. Spell out each parameter's format and constraints inline. And describe the shape of what comes back, so the model knows what it can chain next. Anthropic's tool-use documentation and OpenAI's function-calling guide both stress the same point: the description is effectively part of the prompt, so it deserves the same care. In my experience, ambiguous descriptions are one of the most common causes of wrong tool calls.
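As an illustration of the three rules, here is a sketch of a tool definition written as a Python dictionary. The tool name, the order ID format and the return fields are all hypothetical, and the layout only loosely follows the JSON Schema style that several tool-calling APIs accept, so check your provider's exact format before copying it.

```python
# Hypothetical tool definition: names, ID format and return fields are invented.
GET_ORDER_STATUS = {
    "name": "get_order_status",
    "description": (
        # Rule 1: say when to use it and when not to.
        "Use this to look up the live status of an order by its ID. "
        "Do not use it for historical orders older than 90 days. "
        # Rule 3: describe the shape of what comes back.
        "Returns an object with 'status' (pending, shipped or delivered) "
        "and 'updated_at' (an ISO 8601 timestamp)."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                # Rule 2: spell out the format and constraints inline.
                "description": "Order ID: 'ORD-' followed by 8 digits, e.g. ORD-00012345.",
                "pattern": "^ORD-[0-9]{8}$",
            }
        },
        "required": ["order_id"],
    },
}


def undocumented_parameters(tool: dict) -> list[str]:
    """Return the names of parameters that have no description."""
    properties = tool["parameters"]["properties"]
    return [name for name, spec in properties.items() if not spec.get("description")]


if __name__ == "__main__":
    print(undocumented_parameters(GET_ORDER_STATUS))  # []
```

A small check like `undocumented_parameters` can run in tests so that a parameter never ships without guidance for the model.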

When should you retrieve instead of stuffing the context?

The instinct with a large context window is to paste everything in - the whole knowledge base, the full document, every prior message. Resist it. Retrieval almost always beats stuffing once your reference material is larger than a few thousand tokens, for three reasons: you keep the signal-to-noise ratio high, which protects attention; you cut cost on every step; and you can update the knowledge without re-engineering the prompt.

The decision rule I use: if the information is small, stable and needed on every step, put it in the system prompt. If it is large, changes often, or is only relevant to some queries, retrieve it on demand. I cover the mechanics in "what is RAG", but the context-engineering point is simply this: retrieval is how you keep a context window focused even when the underlying knowledge is enormous.
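The decision rule can be written down as a few lines of code. This is only a sketch: the `KnowledgeSource` type is hypothetical, and the 2,000-token threshold for "small" is an assumption you should tune to your own model and budget.

```python
from dataclasses import dataclass


@dataclass
class KnowledgeSource:
    name: str
    tokens: int              # approximate size of the material
    changes_often: bool
    needed_every_step: bool


def placement(source: KnowledgeSource, small_limit: int = 2_000) -> str:
    """Return 'system_prompt' or 'retrieve' following the rule in the text."""
    small = source.tokens <= small_limit  # assumed threshold for "small"
    stable = not source.changes_often
    if small and stable and source.needed_every_step:
        return "system_prompt"
    return "retrieve"  # large, volatile, or only sometimes relevant


if __name__ == "__main__":
    style_guide = KnowledgeSource("output style guide", 400, False, True)
    product_docs = KnowledgeSource("product documentation", 250_000, True, False)
    print(placement(style_guide))   # system_prompt
    print(placement(product_docs))  # retrieve
```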

How does the strategy change from 1k to 100k to 1M tokens?

This is the question that trips people up, because the right answer inverts as the budget grows. Today's frontier models from Anthropic, Google and OpenAI offer context windows up to a million tokens or more, and it is tempting to assume that makes context engineering obsolete. The opposite is true - it changes which problem you are solving.

At around 1k-8k tokens, the constraint is space. You hand-curate ruthlessly: tight instructions, one or two examples, only the most relevant retrieved snippet. Every token is contested, so you spend your effort on compression.

At around 100k tokens, space stops being the binding constraint and attention becomes one. You can fit a lot, but the lost-in-the-middle effect means the model will not weight all of it equally. Now you spend your effort on ordering and structure - putting the most important material at the start and end, using clear headings, and summarising stale history rather than carrying it verbatim.

At 1M tokens, you can fit an entire codebase or a book, but two new problems appear: cost and latency scale with what you load, and precision still matters because more irrelevant text means more chances to distract the model. The winning pattern at this scale is not 'load everything' - it is 'retrieve aggressively, then use the huge window as headroom for the genuinely relevant material plus room to reason'. The window grew; the need to curate did not go away.
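The ordering advice for long contexts, putting the most important material at the start and end, can be sketched as a small reordering step. The relevance scores are assumed to come from your own retriever or ranking step; the function name is hypothetical.

```python
def order_for_attention(chunks: list[tuple[str, float]]) -> list[str]:
    """Place the highest-scoring chunks at the edges and the weakest in the middle.

    Each chunk is a (text, relevance_score) pair; higher scores matter more.
    """
    ranked = sorted(chunks, key=lambda chunk: chunk[1], reverse=True)
    front: list[str] = []
    back: list[str] = []
    for index, (text, _score) in enumerate(ranked):
        # Alternate between the front and the back, best chunks first.
        (front if index % 2 == 0 else back).append(text)
    return front + back[::-1]


if __name__ == "__main__":
    scored = [("refund policy", 0.9), ("order record", 0.8),
              ("shipping FAQ", 0.5), ("old newsletter", 0.1)]
    print(order_for_attention(scored))
    # ['refund policy', 'shipping FAQ', 'old newsletter', 'order record']
```

The two strongest chunks end up first and last, and the weakest sits in the middle, where a model is most likely to underweight it anyway.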

How do you stop a long-running agent from forgetting?

Memory is where agents most visibly fall apart, because the naive approach - keep appending the full history - guarantees you hit the context limit and the lost-in-the-middle wall at the same time. The fix is to treat memory as a managed resource rather than an ever-growing log.

Three patterns do most of the work. Summarise older turns into a compact running state so the agent keeps the decisions without the transcript. Externalise durable facts - user preferences, task progress, learned constraints - into a store the agent reads from and writes to, rather than holding them in the conversation. And scope each step to only the memory it needs, rather than re-reading everything every time. I unpack the failure mode in "why your AI keeps forgetting things"; the engineering answer is that a good agent decides what to remember as deliberately as it decides what to do.
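The three patterns fit together in a small memory object. This is a sketch under stated assumptions: `AgentMemory` and its methods are hypothetical names, the `summarise` callable stands in for what would normally be a model call, and the fact store is a plain dictionary rather than a real database.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentMemory:
    summarise: Callable[[list[str]], str]  # hypothetical; in practice a model call
    keep_recent: int = 4                   # how many turns to keep verbatim
    summary: str = ""
    recent: list[str] = field(default_factory=list)
    facts: dict[str, str] = field(default_factory=dict)

    def add_turn(self, turn: str) -> None:
        """Pattern 1: fold turns that fall out of the recent window into a summary."""
        self.recent.append(turn)
        if len(self.recent) > self.keep_recent:
            overflow = self.recent[:-self.keep_recent]
            self.recent = self.recent[-self.keep_recent:]
            prior = [self.summary] if self.summary else []
            self.summary = self.summarise(prior + overflow)

    def remember(self, key: str, value: str) -> None:
        """Pattern 2: externalise a durable fact instead of leaving it in the transcript."""
        self.facts[key] = value

    def context_for(self, needed_keys: list[str]) -> str:
        """Pattern 3: give this step only the facts it needs."""
        facts = [f"{key}: {self.facts[key]}" for key in needed_keys if key in self.facts]
        parts = [self.summary, *facts, *self.recent]
        return "\n".join(part for part in parts if part)


if __name__ == "__main__":
    memory = AgentMemory(summarise=lambda turns: f"Summary of {len(turns)} items", keep_recent=2)
    for turn in ["turn 1", "turn 2", "turn 3"]:
        memory.add_turn(turn)
    memory.remember("units", "metric")
    print(memory.context_for(["units"]))
```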

What are the most common context-engineering mistakes?

The same handful of errors come up again and again. Dumping raw tool output (especially full JSON) straight back into the context instead of extracting the relevant field. Writing tool descriptions for humans rather than the model. Carrying the entire conversation history verbatim when a summary would do. Pre-loading a knowledge base that should be retrieved on demand. And treating a bigger context window as permission to stop curating. Each one quietly degrades the agent while leaving the model itself untouched - which is exactly why blaming the model is so tempting and so wrong.
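The first mistake, dumping raw tool output into the context, has a cheap fix: keep only the fields the next step needs. A minimal sketch, assuming the tool returns a JSON object and that the wanted fields are top-level keys; the function name and sample payload are hypothetical.

```python
import json


def extract_fields(raw_json: str, wanted: list[str]) -> str:
    """Keep only the wanted top-level fields from a JSON tool result."""
    data = json.loads(raw_json)
    slim = {key: data[key] for key in wanted if key in data}
    # Compact separators avoid spending tokens on whitespace.
    return json.dumps(slim, separators=(",", ":"))


if __name__ == "__main__":
    raw = json.dumps({
        "order_id": "ORD-00012345",
        "status": "shipped",
        "audit_log": ["internal event"] * 50,  # noise the model does not need
    })
    print(extract_fields(raw, ["order_id", "status"]))
    # {"order_id":"ORD-00012345","status":"shipped"}
```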

Frequently asked questions

Q01. Is context engineering the same as prompt engineering?
No. Prompt engineering is about wording a single request well. Context engineering is the broader discipline of deciding everything the model can see on each step - instructions, tools, examples, memory and retrieved data - which matters far more for multi-step agents than for one-shot chat.
Q02. Does a 1M-token context window make context engineering unnecessary?
No. A larger window removes the space constraint but not the attention, cost or latency constraints. Models still weight information unevenly across a long context, so precision and ordering matter as much at 1M tokens as compression matters at 1k.
Q03. Should I put my knowledge base in the system prompt or retrieve it?
Put information in the system prompt only if it is small, stable and needed on every step. If it is large, changes often, or is relevant only to some queries, retrieve it on demand so the context stays focused and cheap.
Q04. Why does my agent forget what it did earlier?
Because it is carrying the full history verbatim until the important details get buried or pushed out of the window. Summarise old turns, externalise durable facts to a store, and give each step only the memory it needs.