Building AI Agents That Survive Production

Most AI agents shipping today are held together with sellotape. Four architectural bets separate the production-ready ones from the prototypes.

By Rob · 11 June 2026 · 7 min read

The interesting failure mode of AI agents in 2026 isn't that the models are wrong. It's that the surrounding infrastructure is wrong, and the model gets blamed. Sessions time out mid-task, context windows fragment, credentials leak across boundaries, and the agent comes back looking dumber than it is. Web developer Addy Osmani published a piece in April 2026 outlining the four architectural bets a serious agent stack has to make. Reading it is a useful exercise even if you're not running production agents yet, because the bets describe what's about to be table stakes.

Why is this an architecture problem, not a model problem?

The default assumption in 2025 was that agents needed better models. Smarter, longer-context, more capable. That bet was half right. Models did get dramatically better; agents that used them still broke in production, just for different reasons. The failures clustered in four places that had nothing to do with model quality and everything to do with the plumbing around it.

That's a familiar pattern. Cloud computing went through it: the first generation of cloud apps used virtual machines as if they were physical servers and got eaten by reliability problems. The second generation accepted that cloud was a different substrate and built for it. AI agents are at the same crossover. Per Osmani's analysis, the teams shipping reliable agents in 2026 are the ones who've stopped treating the model as the product and started treating the surrounding stack as the product.

What are the four bets?

Identity, not borrowed credentials

Today most agents act as a service account, sharing one identity with every other agent in the company. When something goes wrong (an audit log shows a million-dollar transfer), nobody can tell which agent did it. The bet: give every agent its own unforgeable identity at the platform layer, recognised by your IAM and your audit tooling, so individual agent actions are traceable. That closes the 'ghost in the system' problem before regulators force you to.
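As a rough illustration of what "traceable" could mean, the sketch below signs each audit entry with a key that belongs to one agent. Everything here is hypothetical: the `AgentIdentity` class, the agent names, and the entry layout are invented for the example, and an HMAC shared secret stands in for the asymmetric keys or platform-issued credentials a real IAM system would use.

```python
import hashlib
import hmac
import json
import secrets
from dataclasses import dataclass, field


@dataclass
class AgentIdentity:
    """Hypothetical per-agent identity: a unique ID plus a private signing key."""
    agent_id: str
    signing_key: bytes = field(default_factory=lambda: secrets.token_bytes(32))


def _canonical(agent_id: str, action: str, details: dict) -> bytes:
    # Stable serialisation so the signer and the verifier hash identical bytes.
    return json.dumps(
        {"agent_id": agent_id, "action": action, "details": details},
        sort_keys=True,
    ).encode()


def sign_entry(agent: AgentIdentity, action: str, details: dict) -> dict:
    """Build an audit entry that names the agent and carries its signature."""
    payload = _canonical(agent.agent_id, action, details)
    signature = hmac.new(agent.signing_key, payload, hashlib.sha256).hexdigest()
    return {"agent_id": agent.agent_id, "action": action,
            "details": details, "signature": signature}


def verify_entry(entry: dict, registry: dict[str, AgentIdentity]) -> bool:
    """Check that the entry was signed by the agent it claims to come from."""
    agent = registry.get(entry["agent_id"])
    if agent is None:
        return False
    payload = _canonical(entry["agent_id"], entry["action"], entry["details"])
    expected = hmac.new(agent.signing_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, entry["signature"])


# Two illustrative agents, each with its own key (assumption: in-memory registry).
billing = AgentIdentity("billing-agent")
support = AgentIdentity("support-agent")
registry = {a.agent_id: a for a in (billing, support)}

entry = sign_entry(billing, "transfer", {"amount_usd": 1_000_000})
forged = {**entry, "agent_id": "support-agent"}  # someone relabels the entry
```

Here `verify_entry(entry, registry)` succeeds while the relabelled `forged` entry fails, because the signature no longer matches the key of the agent it names. With a single shared service account there is no per-agent key to check against, which is the gap the paragraph above describes.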

Universal context, not scraped windows

An agent that has to reason across your CRM, your support tickets, your finance data, and your codebase can currently do so only because your engineering team wrote custom plumbing for each system. The result is endless boilerplate, brittle integrations, and an agent that only sees what the plumbing remembered to pass through. The bet: integrate context once at the platform layer (Model Context Protocol, enterprise data fabric) so the agent reasons across systems without you stitching JSON together by hand.
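To make "integrate once" concrete, here is a minimal sketch of a single entry point that fans a query out to several systems. The `ContextSource` interface, `ContextHub`, and the in-memory adapters are all assumptions made for illustration; this is not the Model Context Protocol API, only the general shape of putting one interface in front of many backends.

```python
from typing import Protocol


class ContextSource(Protocol):
    """Hypothetical interface every system adapter implements once."""
    name: str

    def fetch(self, query: str) -> list[dict]: ...


class InMemorySource:
    """Stand-in adapter; a real one would call the CRM, ticketing system, etc."""

    def __init__(self, name: str, records: list[dict]) -> None:
        self.name = name
        self._records = records

    def fetch(self, query: str) -> list[dict]:
        # Naive case-insensitive substring match, just enough for the example.
        needle = query.lower()
        return [r for r in self._records if needle in r["text"].lower()]


class ContextHub:
    """Single entry point the agent queries instead of per-system glue code."""

    def __init__(self) -> None:
        self._sources: dict[str, ContextSource] = {}

    def register(self, source: ContextSource) -> None:
        self._sources[source.name] = source

    def gather(self, query: str) -> list[dict]:
        # Tag each record with its origin so the agent can say where it came from.
        results = []
        for name, source in self._sources.items():
            for record in source.fetch(query):
                results.append({"source": name, **record})
        return results


hub = ContextHub()
hub.register(InMemorySource("crm", [{"text": "Acme Ltd renewal due in March"}]))
hub.register(InMemorySource("tickets", [{"text": "Acme reports login failures"},
                                        {"text": "Globex asks for an invoice copy"}]))
context = hub.gather("acme")
```

Adding a new system then means writing one adapter and registering it, rather than touching every agent that needs the data.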

Surviving the session

Real workflows take hours or weeks (a procurement process, a software migration, a multi-stage onboarding). Most agents today have a finite context window measured in tokens and a session lifetime measured in minutes. They hit a ceiling, lose state, and the human has to restart. The bet: durable execution with state checkpointing, long-horizon memory, and explicit human-in-the-loop gates so the agent picks up where it left off after a credential rotation, an outage, or simply a long weekend.
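The core of checkpointing with a human gate fits in a few lines. The sketch below is a hand-rolled illustration, not how Temporal, Restate, or any other platform implements durable execution: the step names, the JSON checkpoint file, and the `approve` helper are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return the saved checkpoint, or a fresh state if none exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": [], "approved": False}


def save_state(path: Path, state: dict) -> None:
    # Write to a temp file, then rename, so a crash never leaves a half-written checkpoint.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def run_workflow(path: Path, steps: list[str], gate_before: str) -> str:
    """Run steps in order, skipping finished ones and pausing at the approval gate."""
    state = load_state(path)
    for step in steps:
        if step in state["completed"]:
            continue  # finished in an earlier session
        if step == gate_before and not state["approved"]:
            return f"paused: waiting for approval before '{step}'"
        # Placeholder for the real work (a model call, an API request, ...).
        state["completed"].append(step)
        save_state(path, state)
    return "done"


def approve(path: Path) -> None:
    """Record the human sign-off so the next run can pass the gate."""
    state = load_state(path)
    state["approved"] = True
    save_state(path, state)


# Illustrative procurement workflow with an approval gate before ordering.
steps = ["collect_quotes", "compare_vendors", "place_order", "confirm_delivery"]
checkpoint = Path(tempfile.mkdtemp()) / "procurement.json"

first = run_workflow(checkpoint, steps, gate_before="place_order")
approve(checkpoint)
second = run_workflow(checkpoint, steps, gate_before="place_order")
```

The first call finishes two steps and pauses at the gate; the second call, which could just as well happen in a new process days later, skips the completed steps and runs to the end. Managed durable-execution platforms add what this sketch leaves out, such as retries, timers, and exactly-once guarantees around side effects.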

Platforms, not custom stacks

Every team building agents in 2025 wrote its own memory layer, its own observability, its own retry logic. That's the same waste-of-energy pattern that justified the move from bare metal to cloud. The bet: build on open primitives and managed platforms (Temporal, Restate, DBOS, LangGraph, the emerging MCP-based ecosystems) so your team can spend time on the part that's actually domain-specific. Solving 'agents need a memory layer' the eighteenth time is not a competitive advantage.

What does this mean for a small team?

If you're a solo builder or a small team, the four bets read like enterprise problems. Mostly they are. But two of them matter even at the smallest scale, and the other two are worth understanding before you commit to architecture decisions that will hurt later.

Identity matters from day one. If your agent calls an API on a user's behalf, the credentials need to be scoped to that user, not to a generic service account. Get this wrong on a hobby project and the cost is a leaked key. Get it wrong on a paid product and the cost is a compliance event. Either way the fix is much cheaper at the start than at scale.

Session survival becomes urgent the moment your agent does anything that takes more than a few minutes. If your agent's longest task is one prompt-and-reply, the surrounding session story doesn't matter. The moment you ask it to process a queue, fill a form across multiple steps, or run an overnight job, you need state checkpointing or you'll be hand-restarting it daily.

The other two (universal context, platform stack) you can defer. Most hobbyists don't have an enterprise data fabric to integrate with, and the cost of using a custom stack is a rounding error at small scale. Revisit when the scale forces you to.

When is the simpler approach still right?

Three categories where the four-bet architecture is overkill.

Throwaway scripts. An agent that runs for ten minutes once a week to do a thing you'd otherwise do by hand doesn't need an identity layer or durable execution. A shell script with a one-line LLM call is the right answer. Don't over-engineer.

Interactive editor agents. Claude Code, Cursor, and similar in-editor agents already live inside your editor's session. They borrow your identity (your git config, your shell credentials) and lose state when you close the window, and that's correct for the interactive use case. The four bets matter for agents acting autonomously; interactive ones have a human in the loop already.

Prototypes you intend to throw away. If you're learning the space, building a quick demo, or testing whether an agent can do X at all, custom plumbing is fine. The four bets matter when you're committing to running the thing for years; before that point the architectural discipline is friction without payoff.

Frequently asked questions

Q01 Are any of these solved problems today, or all still in flux?
Identity and session survival have credible options shipping today (enterprise IAM systems for identity, Temporal and Restate for durable execution). MCP gives context a credible starting point, but universal context across enterprise systems is still partly bespoke; platforms-not-stacks is a direction, not a product. Expect rapid consolidation across 2026-2027.
Q02 Does this apply to hobby projects, or only enterprise?
Identity matters from day one. Session-survival kicks in once your agent does anything longer than a chat. The other two scale with the size of the integration surface, so they apply more to enterprise than to hobby projects. Don't over-engineer a weekend script.
Q03 Which platform should I bet on for durable execution?
Temporal is the most mature, with the largest community. Restate is newer but designed for the AI agent use case. DBOS is interesting if you also want a database-native execution model. For experimentation, LangGraph + a hosted backend is the lowest friction. None is yet the obvious winner; pick the one whose docs you find most readable.
Q04 What does "agent identity" actually look like in practice?
Concretely, each agent gets its own OAuth client ID, its own API tokens, its own row in your IAM system. When the agent acts on a user's behalf, it does so via a delegated-authority flow (the user grants the specific agent permission for the specific scope), not by holding the user's credentials. Audit logs show the agent, the user, and the scope every time.
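A minimal sketch of that delegated-authority check, under stated assumptions: the `Grant` record, the scope strings, and the in-memory `audit_log` are invented for illustration and do not correspond to any particular OAuth library or IAM product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    """Hypothetical record of a user delegating specific scopes to one agent."""
    user_id: str
    agent_id: str
    scopes: frozenset[str]
    expires_at: datetime


audit_log: list[dict] = []


def authorize(grants: list[Grant], agent_id: str, user_id: str,
              scope: str, now: datetime) -> bool:
    """Allow the action only if a live grant covers this agent, user, and scope."""
    allowed = any(
        g.agent_id == agent_id and g.user_id == user_id
        and scope in g.scopes and now < g.expires_at
        for g in grants
    )
    # Every decision is logged with the agent, the user, and the scope.
    audit_log.append({"agent": agent_id, "user": user_id, "scope": scope,
                      "allowed": allowed, "at": now.isoformat()})
    return allowed


now = datetime(2026, 6, 11, 12, 0, tzinfo=timezone.utc)
grants = [Grant("user-42", "calendar-agent",
                frozenset({"calendar:read"}), now + timedelta(hours=1))]

can_read = authorize(grants, "calendar-agent", "user-42", "calendar:read", now)
can_write = authorize(grants, "calendar-agent", "user-42", "calendar:write", now)
other_agent = authorize(grants, "email-agent", "user-42", "calendar:read", now)
```

Only the first call is allowed: the grant names one agent, one user, and one scope, so a wider scope or a different agent is refused, and all three decisions land in the audit log. The agent never holds the user's own credentials at any point.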
Q05 Is this going to make AI agents more expensive?
Up front, yes. The platforms charge for orchestration, the identity layer adds engineering work, and durable execution costs more than ephemeral execution. Over a multi-year project the unit economics improve dramatically (you stop rebuilding the same plumbing) and the reliability gain is what makes the agent worth running at all. The investment is real; the alternative is shipping the agent twice.