When AI Writes the Code: Probabilistic Engineering

AI now writes huge slabs of software. Validation, not generation, is the new bottleneck - and that quiet shift changes how trust gets built.


Updated 11 June 2026

By Rob · 11 June 2026 · 11 min read

A new phrase has been quietly making the rounds in software circles - probabilistic engineering. It captures something that has changed in the last two years without much fanfare. Code used to be written by people who knew what they were typing. Increasingly, it is written by AI assistants that produce something plausible, fast, and only sometimes correct.

If you do not write software for a living, you might wonder why any of this matters to you. It matters because the apps on your phone, the booking system at your dentist, and the online checkout at your supermarket are all increasingly built this way. Understanding the shift helps you make sense of why some things suddenly work brilliantly and other things suddenly break in baffling ways.

What does "probabilistic engineering" actually mean?

Until very recently, writing software was a deterministic activity. A developer typed instructions, the instructions ran, and the same input produced the same output every single time. If something went wrong, you could read the code, find the broken line, and fix it. The whole craft was built around the idea that humans could read what humans had written.

Probabilistic engineering describes what happens when that assumption breaks. AI coding tools - ChatGPT, Claude, Gemini, GitHub Copilot, Cursor - generate code based on statistical patterns. They have read more software than any human ever could, and they produce reasonable-looking output by predicting what comes next. Most of the time it works. Sometimes it does not, and the reason it failed is not always clear, even to the AI.
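To make "predicting what comes next" concrete, here is a minimal sketch in Python. It is not how any particular tool works internally. The token list and the probabilities are invented for illustration, and `sample_next_token` and `generate_many` are hypothetical names. The only point is that the same prompt can yield a different continuation on different runs.

```python
import random

# Hypothetical probabilities a model might assign to the next token
# after the prompt "return a +". The numbers are invented for illustration.
NEXT_TOKEN_PROBS = {"b": 0.90, "1": 0.07, "a": 0.03}


def sample_next_token(probs, rng):
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate_many(n, seed=0):
    """Sample the next token n times and count how often each one appears."""
    rng = random.Random(seed)
    counts = {token: 0 for token in NEXT_TOKEN_PROBS}
    for _ in range(n):
        counts[sample_next_token(NEXT_TOKEN_PROBS, rng)] += 1
    return counts


if __name__ == "__main__":
    # Most samples are the likely token; a minority are not.
    print(generate_many(1000))
```

In this toy setup the likely answer turns up most of the time and an unlikely one turns up occasionally, which is roughly the "most of the time it works" behaviour described above.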

The useful analogy is the difference between writing an essay and proofreading one. Writing is generative - you put words on a page from scratch. Proofreading is validating - you check whether the words on the page make sense, are accurate, and say what they were meant to say. Both are real work. They use different muscles.

How is this different from how software used to be built?

Traditional software development was effort-heavy at the generation step. Writing a thousand lines of working code took a skilled developer days or weeks. Reviewing those lines, by comparison, took an hour. The ratio favoured careful generation followed by light review.

That ratio has flipped. A modern AI coding tool can produce a thousand lines in minutes. The generation step that used to take a week now takes a cup of tea. But the review step has not got any faster - in fact, it has got harder, because the human reviewer did not write the code and so cannot rely on having a clear mental model of what it does.

This is the practical heart of probabilistic engineering. We have made generation cheap without making validation any cheaper. The bottleneck moved, and most of the industry is still adjusting to where it ended up.

A useful way to picture it is a car factory. For decades the slow step was bolting parts together by hand. When robots took over the assembly, you might expect cars to roll off the line in seconds. They don't. The slow step moved to quality assurance (the people checking each car works before it leaves the factory). Speeding up assembly without speeding up QA just means more cars piling up at the inspection bay - or worse, more cars leaving the factory with faults nobody caught.
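The inspection-bay pile-up is simple arithmetic, and a short Python sketch can show it. The figures are illustrative assumptions rather than measurements, and `review_backlog` is a hypothetical helper: it assumes a fixed number of changes arrive each day and a fixed number can be reviewed each day.

```python
def review_backlog(days, generated_per_day, reviewed_per_day):
    """Return the number of unreviewed changes waiting at the end of each day."""
    backlog = 0
    history = []
    for _ in range(days):
        backlog += generated_per_day
        # Reviewers clear what they can, but never more than is waiting.
        backlog -= min(backlog, reviewed_per_day)
        history.append(backlog)
    return history


# Illustrative numbers only: review capacity stays at 4 changes a day.
before_ai = review_backlog(days=5, generated_per_day=2, reviewed_per_day=4)
with_ai = review_backlog(days=5, generated_per_day=10, reviewed_per_day=4)

print(before_ai)  # [0, 0, 0, 0, 0]
print(with_ai)    # [6, 12, 18, 24, 30]
```

With generation slower than review, nothing queues. Once generation outpaces review, the queue grows by the difference every day, and the only ways out are more review capacity, less generation, or less careful review.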

Why is validation now the bottleneck?

Three forces converge to make validation harder, not easier, in the AI era.

The volume has exceeded human attention. When a developer writes their own code, they can hold the whole thing in their head while they write. When the same developer reviews AI output that arrives five times as fast, they cannot. By the time the third pull request lands, the first one is already a blur. Important details slip past not through laziness but through lack of bandwidth.

A long-standing finding in software-quality work is that the cost of a defect rises sharply the later it is found. A bug caught while writing costs minutes. The same bug in production costs hours of debugging, an incident report, and sometimes a public apology. The economics of validation reward catching things early, which is precisely where AI-assisted workflows are weakest.

Models reviewing models miss plenty. One common response is to ask an AI to review AI output. This works for stylistic issues and obvious mistakes. It works much less well for subtle logic errors, security holes, or anything that requires real understanding of the system the code is being added to. A model that confidently wrote a bug is unlikely to confidently flag the same bug.

Context is finite. Even the most capable AI assistants can only hold so much of your codebase in mind at once. The further a change ripples through a system, the more likely it is that the AI has missed something elsewhere that depends on the bit being changed. We explored this in detail in our piece on context rot, but the short version is that AI's awareness of a system fades the larger the system gets.

What does this mean if you don't write code?

Even if you have never opened a code editor, the shift to probabilistic engineering shows up in the software you already use. Three patterns are worth noticing.

Things get built faster. Features that would have taken a small startup six months can ship in six weeks. This is genuinely good. It means small teams can compete with large ones, niche tools get built that nobody could have funded before, and your favourite app gets the feature you've been asking for sooner than you expected.

Things break in unfamiliar ways. When a bug used to slip through, you could often guess what had happened - someone forgot a check, a number was wrong somewhere. AI-introduced bugs tend to look weirder. The code does something that is almost right, in a way that suggests the AI understood 90% of the problem and made up the last 10%. If your bank app suddenly displays your balance in dollars one Tuesday morning, the answer is probably probabilistic.

Quality is now more about review culture than headcount. The teams that ship reliable software now are not necessarily the ones with the most engineers. They are the ones with the strongest habits around deciding what to trust, testing aggressively, and being honest when validation has slipped. This is good news for small teams who take quality seriously. It is bad news for organisations that hoped AI would let them cut corners.

Where does it go wrong in practice?

Several failure patterns recur often enough that they deserve names.

Plausible-looking nonsense. The AI generates code that compiles, runs, passes tests, and quietly does the wrong thing under specific conditions nobody tested. This is the classic hallucination problem dressed up in technical clothing - confident output that is just confidently wrong.
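A small, invented Python example of this pattern follows. `split_bill` is a hypothetical function, not taken from any real codebase. It passes the one test somebody wrote and quietly loses a penny under a condition nobody tested. `split_bill_fixed` shows one possible correction.

```python
def split_bill(total_pence, people):
    """Split a bill evenly. Correct only when the total divides exactly."""
    share = total_pence // people
    return [share] * people


def split_bill_fixed(total_pence, people):
    """Split a bill so that the shares always add up to the total."""
    share, remainder = divmod(total_pence, people)
    # Hand the leftover pennies to the first few people, one each.
    return [share + 1 if i < remainder else share for i in range(people)]


# The only test anyone wrote: it passes.
assert split_bill(3000, 3) == [1000, 1000, 1000]

# The condition nobody tested: a penny goes missing.
shares = split_bill(1000, 3)
print(shares, sum(shares))  # [333, 333, 333] 999

fixed = split_bill_fixed(1000, 3)
print(fixed, sum(fixed))  # [334, 333, 333] 1000
```

Nothing about the first version looks wrong on a quick read, which is what makes this failure pattern hard to catch in review.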

Reviewer fatigue. A team agrees to review every AI-generated change. They mean it. By week four, exhausted, they are skimming. The decline is gradual enough that nobody notices until something embarrassing reaches production.

Integration drift. Each individual change looks fine. The sum of changes is no longer coherent. Pieces of the codebase quietly disagree about how a thing should work, because the AI that wrote piece A had different context to the AI that wrote piece B, and the human reviewer of each missed the disagreement.
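As a hedged illustration in Python, imagine two pieces written in separate sessions. All names and values here are hypothetical. Piece A returns prices in pence. Piece B formats an amount it assumes is in pounds. Each passes its own check, and the combination is wrong.

```python
def get_price(item):
    """Piece A: look up a price. Returns an amount in pence."""
    prices_pence = {"coffee": 350}
    return prices_pence[item]


def format_price(amount):
    """Piece B: format a price. Written assuming the amount is in pounds."""
    return f"£{amount:.2f}"


# Each piece passes the check its own author and reviewer looked at.
assert get_price("coffee") == 350
assert format_price(3.5) == "£3.50"

# Put together, the two pieces disagree about units.
print(format_price(get_price("coffee")))  # £350.00
```

Neither function contains a bug on its own terms. The fault lives in the gap between them, which is the place a change-by-change review is least likely to look.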

Test theatre. AI is excellent at writing tests that pass. It is much less excellent at writing tests that would actually catch a bug. A codebase can rapidly accumulate hundreds of confidence-inducing tests that prove almost nothing.
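The difference is easier to see in a small sketch. The Python below is an invented example: `apply_discount` is a hypothetical function with a deliberate bug (it adds the discount instead of subtracting it), `check_theatre` is the kind of test that passes regardless, and `check_meaningful` is the kind that would catch the bug.

```python
def apply_discount(price_pence, percent):
    """Hypothetical function with a deliberate bug: it adds the discount."""
    return price_pence + price_pence * percent // 100


def check_theatre():
    """Passes as long as the function returns some integer."""
    result = apply_discount(1000, 10)
    assert result is not None
    assert isinstance(result, int)


def check_meaningful():
    """Pins down the expected answer: 10% off 1000 pence is 900 pence."""
    assert apply_discount(1000, 10) == 900


def run(check):
    """Run one check and report whether it passed."""
    try:
        check()
        return "pass"
    except AssertionError:
        return "FAIL"


print(run(check_theatre))     # pass
print(run(check_meaningful))  # FAIL
```

Both checks add one to the test count. Only the second says anything about whether the discount is right.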

None of these failure modes are new in principle. Skilled humans have been making versions of these mistakes for as long as software has existed. What is new is the rate. Generation got cheap; mistakes got cheap to make in bulk.

Should non-engineers worry about this?

Worry is the wrong frame. The right frame is awareness.

Worth knowing: most software you rely on day-to-day will continue working most of the time. The systems that handle money, medical records, or anything safety-critical are built with deeper validation layers (regulators, auditors, formal testing) than the average startup app. They are slower-moving and harder to disrupt, which is exactly the point.

Worth doing: be a bit more sceptical of brand-new features in apps you depend on. Wait a fortnight before trusting a new banking integration with your salary. Read reviews of new AI-powered tools rather than buying on launch day. The probabilistic-engineering era rewards patience.

Worth ignoring: the headlines claiming AI is about to either solve all bugs or destroy all software. Neither is happening. The reality is more mundane and more interesting - we are working out, in public, how to do quality assurance for a kind of work that did not exist three years ago. That is going to take a few more iterations.

Tim Davis's essay on probabilistic engineering coined the framing we have used in this piece. The term is useful precisely because it points at a real shift without pretending the shift is wholly good or wholly bad.

Frequently asked questions

Q01. Is probabilistic engineering the same as vibe coding?

They overlap but are not identical. Vibe coding usually refers to a casual style where the human steers an AI by intuition rather than specification, often for hobby or prototype work. Probabilistic engineering is the broader phenomenon - the entire industry adjusting to AI-generated code as a normal input, including in serious production systems where vibe coding alone would be irresponsible.

Q02. Does this mean software jobs are going away?

No. The work is shifting, not vanishing. Generation jobs are getting compressed - one person can now do what a small team used to. Validation, system design, and judgement work are growing. The Office for National Statistics flagged software development as one of the UK's faster-growing roles in its most recent labour-market projections.

Q03. How can I tell if an app I use is built this way?

You usually cannot tell directly, but you can spot the signs. Apps that ship updates very frequently, sometimes daily, may well be using AI tooling heavily, although frequent releases on their own prove nothing. Smaller startups in particular have adopted these workflows quickly. The pattern of "works brilliantly in one area, breaks oddly in another" is also a soft tell.

Q04. Is AI-written code less secure than human-written code?

It depends on what you are comparing. AI is good at avoiding common, well-documented security mistakes - the kind that show up in every textbook. AI is less reliable on novel or subtle security issues that require understanding the threat model. Mature teams use both AI generation and human security review. The higher-risk pattern is a lone developer shipping AI output unchecked.

Q05. Will this get better over time?

Probably yes, but not because the AI gets perfect. It will get better because teams develop better habits - smaller change sizes, stronger automated checks, clearer ownership, and a healthier scepticism about output that looks fine. The pattern is similar to how the early web matured from "works in one browser only" chaos into something dependable.

Related guides

What Actually Happens Inside Claude Code or Cursor

Context Rot: Why Long AI Sessions Get Worse

Why AI Hasn't Replaced Human Experts (Yet)

When to Trust AI Answers