When AI Writes the Code: Probabilistic Engineering
AI now writes huge slabs of software. Validation, not generation, is the new bottleneck - and that quiet shift changes how trust gets built.

A new phrase has been quietly making the rounds in software circles - probabilistic engineering. It captures something that has changed in the last two years without much fanfare. Code used to be written by people who knew what they were typing. Increasingly, it is written by AI assistants that produce something plausible, fast, and only sometimes correct.
If you do not write software for a living, you might wonder why any of this matters to you. It matters because the apps on your phone, the booking system at your dentist, and the online checkout at your supermarket are all increasingly built this way. Understanding the shift helps you make sense of why some things suddenly work brilliantly and other things suddenly break in baffling ways.
What does "probabilistic engineering" actually mean?
Until very recently, writing software was a deterministic activity. A developer typed instructions, the instructions ran, and the same input produced the same output every single time. If something went wrong, you could read the code, find the broken line, and fix it. The whole craft was built around the idea that humans could read what humans had written.
Probabilistic engineering describes what happens when that assumption breaks. AI coding tools - ChatGPT, Claude, Gemini, GitHub Copilot, Cursor - generate code based on statistical patterns. They have read more software than any human ever could, and they produce reasonable-looking output by predicting what comes next. Most of the time it works. Sometimes it does not, and the reason it failed is not always clear, even to the AI.
The useful analogy is the difference between writing an essay and proofreading one. Writing is generative - you put words on a page from scratch. Proofreading is validating - you check whether the words on the page make sense, are accurate, and say what they were meant to say. Both are real work. They use different muscles.
How is this different from how software used to be built?
Traditional software development was effort-heavy at the generation step. Writing a thousand lines of working code took a skilled developer days or weeks. Reviewing those lines, by comparison, took an hour. The ratio favoured careful generation followed by light review.
That ratio has flipped. A modern AI coding tool can produce a thousand lines in minutes. The generation step that used to take a week now takes a cup of tea. But the review step has not got any faster - in fact, it has got harder, because the human reviewer did not write the code and so cannot rely on having a clear mental model of what it does.
This is the practical heart of probabilistic engineering. We have made generation cheap without making validation any cheaper. The bottleneck moved, and most of the industry is still adjusting to where it ended up.
A useful way to picture it is a car factory. For decades the slow step was bolting parts together by hand. When robots took over the assembly, you might expect cars to roll off the line in seconds. They don't. The slow step moved to quality assurance (the people checking each car works before it leaves the factory). Speeding up assembly without speeding up QA just means more cars piling up at the inspection bay - or worse, more cars leaving the factory with faults nobody caught.
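For readers who like to see the arithmetic, the pile-up at the inspection bay can be sketched in a few lines of Python. This is only an illustration: the function name `simulate_backlog` and the daily figures are made up for the example, not measurements from any real team.

```python
def simulate_backlog(generated_per_day: int, reviewed_per_day: int, days: int) -> list[int]:
    """Return how many changes are still waiting for review at the end of each day."""
    backlog = 0
    history = []
    for _ in range(days):
        backlog += generated_per_day               # new changes arrive
        backlog -= min(backlog, reviewed_per_day)  # reviewers clear what they can
        history.append(backlog)
    return history


# Hypothetical figures: review capacity stays at 5 changes a day in both cases.
before_ai = simulate_backlog(generated_per_day=2, reviewed_per_day=5, days=5)
with_ai = simulate_backlog(generated_per_day=10, reviewed_per_day=5, days=5)

print(before_ai)  # [0, 0, 0, 0, 0]
print(with_ai)    # [5, 10, 15, 20, 25]
```

With the made-up numbers above, nothing waits when generation is slower than review. Once generation outpaces review, the queue grows by the difference every single day.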
Why is validation now the bottleneck?
Three forces converge to make validation harder, not easier, in the AI era.
The volume has exceeded human attention. When a developer writes their own code, they can hold the whole thing in their head while they write. When the same developer reviews AI output at five times the speed, they cannot. By the time the third pull request lands, the first one is already a blur. Important details slip past not through laziness but through bandwidth.
Misses also get more expensive the longer they survive. A long-standing finding in software quality work is that the cost of a defect rises sharply the later it is found. A bug caught while writing costs minutes. The same bug in production costs hours of debugging, an incident report, and sometimes a public apology. The economics of validation reward catching things early, which is precisely where AI-assisted workflows are weakest.
Models reviewing models miss plenty. One common response is to ask an AI to review AI output. This works for stylistic issues and obvious mistakes. It works much less well for subtle logic errors, security holes, or anything that requires real understanding of the system the code is being added to. A model that confidently wrote a bug is unlikely to confidently flag the same bug.
Context is finite. Even the most capable AI assistants can only hold so much of your codebase in mind at once. The further a change ripples through a system, the more likely it is that the AI has missed something elsewhere that depends on the bit being changed. We explored this in detail in our piece on context rot, but the short version is that AI's awareness of a system fades the larger the system gets.
What does this mean if you don't write code?
Even if you have never opened a code editor, the shift to probabilistic engineering shows up in the software you already use. Three patterns are worth noticing.
Things get built faster. Features that would have taken a small startup six months can ship in six weeks. This is genuinely good. It means small teams can compete with large ones, niche tools get built that nobody could have funded before, and your favourite app gets the feature you've been asking for sooner than you expected.
Things break in unfamiliar ways. When a bug used to slip through, you could often guess what had happened - someone forgot a check, a number was wrong somewhere. AI-introduced bugs tend to look weirder. The code does something that is almost right, in a way that suggests the AI understood 90% of the problem and made up the last 10%. If your bank app suddenly displays your balance in dollars one Tuesday morning, the cause is probably probabilistic.
Quality is now more about review culture than headcount. The teams that ship reliable software now are not necessarily the ones with the most engineers. They are the ones with the strongest habits around deciding what to trust, testing aggressively, and being honest when validation has slipped. This is good news for small teams who take quality seriously. It is bad news for organisations that hoped AI would let them cut corners.
Where does it go wrong in practice?
Several failure patterns recur often enough that they deserve names.
Plausible-looking nonsense. The AI generates code that compiles, runs, passes tests, and quietly does the wrong thing under specific conditions nobody tested. This is the classic hallucination problem dressed up in technical clothing - confident output that is just confidently wrong.
Reviewer fatigue. A team agrees to review every AI-generated change. They mean it. By week four, exhausted, they are skimming. The decline is gradual enough that nobody notices until something embarrassing reaches production.
Integration drift. Each individual change looks fine. The sum of changes is no longer coherent. Pieces of the codebase quietly disagree about how a thing should work, because the AI that wrote piece A had different context to the AI that wrote piece B, and the human reviewer of each missed the disagreement.
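A tiny sketch of what integration drift can look like, assuming two hypothetical pieces of a shop's codebase written at different times. Both function names are invented for illustration. Each one looks sensible when reviewed on its own.

```python
def save_price(pounds: float) -> int:
    """Piece A: stores money as a whole number of pence."""
    return round(pounds * 100)


def format_price(stored: float) -> str:
    """Piece B: assumes the stored value is already in pounds."""
    return f"{stored:.2f} pounds"


# Reviewed separately, neither function has an obvious mistake.
# Put together, a 12.50 item is displayed as 1250.00.
stored = save_price(12.50)
print(format_price(stored))  # 1250.00 pounds
```

Neither piece is wrong by itself. The fault lives in the disagreement between them, which is exactly what a reviewer looking at one change at a time is least likely to see.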
Test theatre. AI is excellent at writing tests that pass. It is much less excellent at writing tests that would actually catch a bug. A codebase can rapidly accumulate hundreds of confidence-inducing tests that prove almost nothing.
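The gap between a test that passes and a test that proves something is easier to see in a small example. The sketch below is hypothetical: `apply_discount` is an invented function with a deliberate bug, and the two checks are simplified stand-ins for real tests.

```python
def apply_discount(price: float, percent: float) -> float:
    # Deliberate bug: subtracts the percentage as if it were an amount of money.
    # The intended behaviour would be price * (1 - percent / 100).
    return price - percent


def theatre_test() -> bool:
    """Passes, but only checks that a smaller number came back."""
    result = apply_discount(100.0, 10.0)
    return isinstance(result, float) and result < 100.0


def real_test() -> bool:
    """Checks an exact expected value at a price other than 100."""
    return apply_discount(200.0, 10.0) == 180.0


print(theatre_test())  # True
print(real_test())     # False
```

At a price of exactly 100 the buggy code even happens to give the right answer, which is how code like this can compile, run, and pass its tests while being wrong for almost every other price. Only the check against a specific expected value exposes it.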
None of these failure modes are new in principle. Skilled humans have been making versions of these mistakes for as long as software has existed. What is new is the rate. Generation got cheap; mistakes got cheap to make in bulk.
Should non-engineers worry about this?
Worry is the wrong frame. The right frame is awareness.
Worth knowing: most software you rely on day-to-day will continue working most of the time. The systems that handle money, medical records, or anything safety-critical are built with deeper validation layers (regulators, auditors, formal testing) than the average startup app. They are slower-moving and harder to disrupt, which is exactly the point.
Worth doing: be a bit more sceptical of brand-new features in apps you depend on. Wait a fortnight before trusting a new banking integration with your salary. Read reviews of new AI-powered tools rather than buying on launch day. The probabilistic-engineering era rewards patience.
Worth ignoring: the headlines claiming AI is about to either solve all bugs or destroy all software. Neither is happening. The reality is more mundane and more interesting - we are working out, in public, how to do quality assurance for a kind of work that did not exist three years ago. That is going to take a few more iterations.
Tim Davis's essay on probabilistic engineering coined the framing we have used in this piece. The term is useful precisely because it points at a real shift without pretending the shift is wholly good or wholly bad.
Frequently asked questions
Q01 Is probabilistic engineering the same as vibe coding?
Q02 Does this mean software jobs are going away?
Q03 How can I tell if an app I use is built this way?
Q04 Is AI-written code less secure than human-written code?
Q05 Will this get better over time?
What Actually Happens Inside Claude Code or Cursor
Context Rot: Why Long AI Sessions Get Worse
Why AI Hasn't Replaced Human Experts (Yet)