Generating Unit Tests with AI: What Works, What Doesn't
AI coding assistants can churn out unit tests in minutes, but the shallow ones give you a false sense of confidence. Here's how to do it well.

Most codebases have a testing problem and most developers know it. Writing unit tests is the part of the job nobody enjoys, so it doesn't get done, so coverage decays, so bugs ship. AI coding assistants are genuinely good at the parts of test-writing that people procrastinate over (setup, mocks, the happy path), which means there's a real productivity win here. The trap is that the same tools can also produce shallow tests that pass for the wrong reasons and give you confidence you haven't earned.
Why have we been so bad at writing tests?
Three reasons, all of which the AI shortcut directly addresses.
Setup overhead. A new test file usually means importing the thing being tested, importing the test framework, building fake input, building a fake context (mocks, stubs, fixtures), and writing the assertion. Of those five steps, four are pure typing. If the typing takes longer than the thinking, most people skip the typing.
Coverage feels infinite. Once you start writing tests, you can always write more. There's no natural stopping point, no obvious priority order, and the marginal value of test N+1 is hard to feel. So people start, get tired, and stop somewhere arbitrary.
Tests don't pay for themselves on day one. The bug a test would have caught hasn't happened yet. The cost of writing the test is right now; the benefit is in some future incident you won't be sure was averted. That payoff structure makes tests easy to defer indefinitely.
How do AI assistants actually help?
Three concrete wins. The first is sheer typing speed. Asking GitHub Copilot, Claude Code or Cursor to "write unit tests for this file" produces a working test file with imports, setup, and the obvious happy-path assertions in a few seconds. That's the part nobody wanted to write anyway.
The second is suggestion of cases you wouldn't have thought of. A good assistant doesn't just test the happy path. It generates cases for empty input, null input, large input, the documented error condition, the off-by-one boundary. Most of those are obvious in hindsight; the value is that the AI lists them without you having to remember each one.
The third is conformance to your project's conventions. If your codebase uses a specific testing library, naming pattern, or fixture style, the AI matches it after seeing a few examples. That keeps the new tests readable next to the old ones, which matters for whoever maintains them later (often future you).
A practical workflow that works
Four steps, ordered. The first one is the most important and the most often skipped.
Write a project-level instructions file
Whatever your AI tool calls it (.github/copilot-instructions.md or AGENTS.md for GitHub Copilot, CLAUDE.md for Claude Code, .cursor/rules or the older .cursorrules for Cursor), create one. Put in: the testing framework you use, the naming convention for test files, the fixture pattern, what mocks look like, what kinds of cases you want covered (boundary, error, happy path), and what you do NOT want (snapshot-only tests, mocked-everything tests, tests that just restate the implementation). The AI then reads this on every request, so you don't have to repeat the brief.
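As a rough sketch, the testing section of such a file might look like the following. Every specific here (pytest, the paths, the fixture location) is a placeholder assumption; swap in whatever your project actually uses.

```markdown
# Testing conventions (example values, replace with your own)

- Framework: pytest
- Test files: tests/test_<module>.py, one test file per source file
- Fixtures: shared fixtures live in tests/conftest.py
- Mocks: mock only databases, network calls, and the clock
- Cover: the happy path, boundary values, documented error conditions

## Do not generate

- Snapshot-only tests
- Tests that mock every dependency
- Tests that restate the implementation line by line
```

Short and specific works better than long and aspirational: each line should be something you could check a generated test against.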
Generate tests for one file at a time
Not the whole module. Not the whole codebase. One file. The AI does better with focused context, you can review the output in one sitting, and you avoid the trap of merging 50 files of tests you haven't read.
Read every generated test before merging
Specifically check: does the test assert behaviour, or does it assert implementation? A test that checks 'calculateTotal(items) returns 42 for these items' is testing behaviour. A test that checks 'calculateTotal was called with these arguments' is testing implementation. The first survives a refactor; the second breaks the moment you change anything.
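Here is a small sketch of the two kinds of test side by side. `calculate_total`, the injected `tax_for` function, and the flat 5% tax are all hypothetical, chosen so the arithmetic is easy to check by hand. The second test asserts on how a collaborator was called rather than on the result.

```python
from unittest.mock import Mock


def calculate_total(items, tax_for):
    """Hypothetical: sum of item prices plus tax from an injected function."""
    subtotal = sum(item["price"] for item in items)
    return subtotal + tax_for(subtotal)


def flat_five_percent(subtotal):
    """Hypothetical tax rule: 5% of the subtotal, rounded down."""
    return subtotal * 5 // 100


ITEMS = [{"price": 30}, {"price": 10}]


def test_behaviour_total_is_42():
    # 30 + 10 = 40, and 5% of 40 is 2, so the total is 42.
    assert calculate_total(ITEMS, flat_five_percent) == 42


def test_implementation_tax_called_with_subtotal():
    # Pins down HOW the total is computed: one tax call on the subtotal.
    tax_for = Mock(return_value=2)
    calculate_total(ITEMS, tax_for)
    tax_for.assert_called_once_with(40)
```

If you later refactor `calculate_total` to apply tax per item, the first test keeps passing as long as the total is still 42. The second fails even though nothing a user cares about has changed.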
Add the bug-driven tests yourself
Every bug you've ever fixed deserves a regression test. The AI doesn't know about those bugs; only you do. After the AI has generated the obvious tests, scroll through the file's git history (or your incident log) and write the tests that prevent the bugs that have actually happened. These are the high-value tests.
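A regression test can be very small. The sketch below assumes an invented past bug, where a quantity field containing a thousands separator was parsed incorrectly; `parse_quantity` and the incident itself are hypothetical. What matters is the shape: the exact input that triggered the bug, the output it should have produced, and a comment saying why the test exists.

```python
def parse_quantity(text):
    """Hypothetical function, shown after the fix: accepts '1,000' style input."""
    return int(text.replace(",", "").strip())


def test_regression_thousands_separator():
    # Hypothetical incident: "1,000" was once read as 1 instead of 1000.
    # This is the input from the bug report, pinned so it cannot come back.
    assert parse_quantity("1,000") == 1000


def test_regression_surrounding_whitespace():
    # Same hypothetical incident: pasted values arrived with stray spaces.
    assert parse_quantity(" 12 ") == 12
```

An assistant can write the function call for you, but only the bug report tells you that "1,000" is the input worth testing.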
How do I tell a good test from a shallow one?
Three smells that show up consistently in AI-generated tests, with quick rewrites for each.
Tests that mirror the implementation. If the test reads like a translation of the function under test (every branch in the function has a corresponding test of the same branch, expressed the same way), it's not testing behaviour, it's just running the code. The rewrite: state the expected output in plain terms before looking at the code, then assert that. If the test passes against your stated expectation, keep it. If it fails, either the test or the code is wrong, and finding out which is exactly what the test is for.
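A sketch of the smell and its rewrite, using a made-up `apply_discount` function that works in cents. The first test recomputes the expected value with the same formula as the code, so it can never disagree with it. The second states a number worked out by hand.

```python
def apply_discount(price_cents, percent):
    """Hypothetical function under test: percentage discount on a price in cents."""
    return price_cents - price_cents * percent // 100


def test_mirrors_implementation():
    # Smell: the expected value is the implementation's formula, copied.
    # If the formula is wrong, the test is wrong in exactly the same way.
    price, percent = 2000, 15
    expected = price - price * percent // 100
    assert apply_discount(price, percent) == expected


def test_states_expected_output():
    # Rewrite: 15% off 20.00 is 17.00, decided before reading the code.
    assert apply_discount(2000, 15) == 1700
```

Both tests pass today. Only the second one would notice if someone changed the formula to something plausible but wrong.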
Over-mocked tests. A test that mocks every dependency tests almost nothing. The behaviour you care about lives in the interactions between things; if you mock those out, you're testing that the mocks work. The rewrite: mock the slow, expensive, or non-deterministic things (databases, network, time). Let the cheap, fast, deterministic things run for real.
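As an illustration, the sketch below replaces only the clock, which is the non-deterministic part, and lets everything else run for real. The functions `is_expired` and `count_expired` and the 24-hour rule are assumptions invented for the example.

```python
from datetime import datetime, timedelta


def is_expired(created_at, ttl_hours, now):
    """Hypothetical helper: True once a session is at least ttl_hours old."""
    return now() - created_at >= timedelta(hours=ttl_hours)


def count_expired(sessions, ttl_hours, now):
    """Hypothetical function under test: counts expired session timestamps."""
    return sum(1 for created_at in sessions if is_expired(created_at, ttl_hours, now))


def test_count_expired_with_fixed_clock():
    # Only the clock is faked. is_expired runs for real, so the test
    # exercises the interaction between the two functions.
    fixed_now = datetime(2024, 1, 2, 12, 0, 0)
    sessions = [
        datetime(2024, 1, 1, 11, 0, 0),  # 25 hours old: expired
        datetime(2024, 1, 1, 12, 0, 0),  # exactly 24 hours old: expired
        datetime(2024, 1, 2, 11, 0, 0),  # 1 hour old: still live
    ]
    assert count_expired(sessions, 24, now=lambda: fixed_now) == 2
```

Had the test mocked `is_expired` as well, it would only prove that `count_expired` can add up the return values of a mock.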
Tests that only assert no exception was thrown. AI sometimes writes a test that calls the function and then has no assertion beyond "didn't crash". That's a smoke test pretending to be a unit test. The rewrite: state what the function should return or do, and assert that specific thing. "Doesn't crash" is rarely the contract you actually want to enforce.
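The difference is easy to see in a short sketch. `normalize_email` is a hypothetical function whose assumed contract is "trim whitespace and lowercase".

```python
def normalize_email(raw):
    """Hypothetical function under test: trims whitespace and lowercases."""
    return raw.strip().lower()


def test_smoke_only():
    # Smell: passes as long as nothing raises, whatever comes back.
    normalize_email("  Alice@Example.COM ")


def test_asserts_contract():
    # Rewrite: states the specific output the function must produce.
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"
```

The first test would still pass if `normalize_email` returned its input untouched, or returned nothing at all.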
When should I still write the tests by hand?
Three categories where the AI shortcut backfires.
Tests of subtle business logic. If the function encodes a rule that took a human three meetings to nail down (a tax-calculation edge case, a refund-eligibility window, a permissions matrix), the AI cannot generate good tests for it because it doesn't have the meetings. Hand-write the tests; they're documentation of intent as much as they are tests.
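For a sense of what "documentation of intent" looks like, here is a sketch around an invented refund rule: refunds are allowed up to and including day 30 after purchase. The rule, the constant, and `is_refund_eligible` are all hypothetical. In a real project the numbers would come from whoever was in those meetings.

```python
from datetime import date, timedelta

REFUND_WINDOW_DAYS = 30  # hypothetical rule, assumed for this example


def is_refund_eligible(purchased_on, requested_on):
    """Hypothetical: eligible from the purchase date through day 30 inclusive."""
    age_days = (requested_on - purchased_on).days
    return 0 <= age_days <= REFUND_WINDOW_DAYS


def test_day_30_is_still_eligible():
    # The assumed decision: the window is inclusive of the last day.
    bought = date(2024, 3, 1)
    assert is_refund_eligible(bought, bought + timedelta(days=30))


def test_day_31_is_not_eligible():
    bought = date(2024, 3, 1)
    assert not is_refund_eligible(bought, bought + timedelta(days=31))


def test_request_dated_before_purchase_is_rejected():
    # Guards against bad data rather than a real customer scenario.
    bought = date(2024, 3, 1)
    assert not is_refund_eligible(bought, bought - timedelta(days=1))
```

The test names carry the decision. Someone reading `test_day_30_is_still_eligible` learns the rule without opening the implementation.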
Regression tests for bugs you've already fixed. The AI has no memory of the bug. Only you (or your incident report) know what specifically went wrong, what input triggered it, and what the right output should have been. Always hand-write these.
Integration and end-to-end tests. AI assistants are great at unit tests because the scope is small and the answer is determinable from the code. Integration tests depend on external systems, fixtures, and environment state, none of which the AI can see. It can write the structure, but it can't tell you whether the test actually exercises what you think it does. Hand-write the assertion; let AI fill in the boilerplate.