What Is Pass@k? Why AI Benchmarks Mislead Users

Pass@k is the metric AI labs use to make code-writing AIs look great. Here's why the score you see is more forgiving than you'll ever be.

By Rob · 16 June 2026 · 5 min read

When you see headlines like 'GPT-5 hits 90% on HumanEval', the score quoted is often pass@k for some k larger than 1, and the write-up doesn't always say which k. The number sounds astonishing - and it is, mathematically. But the gap between that number and the experience of actually using the AI is wider than most write-ups admit.

How does pass@k actually work?

The maths in plain English.

Pass@k (the AI evaluation metric popularised by the 2021 OpenAI Codex paper) measures the probability that at least one of k generated attempts solves the problem correctly. The formal definition uses an unbiased estimator computed from n samples per problem; the intuition is simpler than the formula.
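For readers who want the formula anyway, here is a minimal sketch of that estimator for a single problem. The notation (n samples drawn, c of them correct, k attempts allowed) follows the Codex paper; the function name pass_at_k is just a label chosen for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated for the problem
    c: how many of those samples passed the tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random set of k
    # samples contains no correct one. math.comb returns 0 when
    # k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 samples of which 3 pass, pass_at_k(10, 3, 1) comes out at 0.3 and pass_at_k(10, 3, 5) at roughly 0.92. A benchmark score is this value averaged over every problem in the set.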

Imagine a coin that comes up heads 30% of the time. If you flip it once, you get heads 30% of the time. Flip it 10 times and ask 'did at least one come up heads?' - the answer is yes about 97% of the time, because the chance of ten tails in a row is only 0.7^10, or about 3%. Flip it 100 times and the chance of getting at least one head is essentially 100%.

That's pass@k for an AI. Pass@1 is one attempt. Pass@10 is ten attempts. A problem counts as solved if at least one attempt passes its unit tests (HumanEval has 164 Python problems, each with its own tests).
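The coin arithmetic is easy to check yourself. This is a small sketch that assumes every attempt is independent and has the same success chance p, which is the coin analogy rather than a claim about how any particular model behaves; the function name is made up for the example.

```python
def at_least_one_success(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    # All k attempts fail with probability (1 - p) ** k.
    return 1.0 - (1.0 - p) ** k


for k in (1, 10, 100):
    print(f"k={k:>3}: {at_least_one_success(0.3, k):.4f}")
# k=  1: 0.3000
# k= 10: 0.9718
# k=100: 1.0000
```

Real benchmark scores average over problems of very different difficulty, so you can't convert a reported pass@1 into pass@10 with this formula alone. It only shows why the curve climbs so quickly.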

Why pass@k is exponentially forgiving

The two assumptions hidden in the number.

The metric makes two assumptions that hold in the benchmark setting but rarely in real use:

  • You can run k attempts cheaply. Generating 100 candidate solutions to one problem is fine for an evaluation run. It's not fine when you're paying per-token, waiting at a chat prompt, or running an autonomous agent.
  • You have a perfect oracle. HumanEval problems ship with unit tests that grade each attempt with no ambiguity. The score counts an attempt as 'correct' if the tests pass. In real coding work, knowing whether the answer is correct is most of the work - and you don't get a free oracle.

Strip away those assumptions and pass@k stops being a useful predictor of what you'll experience. It becomes a description of what an automated retry-and-verify system could theoretically achieve given infinite cheap retries and a perfect grader.
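To make that concrete, here is a rough sketch of such a retry-and-verify loop. Both callables are hypothetical stand-ins: generate would call a model and passes_tests would run a trusted test suite. Neither is a real API.

```python
import random
from typing import Callable, Optional, Tuple


def retry_with_oracle(
    generate: Callable[[], str],
    passes_tests: Callable[[str], bool],
    k: int,
) -> Tuple[Optional[str], int]:
    """Try up to k candidates; return (first passing one or None, attempts used)."""
    for attempt in range(1, k + 1):
        candidate = generate()
        # This check is the oracle. Without it there is no way to tell
        # a passing candidate from a failing one.
        if passes_tests(candidate):
            return candidate, attempt
    return None, k


# Toy stand-in for a model that is right 30% of the time (an assumption
# made purely for illustration).
rng = random.Random(0)
result, attempts = retry_with_oracle(
    generate=lambda: "good" if rng.random() < 0.3 else "bad",
    passes_tests=lambda candidate: candidate == "good",
    k=10,
)
print(result, attempts)
```

Every pass through the loop costs a full generation, and the loop only works because passes_tests is trustworthy. Remove either and you are back to judging a single answer on your own.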

What metric should you actually care about?

Three signals that match real-world use.

If you're picking an AI tool for everyday work, the headline pass@k number is the wrong question. The questions worth asking:

  • Pass@1. One try, no retries, no cherry-picking. Greedy-decoding accuracy, where the model always emits its single most likely output, is a close cousin. This is closest to what you experience as a user.
  • Consistency across reruns. If you ask the same question 5 times and get 5 different answers, the model isn't 'mostly right' - it's unreliable for any task where you need a stable answer.
  • Task-completion rate on realistic work. Benchmarks like SWE-Bench Verified (real GitHub issues) capture this better than synthetic puzzles. Look for those numbers when they're reported - they're typically half or less of HumanEval pass@k figures.
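The consistency check in the second point is simple to run yourself. A minimal sketch, assuming you have collected the answers from several reruns as strings and that an exact match is a fair test of 'same answer' (for free-form text you would need to normalise them first); the function name is invented for this example.

```python
from collections import Counter
from typing import List


def agreement_rate(answers: List[str]) -> float:
    """Share of reruns that match the most common answer (1.0 = fully consistent)."""
    if not answers:
        raise ValueError("need at least one answer")
    # most_common(1) returns [(answer, count)] for the top answer.
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


print(agreement_rate(["42", "42", "41", "42", "40"]))  # 0.6
print(agreement_rate(["a", "b", "c", "d", "e"]))       # 0.2
```

A high agreement rate doesn't prove the answer is right, only that the model is stable. A low one tells you not to trust any single run.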

Bottom line

How to read AI benchmark numbers.

When a benchmark headline shouts pass@10 or pass@100, mentally discount the score. As a rough rule of thumb, the model probably gets the answer right on its first try about a quarter to a half as often as the headline suggests. That's still progress year over year - GPT-4 era pass@1 in the 60-70% range on HumanEval is real - but the gap between 'sometimes brilliant' and 'consistently right' is where most agentic AI promises break down today.

If you find yourself running an AI through 10 different framings of the same question to get the answer you trust, you're doing pass@k by hand. The fact that a researcher could do the same and the model would eventually be right isn't a comforting reframe of your experience.

Q01 What does pass@1 mean?
Pass@1 is the probability that an AI gets the answer right on its first attempt, with no retries. It's the most realistic predictor of what you'll experience as a user. Headline 'pass@k' numbers for k > 1 are systematically higher because they let the model try k times.
Q02 Why do AI labs report pass@10 or pass@100 instead of pass@1?
Because the numbers look much better. If each attempt succeeds independently 30% of the time, pass@1 is 30%, pass@10 is about 97% and pass@100 is essentially 100%. The metric is mathematically sound (it shows what's possible with retries + an oracle) but it's marketed as if it's what you'll experience, which it usually isn't.
Q03 Is pass@k useful for anything?
Yes - for AI agents that can verify their own output (running unit tests, checking maths, validating against an API) and have the budget to retry. In those systems pass@10 or pass@100 with an oracle is genuinely close to user-perceived success. Coding agents like Devin or Cursor background tasks fit this pattern. Consumer chat doesn't.
Q04 What's a better benchmark than HumanEval pass@k?
SWE-Bench Verified (real GitHub issues resolved end-to-end) is much closer to real-world software engineering than HumanEval's 164 hand-crafted Python puzzles. Pass@1 on SWE-Bench Verified for frontier models in early 2026 sits in the 30-50% range - meaningfully lower than HumanEval pass@10 numbers in the 90s.