What Is Pass@k? Why AI Benchmarks Mislead Users
Pass@k is the metric AI labs use to make code-writing AIs look great. Here's why the score you see is more forgiving than you'll ever be.

When you see headlines like 'GPT-5 hits 90% on HumanEval', check whether the score quoted is pass@1 or pass@k for some k larger than 1. A pass@k number can sound astonishing, and mathematically it is. But the gap between that number and the experience of actually using the AI is wider than most write-ups admit.
How does pass@k actually work?
The maths in plain English.
Pass@k (the AI evaluation metric popularized by the 2021 OpenAI Codex paper) measures the probability that at least one of k generated attempts solves the problem correctly. The formal definition uses an unbiased estimator over n samples per problem; the intuition is simpler than the formula.
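For reference, the estimator can be sketched in a few lines of Python. For each problem you draw n samples, count the c that pass the tests, and compute 1 - C(n-c, k) / C(n, k); the benchmark score is the average of that value over all problems. The function name and the sample counts below are illustrative, not taken from any published run.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n = samples drawn, c = samples that passed, k = attempts allowed.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative counts: 20 samples for one problem, 6 of them correct.
print(round(pass_at_k(20, 6, 1), 4))
print(round(pass_at_k(20, 6, 10), 4))
```

With k = 1 the estimate is just the fraction of samples that passed; raising k pushes it toward 1 even though the model itself has not changed.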
Imagine a coin that comes up heads 30% of the time. If you flip it once, you get heads 30% of the time. Flip it 10 times and ask 'did at least one come up heads?' - the answer is yes about 97% of the time. Flip it 100 times and the chance of getting at least one head is essentially 100%.
That's pass@k for an AI. Pass@1 is one attempt. Pass@10 is ten attempts. An attempt succeeds when it passes the problem's unit tests (HumanEval has 164 Python problems, each with its own tests), and the problem counts as solved if at least one attempt succeeds.
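The coin arithmetic is easy to check yourself. This is a minimal sketch that assumes every attempt is independent and has the same chance p of succeeding, which real model samples only approximate; the function name is made up for illustration.

```python
def at_least_one_success(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

# The 30% coin from the text, flipped 1, 10 and 100 times.
for k in (1, 10, 100):
    print(k, round(at_least_one_success(0.3, k), 4))
```

The only thing that changes between the three lines of output is k. The coin, like the model, is exactly as good as it was before.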
Why pass@k is exponentially forgiving
The two assumptions hidden in the number.
The metric makes two assumptions that hold in the benchmark setting but rarely in real use:
- You can run k attempts cheaply. Generating 100 candidate solutions to one problem is fine for an evaluation run. It's not fine when you're paying per-token, waiting at a chat prompt, or running an autonomous agent.
- You have a perfect oracle. HumanEval problems ship with unit tests that grade each attempt automatically, with no ambiguity. The score counts an attempt as 'correct' if the tests pass. In real coding work, knowing whether the answer is correct is most of the work - and you don't get a free oracle.
Strip away those assumptions and pass@k stops being a useful predictor of what you'll experience. It becomes a description of what an automated retry-and-verify system could theoretically achieve given infinite cheap retries and a perfect grader.
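To see how much each assumption is worth, here is a rough sketch under simplifying assumptions: attempts are independent, each succeeds with the same probability p, and every attempt costs the same. The function name and cost_per_attempt are hypothetical; plug in tokens, seconds or dollars as you like.

```python
def retry_outcomes(p: float, k: int, cost_per_attempt: float) -> dict:
    """Compare k retries with and without a grader.

    Assumes independent attempts that each succeed with probability p.
    cost_per_attempt is a hypothetical unit cost, not a real price.
    """
    return {
        # Cost grows linearly with k whether or not the retries help.
        "total_cost": k * cost_per_attempt,
        # A perfect grader lets you keep any attempt that passes.
        "success_with_oracle": 1.0 - (1.0 - p) ** k,
        # With no grader you pick one attempt blind, so the extra
        # attempts buy nothing: you are back at the single-try rate.
        "success_without_oracle": p,
    }

print(retry_outcomes(p=0.3, k=10, cost_per_attempt=1.0))
```

Under these assumptions, ten retries cost ten times as much and only pay off if something can tell the good attempt from the bad ones.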
What metric should you actually care about?
Three signals that match real-world use.
If you're picking an AI tool for everyday work, the headline pass@k number is the wrong question. The questions worth asking:
- Pass@1 (sometimes reported with greedy decoding). One try, no retries, no cherry-picking. This is closest to what you experience as a user.
- Consistency across reruns. If you ask the same question 5 times and get 5 different answers, the model isn't 'mostly right' - it's unreliable for any task where you need a stable answer.
- Task-completion rate on realistic work. Benchmarks like SWE-bench Verified (real GitHub issues) capture this better than synthetic puzzles. Look for those numbers when they're reported - they're usually well below HumanEval pass@k figures.
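Consistency across reruns is the one signal on this list you can measure yourself in a minute. A minimal sketch, assuming you have pasted the answers from several reruns into a list: it reports the share of reruns that agree with the most common answer. The helper name and the normalization (trim whitespace, lowercase) are illustrative choices, and exact string matching is a crude stand-in for 'same answer' when the output is code.

```python
from collections import Counter

def consistency(answers: list[str]) -> float:
    """Fraction of reruns that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    # Crude normalization so trivial formatting differences don't count.
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)

# Hypothetical answers from asking the same question five times.
reruns = ["42", "42", "41", "42 ", "forty-two"]
print(consistency(reruns))  # 0.6
```

A score near 1.0 does not mean the answer is right, only that the model is stable; a low score tells you not to trust any single run.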
Bottom line
How to read AI benchmark numbers.
When a benchmark headline shouts pass@10 or pass@100, mentally discount the score. The model's first-try success rate can be a fraction of the headline figure, and how small a fraction depends on k and on how the samples were drawn. That's still progress year over year - GPT-4 era pass@1 in the 60-70% range on HumanEval is real - but the gap between 'sometimes brilliant' and 'consistently right' is where most agentic AI promises break down today.
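One rough way to do that discounting is to invert the coin formula: ask what single-try rate p would produce the headline if pass@k were exactly 1 - (1 - p)^k. This is a back-of-the-envelope sketch, not a conversion. It assumes independent attempts and treats every problem as equally hard, whereas real benchmarks mix easy and hard problems. The function name is made up for illustration.

```python
def implied_pass_at_1(pass_at_k: float, k: int) -> float:
    """Single-try rate p that satisfies pass_at_k = 1 - (1 - p)**k.

    Back-of-the-envelope only: assumes independent attempts and
    equally hard problems.
    """
    if not 0.0 <= pass_at_k <= 1.0 or k < 1:
        raise ValueError("need 0 <= pass_at_k <= 1 and k >= 1")
    return 1.0 - (1.0 - pass_at_k) ** (1.0 / k)

# What a 90% headline would imply at different values of k.
for k in (1, 10, 100):
    print(k, round(implied_pass_at_1(0.90, k), 3))
```

The same 90% headline implies very different first-try rates depending on k, which is why the k in pass@k matters as much as the percentage.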
If you find yourself running an AI through 10 different framings of the same question to get an answer you trust, you're doing pass@k by hand. The fact that a researcher could do the same and the model would eventually be right is little comfort when you're the one doing the retries.