Why 'price per million tokens' lies about your AI bill
The sticker price on AI APIs hides a sneaky multiplier. Here's how tokenizer efficiency quietly inflates real bills, and how to compare like for like.

If you have ever pulled up a side by side of AI API prices, you have probably done what everyone does. Eye up the column that says 'price per million tokens', pick the cheapest one, and feel quietly clever about it.
I have to break some bad news. That column lies. Not on purpose, but it lies all the same.
The price tag is honest about one thing: how much the provider charges per million of their tokens. The catch is that every provider counts tokens differently. The same paragraph of English, the same chunk of code, the same JSON blob, will get chopped into very different numbers of tokens depending on whose model you send it to. So the cheaper sticker price can quietly turn into a fatter bill.
The supermarket unit-price problem
Here is the easiest analogy. Imagine you are at the supermarket comparing two packs of porridge oats. One says £1.50, the other says £1.80. Obviously you grab the cheaper one. Then you get home and realise the £1.50 pack was 500 grams and the £1.80 pack was 1 kilogram. Suddenly the 'more expensive' option is the better deal by a country mile.
That is why every UK supermarket is legally required to print a unit price (the cost per 100g, per litre, per kilo) next to the sticker. It is the only number that lets you compare like for like.
LLM pricing does not have that. The 'unit' is a token, but every provider defines a token slightly differently. So the per-million-token price is the sticker on the front of the pack, not the price per kilo.
What on earth is a token, in 30 seconds
A tokenizer is the bit of code that chops your text into the chunks a model actually sees. It is not words, and it is not letters. It is something in between, decided by the way the provider trained their model.
One token can be a whole short word, a chunk of a longer word, a piece of punctuation, or even a few characters of code or JSON. The same sentence can become 18 tokens on one provider and 27 on another, just because their tokenizers slice the language differently.
You can mostly ignore the mechanics. What you cannot ignore is the consequence: send the same prompt to two providers and you get billed for two different token counts.
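To make that concrete, here is a toy sketch in Python. Neither function is any real provider's tokenizer; both are made-up splitting rules, chosen only to show how the same string can produce two different token counts.

```python
import re

# The same JSON-ish string goes through both toy tokenizers.
SAMPLE = '{"city": "Manchester", "temperature_c": 14}'


def tokenize_coarse(text: str) -> list[str]:
    """Toy tokenizer A: each run of letters, digits or underscores is one token;
    every other non-space character is a token on its own."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_fine(text: str) -> list[str]:
    """Toy tokenizer B: same rule as A, but no token may exceed 4 characters,
    so longer runs are cut into 4-character pieces."""
    tokens: list[str] = []
    for piece in tokenize_coarse(text):
        tokens.extend(piece[i:i + 4] for i in range(0, len(piece), 4))
    return tokens


if __name__ == "__main__":
    coarse = tokenize_coarse(SAMPLE)
    fine = tokenize_fine(SAMPLE)
    print(len(coarse), coarse)
    print(len(fine), fine)
```

Same text, same meaning, two different counts. Real tokenizers are learned from data rather than hand-written like this, but the billing consequence is the same.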
How big is the gap, really?
It depends entirely on what you are sending. Plain English is the kindest content; the tokenizers are reasonably close. Structured stuff like JSON, YAML and tool definitions is where the gap opens up.
An analysis by TensorZero put some numbers on this. Using OpenAI's tokenizer as a 1.00x baseline:
- Claude Opus 4.7 produces 1.57x as many tokens on plain text, 1.53x on YAML, 1.70x on JSON, and 2.65x on tool definitions.
- Gemini 3.1 Pro produces 1.06x on text, 1.18x on YAML, 1.11x on JSON, and 1.82x on tools.
So for an English chat conversation, the gap is at its most forgiving, and with Gemini it is small enough that you can almost ignore it. For a coding agent that is constantly passing tool schemas and JSON back and forth, the gap is huge.
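The arithmetic behind that is one multiplication: sticker price times token ratio. A minimal Python sketch, using the ratios quoted above and a made-up sticker price of 10.00 per million input tokens for both models (an assumption for illustration, not anyone's real rate):

```python
# Token-count ratios quoted above, with OpenAI's tokenizer as the 1.00x baseline.
TOKEN_RATIO = {
    "Claude Opus 4.7": {"text": 1.57, "yaml": 1.53, "json": 1.70, "tools": 2.65},
    "Gemini 3.1 Pro": {"text": 1.06, "yaml": 1.18, "json": 1.11, "tools": 1.82},
}


def baseline_equivalent_price(sticker_price: float, model: str, content: str) -> float:
    """Price per million *baseline* tokens: the sticker price scaled by how many
    more tokens this model needs for this kind of content."""
    return sticker_price * TOKEN_RATIO[model][content]


if __name__ == "__main__":
    sticker = 10.00  # hypothetical sticker price, identical for both models
    for model, ratios in TOKEN_RATIO.items():
        for content in ratios:
            price = baseline_equivalent_price(sticker, model, content)
            print(f"{model:16} {content:6} {price:6.2f}")
```

The point of holding the sticker price equal is to isolate the tokenizer: any difference in the printed column comes from token counts alone.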
A simple mental model for comparing properly
You do not need a spreadsheet. You need two numbers in your head.
Step one: figure out what kind of content you actually send. Mostly chat with humans? Plain English. Building an agent? Tools, JSON and code. Summarising documents? Probably mostly prose, but check.
Step two: apply a rough multiplier. For plain English, treat the OpenAI sticker price as roughly real, and tack on around 50 to 60 percent for Claude. For tool-heavy or JSON-heavy work, the effective price can double or come close to trebling. Gemini sits in the middle for most things.
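If you would rather let a few lines of Python hold the two numbers for you, the two steps boil down to a weighted average. This is a rough sketch: the workload shares below are guesses for illustration, and the function name is made up for this example.

```python
def blended_multiplier(mix: dict[str, float], ratios: dict[str, float]) -> float:
    """Weighted average token ratio for a workload.

    mix    maps content type -> share of input volume (in baseline tokens).
    ratios maps content type -> token ratio against the baseline tokenizer.
    """
    total = sum(mix.values())
    if total <= 0:
        raise ValueError("mix must contain at least one positive share")
    return sum(share * ratios[kind] for kind, share in mix.items()) / total


# Ratios for Claude Opus 4.7 from the figures quoted earlier.
CLAUDE = {"text": 1.57, "yaml": 1.53, "json": 1.70, "tools": 2.65}

# Hypothetical workloads: swap in your own shares.
chat_app = {"text": 0.9, "json": 0.1}
coding_agent = {"text": 0.2, "json": 0.3, "tools": 0.5}

if __name__ == "__main__":
    print(round(blended_multiplier(chat_app, CLAUDE), 2))
    print(round(blended_multiplier(coding_agent, CLAUDE), 2))
```

Multiply the sticker price by the result and you have something much closer to a price per kilo.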
So next time you see 'Model X is half the price of Model Y' in a marketing comparison, ask the question that should always be asked when somebody is comparing prices: half the price of what?
The other hidden costs the sticker price does not mention
Even after you correct for tokenizer overhead, there are at least four more things the per million tokens number quietly leaves off:
- Output tokens cost more than input tokens. On most providers, the model generating text costs 3 to 5 times what feeding text in costs. If your app produces long responses, the input side of the price tag is the smaller half of the bill.
- Prompt caching changes the maths. If you reuse the same big system prompt across thousands of requests, most providers now let you cache it for a steep discount on subsequent reads. That can knock 70 to 90 percent off the input cost, but only if you are set up to use it.
- Long context tiers. Several providers charge different rates once your context window crosses a threshold (usually 128k or 200k tokens). Cheap below the line, suddenly not so cheap above it.
- Thinking tokens. Reasoning models bill you for the internal 'thinking' tokens you never see in the output. On a tricky problem, that hidden working out can be the bulk of the cost.
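Those four items can be folded into one rough per-request estimate. The sketch below is Python with entirely hypothetical prices and rules: output at 4x the input rate, cached reads at 90 percent off, thinking tokens billed at the output rate, and the whole input side doubling in price past a 200k threshold. Real providers differ on every one of these, so treat each as an assumption to replace with your provider's actual terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical rates in dollars per million tokens; not a real price list."""
    input_per_m: float
    output_per_m: float
    cached_read_per_m: float
    long_context_threshold: int      # input tokens above which the higher tier applies
    long_context_multiplier: float   # assumed surcharge on the input side


def request_cost(p: Pricing, fresh_input: int, cached_input: int,
                 visible_output: int, thinking: int) -> float:
    """Estimated cost in dollars for one request."""
    input_rate = p.input_per_m
    cached_rate = p.cached_read_per_m
    # Assumption: crossing the threshold reprices the entire input, not just the excess.
    if fresh_input + cached_input > p.long_context_threshold:
        input_rate *= p.long_context_multiplier
        cached_rate *= p.long_context_multiplier
    # Assumption: hidden thinking tokens are billed like visible output tokens.
    billed_output = visible_output + thinking
    total = (fresh_input * input_rate
             + cached_input * cached_rate
             + billed_output * p.output_per_m)
    return total / 1_000_000


EXAMPLE = Pricing(input_per_m=2.0, output_per_m=8.0, cached_read_per_m=0.2,
                  long_context_threshold=200_000, long_context_multiplier=2.0)

if __name__ == "__main__":
    cost = request_cost(EXAMPLE, fresh_input=2_000, cached_input=20_000,
                        visible_output=500, thinking=4_000)
    print(f"{cost:.4f}")
```

In this made-up request the 4,000 thinking tokens account for most of the total, even though the reader only ever sees 500 tokens of output.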
TensorZero's analysis cheerfully points out that their 5.3x figure is just the input token side. Once you stack the rest on, real workload costs can drift even further from the menu price.
What this means if you are just trying to pick a model
None of this means 'always go with whoever has the most efficient tokenizer'. The most efficient tokenizer for your workload might still be attached to a worse model for your task, or a slower one, or one with a more painful rate limit.
The point is narrower. Sticker price is a starting point, not an answer. If you are picking between two providers and they are within a factor of two on the menu, the tokenizer alone can flip which one is actually cheaper. So spend a few minutes with your real prompts before committing.
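The flip is easy to check. In this Python sketch the two sticker prices (6.00 and 10.00 per million tokens) are invented for illustration, and the function names are made up; the token ratios are the plain-text and tool-definition figures quoted earlier.

```python
def break_even_ratio(sticker_cheap: float, sticker_dear: float) -> float:
    """Token ratio (cheap provider's tokens / dear provider's tokens) at which
    both providers cost exactly the same for the same content."""
    if sticker_cheap <= 0 or sticker_dear <= 0:
        raise ValueError("sticker prices must be positive")
    return sticker_dear / sticker_cheap


def really_cheaper(sticker_cheap: float, sticker_dear: float, token_ratio: float) -> bool:
    """True if the lower sticker price still wins once token counts are factored in."""
    return token_ratio < break_even_ratio(sticker_cheap, sticker_dear)


if __name__ == "__main__":
    cheap, dear = 6.00, 10.00  # hypothetical sticker prices
    print(round(break_even_ratio(cheap, dear), 2))
    print(really_cheaper(cheap, dear, token_ratio=1.57))  # plain-text ratio
    print(really_cheaper(cheap, dear, token_ratio=2.65))  # tool-definition ratio
```

With these numbers the cheaper sticker wins on plain text and loses on tool definitions, which is exactly why the answer depends on your real prompts.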
The cheapest cup of coffee, like the cheapest API call, is the one that gives you what you actually needed without the side of regret.