The 'Caveman' Trick That Cuts AI Costs by 75%
Strip vowels and articles from your prompts and the input side of your AI API bill can drop to about a quarter. Does it actually work, and should you use it?

Every so often a counter-intuitive trick shows up in the AI world and turns out to mostly work. The Caveman project is the latest. The idea is simple: write your prompt normally, run it through a preprocessor that removes vowels and grammatical fluff, and send the resulting word-soup to your AI. Input costs drop dramatically and quality stays surprisingly intact, which is interesting because it tells you something about how these models actually read text.
What is Caveman?
Caveman is an open-source Python tool (you can find it on GitHub) that sits between your prompt and the AI provider. It applies a series of compression rules: strip articles (a, an, the), strip most vowels from longer words, collapse common phrases into shorter forms, drop redundant punctuation. The result is a prompt that looks like an SMS from 2003 but reduces the token count by 60-80% depending on the source text.
You'd expect the answer quality to fall off a cliff. It doesn't. Modern large language models seem to handle the compressed prompts about as well as the original on most factual or analytical tasks. Creative-writing prompts degrade more visibly; reasoning prompts are roughly stable.
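To make those rules concrete, here is a minimal sketch of a compressor in the same spirit. It is not Caveman's implementation: the phrase table, the five-letter threshold for vowel-stripping and the function names are assumptions for illustration. It works on characters and words, so how many tokens it actually saves has to be measured with your provider's tokenizer.

```python
import re

# Hypothetical phrase table: multi-word filler collapsed to shorter forms.
PHRASES = {
    "in order to": "to",
    "as well as": "and",
    "due to the fact that": "because",
    "please": "",
}

ARTICLES = {"a", "an", "the"}
VOWELS = set("aeiouAEIOU")


def strip_vowels(word: str, min_len: int = 5) -> str:
    """Drop every vowel except the first letter from purely alphabetic words of min_len+ letters."""
    if len(word) < min_len or not word.isalpha():
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch not in VOWELS)


def caveman_compress(text: str) -> str:
    """Apply simplified Caveman-style rules (an illustrative sketch only)."""
    # 1. Collapse common phrases into shorter forms, ignoring case.
    for phrase, short in PHRASES.items():
        text = re.sub(rf"\b{re.escape(phrase)}\b", short, text, flags=re.IGNORECASE)
    # 2. Drop punctuation that ends a word; '?' is kept and decimals like 3.5 survive.
    text = re.sub(r"[,;:!.]+(?=\s|$)", "", text)
    # 3. Remove articles, then strip vowels from the longer remaining words.
    words = [w for w in text.split() if w.lower() not in ARTICLES]
    return " ".join(strip_vowels(w) for w in words)


print(caveman_compress("Please summarise the following report in order to find the key risks."))
# → smmrs fllwng rprt to find key rsks
```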
Why does this even work?
Two reasons, and they're more interesting than the trick itself.
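The first is that language models don't really read words. They read tokens, which are sub-word chunks, and the arithmetic there is less obvious than it looks: a common word like "information" is usually a single token, while a rare string like "nfrmtn" tends to split into several. So most of the saving comes from the words that disappear entirely (articles, filler, collapsed phrases), and vowel-stripping should be checked against your provider's tokenizer rather than assumed to help. What the vowels do show is that the model doesn't need every character to recover a word: they weren't carrying as much information as you'd think.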
The second is that LLMs were trained on a vast corpus that includes plenty of compressed text: chat logs, code, shorthand, abbreviations, and languages whose scripts usually omit vowels. The model has seen "thx" mean "thanks" and "u" mean "you". Dropping articles and most vowels doesn't push it off the distribution; it just moves the prompt toward a register the model has already absorbed.
Both point to a real ceiling: you can compress prose, but you cannot compress information. A prompt padded with filler can be cut to almost nothing; a prompt full of nuance can lose only a little before quality starts dropping.
How much do tokens really cost?
For an individual using ChatGPT or Claude through a chat interface, the answer is roughly "nothing". You pay a flat monthly fee and the per-token cost is amortised. Caveman saves you zero pounds because you're not billed per token.
For an API user (someone building an app that calls Anthropic, OpenAI or similar at scale), tokens do cost money. As of early 2026, the high-end Claude and GPT-5 tiers run at roughly £0.005 to £0.015 per 1,000 input tokens. A small per-call saving accumulates fast when you're making millions of calls: reducing input tokens by 75% on a system that processes 100 million input tokens a day saves real money, even on the cheaper models. The saving applies only to the prompt side of the bill, though; output tokens are priced separately and compression doesn't reduce them.
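The "real money" claim is easy to check with the figures above. This sketch uses only the article's numbers (100 million input tokens a day, £0.005 to £0.015 per 1,000 tokens, a 75% reduction); `daily_saving_gbp` is an illustrative helper, not part of any tool.

```python
def daily_saving_gbp(tokens_per_day: int, price_per_1k_gbp: float, reduction: float) -> float:
    """Input-token spend avoided per day; output tokens are not included."""
    daily_cost = tokens_per_day / 1_000 * price_per_1k_gbp
    return daily_cost * reduction


for price in (0.005, 0.015):
    saving = daily_saving_gbp(100_000_000, price, 0.75)
    print(f"£{price}/1K tokens: £{saving:,.0f} saved per day")
```

That works out to roughly £375 a day at the cheaper rate and £1,125 a day at the higher one.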
The hobby category in between (people using the Claude or OpenAI API for personal projects) usually spends low single-digit pounds per month, so the savings from Caveman would amount to pennies or, at most, a pound or two.
When is the trick worth it?
Three honest use cases.
High-volume production APIs
Background pipelines that process documents, classify support tickets or summarise transcripts. A 75% cut in input tokens per call adds up across millions of calls. Building the compression step is a one-time engineering cost; the savings recur.
Long system prompts that don't change
If you have a 4,000-token system prompt that ships with every request, compressing it once and keeping the compressed version saves tokens on every call. The model usually handles the compressed system prompt about as well as the original, and you only pay for the compression effort once.
Cost-experimentation in development
When you're sizing what an API-backed product will cost at scale, running both compressed and uncompressed variants gives you a real range. The compressed version sets the floor; the uncompressed sets the ceiling.
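The floor-and-ceiling estimate can be sketched as below. The token counts, call volume and the £0.02 per 1,000 output-token price are hypothetical placeholders (only the £0.005 input price comes from the figures above); measure your own per-call token counts with your provider's tokenizer.

```python
def cost_range_gbp(
    uncompressed_in: int,
    compressed_in: int,
    out_tokens: int,
    calls: int,
    in_price_per_1k: float,
    out_price_per_1k: float,
) -> tuple[float, float]:
    """Return (floor, ceiling) spend for `calls` requests.

    Token counts are per call. Output tokens are the same for both
    variants because only the prompt is compressed.
    """
    def cost(in_tokens: int) -> float:
        per_call = (in_tokens * in_price_per_1k + out_tokens * out_price_per_1k) / 1_000
        return per_call * calls

    return cost(compressed_in), cost(uncompressed_in)


# Hypothetical numbers: 2,000 input tokens cut to 500, 300 output tokens, 1M calls.
floor, ceiling = cost_range_gbp(2_000, 500, 300, 1_000_000, 0.005, 0.02)
print(f"floor £{floor:,.0f}, ceiling £{ceiling:,.0f}")
```

With these placeholders the range is £8,500 to £16,000: the total bill falls by about 47% rather than 75%, because the output side doesn't shrink.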
When does it backfire?
Three places to avoid the trick.
Anything that depends on tone. A customer-facing email summariser needs to read the original tone of the email correctly. Strip the articles and the model will still understand the words; it will lose the social register that makes the difference between a formal complaint and a passive-aggressive one. Tone-sensitive tasks should keep the prose intact.
Code-generation prompts. If you're asking the model to write or refactor code, the prompt usually IS the spec, and the spec needs to be precise. Compressing the spec is the cheapest way to ship a bug. Keep code prompts as legible as you would for a human reviewer.
Anywhere the prompt is also documentation. Long-running prompts often double as documentation for the team: "here's what we ask the model to do, here's why". A compressed prompt is hard for humans to read, which means the documentation rots faster than the code does. The cost to the team outweighs the token saving.
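The three backfire cases can be written down as a simple gate in front of the compressor. This is a hypothetical sketch: the `Task` labels and the `should_compress` function are invented for illustration and are not part of Caveman.

```python
from enum import Enum, auto


class Task(Enum):
    """Hypothetical task labels for an API-backed product."""
    CLASSIFY_TICKET = auto()
    SUMMARISE_TRANSCRIPT = auto()
    TONE_SENSITIVE = auto()   # e.g. customer-facing email summaries
    CODE_GENERATION = auto()  # the prompt is the spec


# Tasks where the prose should stay intact.
KEEP_PROSE = {Task.TONE_SENSITIVE, Task.CODE_GENERATION}


def should_compress(task: Task, prompt_doubles_as_docs: bool) -> bool:
    """Return True only when none of the three backfire cases applies."""
    return task not in KEEP_PROSE and not prompt_doubles_as_docs
```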