The 'Caveman' Trick That Cuts AI Costs by 75%

Strip vowels and articles from your prompts and AI APIs cost a quarter as much. Does it actually work, and should you use it?


Updated 11 June 2026

By Rob · 11 June 2026 · 6 min read

Every so often a counter-intuitive trick shows up in the AI world and it turns out to mostly work. The Caveman project is the latest. The idea is simple: write your prompt normally, run it through a preprocessor that removes vowels and grammar fluff, send the resulting word-soup to your AI. Cost drops dramatically. Quality stays surprisingly intact. Which is interesting, because it tells you something about how these models actually read text.

What is Caveman?

Caveman is an open-source Python tool, available on GitHub, that sits between your prompt and the AI provider. It applies a series of compression rules: strip articles (a, an, the), strip most vowels from longer words, collapse common phrases into shorter forms, drop redundant punctuation. The result looks like an SMS from 2003, with a token count 60-80% lower than the original, depending on the source text.
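The rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not Caveman's actual code: the phrase table, the five-letter threshold for "longer words" and every function name here are assumptions made for the example.

```python
import re

ARTICLES = {"a", "an", "the"}
VOWELS = set("aeiouAEIOU")
MIN_LEN = 5  # assumed cut-off for "longer words"

# Hypothetical phrase table; a real tool would ship a much longer one.
PHRASES = {
    "in order to": "to",
    "as well as": "and",
    "due to the fact that": "because",
}

WORD = re.compile(r"[A-Za-z]+")


def collapse_phrases(text):
    """Replace wordy phrases with their shorter equivalents."""
    for verbose, terse in PHRASES.items():
        text = re.sub(rf"\b{re.escape(verbose)}\b", terse, text, flags=re.IGNORECASE)
    return text


def strip_vowels(word):
    """Remove vowels from words of MIN_LEN letters or more."""
    if len(word) < MIN_LEN:
        return word
    stripped = "".join(c for c in word if c not in VOWELS)
    return stripped or word  # never return an empty string


def compress(prompt):
    """Apply phrase collapsing, article removal, vowel stripping and punctuation clean-up."""
    text = collapse_phrases(prompt)

    def shrink(match):
        word = match.group(0)
        if word.lower() in ARTICLES:
            return ""
        return strip_vowels(word)

    text = WORD.sub(shrink, text)
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)  # drop repeated punctuation
    return " ".join(text.split())  # tidy whitespace left by removed articles
```

Under these assumed rules, compress("Please summarise the information in order to help the customer.") returns "Pls smmrs nfrmtn to help cstmr." The sketch only counts letters; whether the shorter string is also fewer tokens depends on the tokenizer.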

You'd expect the answer quality to fall off a cliff. It doesn't. Modern large language models seem to handle the compressed prompts about as well as the original on most factual or analytical tasks. Creative-writing prompts degrade more visibly; reasoning prompts are roughly stable.

Why does this even work?

Two reasons, and they're more interesting than the trick itself.

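The first is that these models don't really read words. They read tokens, which are sub-word chunks, and you are billed by the chunk rather than by the letter. Articles and filler phrases each cost at least one token while carrying little meaning, so dropping them is a dependable saving. Vowel-stripping is less predictable than it looks: the count for any given spelling depends on the tokenizer's vocabulary, and many tokenizers already store a common word such as "information" as a single token while splitting an unusual spelling such as "nfrmtn" into several pieces. The vowels may not carry much information, but whether they carry token-count is something to measure, not assume.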
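How a word splits depends entirely on the tokenizer's vocabulary. The toy sketch below uses a greedy longest-match splitter and two invented vocabularies to show the same pair of spellings coming out differently. Neither vocabulary belongs to a real model, and production tokenizers differ in detail, so treat this as an illustration of the mechanism only.

```python
def tokenize(word, vocab):
    """Greedy longest-match split into sub-word pieces.

    Falls back to a single character when no vocabulary entry matches.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, shrinking one character at a time.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Invented vocabulary in which the long word is built from fragments.
VOCAB_A = {"in", "form", "ation", "nfrmtn"}

# Invented vocabulary in which the common word is stored whole.
VOCAB_B = {"information", "fr", "mt"}

for name, vocab in (("A", VOCAB_A), ("B", VOCAB_B)):
    for word in ("information", "nfrmtn"):
        print(name, word, tokenize(word, vocab))
```

With vocabulary A, "information" takes three tokens and "nfrmtn" one. With vocabulary B the result flips: one token for "information" and four for "nfrmtn". Counting tokens with your provider's own tokenizer is the only reliable check.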

The second is that LLMs were trained on a vast corpus that includes plenty of compressed text: chat logs, code, shorthand, abbreviations, low-vowel languages. The model has seen "thx" mean "thanks" and "u" mean "you". Dropping articles and most vowels doesn't push it off the distribution; it just moves toward a register the model has already absorbed.

Both of these suggest there's a real ceiling here: you can compress prose, but you cannot compress the information it carries. A prompt that already says little can be cut to almost nothing. A prompt full of nuance tolerates much less compression before quality starts dropping.

How much do tokens really cost?

For an individual using ChatGPT or Claude through a chat interface, the answer is roughly "nothing". You pay a flat monthly fee and the per-token cost is amortised. Caveman saves you zero pounds because you're not billed per token.

For an API user (someone building an app that calls Anthropic, OpenAI or similar at scale), tokens do cost money. As of early 2026, the high-end Claude and GPT-5 tiers run at roughly £0.005 to £0.015 per 1,000 input tokens. A small per-call saving accumulates fast when you're making millions of calls. Reducing input tokens by 75% on a system that processes 100 million tokens a day saves real money, even on the cheaper models.
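To put numbers on that, here is a rough calculation using only the figures quoted above. The function names are made up for this example, and the sum covers input tokens only, ignoring output tokens and any volume discounts.

```python
def daily_input_cost(tokens_per_day, price_per_1k):
    """Daily spend on input tokens, in pounds."""
    return tokens_per_day / 1000 * price_per_1k


def cost_range(tokens_per_day, price_per_1k, reduction):
    """Return (floor, ceiling): compressed and uncompressed daily cost."""
    ceiling = daily_input_cost(tokens_per_day, price_per_1k)
    floor = ceiling * (1 - reduction)
    return floor, ceiling


TOKENS_PER_DAY = 100_000_000
REDUCTION = 0.75  # the headline figure; your own measured ratio may differ

for price in (0.005, 0.015):
    floor, ceiling = cost_range(TOKENS_PER_DAY, price, REDUCTION)
    print(
        f"£{price} per 1k: £{ceiling:,.0f}/day uncompressed, "
        f"£{floor:,.0f}/day compressed, saving £{ceiling - floor:,.0f}/day"
    )
```

At 100 million input tokens a day, a 75% reduction works out to a saving of about £375 a day at £0.005 per 1,000 tokens and about £1,125 a day at £0.015.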

The hobby category in between (people on the Claude or OpenAI API for personal projects) usually sits at low single-digit pounds per month. The savings from Caveman would round to pennies.

When is the trick worth it?

Three honest use cases.

High-volume production APIs

Background pipelines that process documents, classify support tickets or summarise transcripts. A 75% saving per call adds up across millions of calls. Building the compression step is a one-time engineering cost; the savings recur.

Long system prompts that don't change

If you have a 4,000-token system prompt that ships with every request, compressing it once and keeping the compressed version saves on every call. On most factual or analytical tasks the model handles the compressed system prompt about as well as the original, and you only pay for the compression effort once.

Cost-experimentation in development

When you're sizing what an API-backed product will cost at scale, running both compressed and uncompressed variants gives you a real range. The compressed version sets the floor; the uncompressed sets the ceiling.

When does it backfire?

Three places to avoid the trick.

Anything that depends on tone. A customer-facing email summariser needs to read the original tone of the email correctly. Strip the articles and the model will still understand the words; it will lose the social register that makes the difference between a formal complaint and a passive-aggressive one. Tone-sensitive tasks should keep the prose intact.

Code-generation prompts. If you're asking the model to write or refactor code, the prompt usually IS the spec, and the spec needs to be precise. Compressing the spec is the cheapest way to ship a bug. Keep code prompts as legible as you would for a human reviewer.

Anywhere the prompt is also documentation. Long-running prompts often double as documentation for the team: "here's what we ask the model to do, here's why". A compressed prompt is unreadable by humans, which means the documentation rots faster than the code does. The team-cost outweighs the token-cost.

Frequently asked questions

Q01. Does this actually save money on ChatGPT Plus or Claude Pro?

No. Those are flat-rate subscriptions; you're not billed per token. The savings only apply if you're calling the model via the API and paying per-token. For chat-interface users the trick is a curiosity, not a budget tool.

Q02. Will the compressed prompt confuse the model?

Usually no on factual or analytical tasks; sometimes yes on creative or tone-sensitive tasks. The safest way to find out is to A/B test on your actual prompts: send 100 calls each in compressed and uncompressed form, compare the outputs against your own quality bar. Most workflows are fine; some aren't.
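An A/B test like that can be wired up with a small harness. The sketch below is one possible shape, not a prescribed method: compress, call_model and score are placeholders you would supply yourself (your compressor, your API client, and your own quality check), and none of them refers to a real library.

```python
from statistics import mean


def ab_test(prompts, compress, call_model, score):
    """Score each prompt in original and compressed form.

    compress(prompt) -> str            your compression step
    call_model(prompt) -> str          your API call
    score(prompt, output) -> float     your quality bar, judged against the original prompt
    """
    original, compressed = [], []
    for prompt in prompts:
        original.append(score(prompt, call_model(prompt)))
        compressed.append(score(prompt, call_model(compress(prompt))))
    result = {"original": mean(original), "compressed": mean(compressed)}
    # A negative delta means the compressed prompts scored worse on average.
    result["delta"] = result["compressed"] - result["original"]
    return result
```

Scoring against the original prompt, not the compressed one, keeps the comparison tied to what you actually meant to ask. Run it on a sample of real production prompts, around 100 as suggested above, so the result reflects your workload.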

Q03. Is there a quality cost I should expect?

There's usually a small degradation on outputs that depend on nuance (tone, register, idiom). For most factual prompts the degradation is below the variance you'd see across two identical-prompt API calls anyway. Reasoning-heavy prompts seem robust. Code prompts are the clearest case where you should NOT compress.

Q04. Can I just write compressed prompts myself instead of running a tool?

Yes, and many people do. Stripping articles and common filler is a skill you can pick up in an afternoon. The advantage of a tool is consistency across long system prompts; the advantage of doing it manually is keeping the intent visible to yourself. For a hobby project, manual compression is fine.

Q05. Are model providers going to penalise this kind of thing?

There's no policy against it and no technical reason to expect one. Providers bill per token regardless of how readable the tokens are. If they wanted to push back, they'd more likely raise the per-token price than try to detect compressed prompts. So far there's no indication either is happening.