Ternary Bonsai: AI That Runs on Your iPhone

A new family of AI models uses 1.58-bit weights to fit an 8B model in 1.75 GB. It actually runs on a phone. Here's how, and where it falls short.

Close-up of an iPhone screen with a soft glow

Updated 11 June 2026 How we review →

By Rob11 June 2026 · 7 min read

For the last three years, running a useful AI model locally has meant either spending several hundred pounds on a beefy Mac or making peace with the cloud. Ternary Bonsai is one of the first releases that genuinely lets you run a useful 8-billion-parameter model on a phone. The trick is a quantization scheme that drops each weight from 16 bits down to 1.58 bits, which sounds like it shouldn't work but mostly does.

What does "1.58-bit" actually mean?

Normal AI models store each weight (each parameter) as a 16-bit floating-point number, which gives you about 65,000 possible values per weight. That's the source of the model's expressiveness; it's also why an 8-billion-parameter model needs 16 GB just to load.

Ternary quantization keeps only three possible values per weight: -1, 0, or +1, plus a shared scaling factor across each group of 128 weights. The information theory of choosing one of three states needs log2(3) bits, which works out to roughly 1.58 bits per weight. Hence the name.

The intuition for why this works: most weights in a trained model are close to zero anyway. Forcing them to exactly -1, 0, or +1 (scaled per group) throws away precision that the model wasn't really using. Train carefully and the quality loss is much smaller than you'd expect. Per the Ternary Bonsai release notes, the 8B model averages 75.5 across standard benchmarks (MMLU, HumanEval+, GSM8K and others), within a few points of Qwen3 8B, which is full-precision and runs nowhere near as fast on a phone.

Why does this matter for everyday devices?

Three real shifts.

RAM stops being the wall. A 16 GB Mac historically wouldn't load a serious 8B model with any usable context window. Ternary Bonsai 8B fits in 1.75 GB, which leaves the entire rest of the machine free for whatever else you're doing. The same logic scales up: a Mac that could only run 7B models in regular quantization can now run 30B-class models in ternary form.

The second is throughput. 82 tokens per second on an M4 Pro is well above conversational speed. That's the difference between "I asked the local model a question and waited" and "the local model replied as fast as the cloud would have". The cloud always wins on the largest models, but for the kinds of questions a personal assistant handles, the local version is now fast enough.

The third is privacy. A model running on your device sends nothing to a provider. For tasks that touch personal documents (drafting an email, summarising notes, asking about a contract) the privacy story for local models is fundamentally different from any cloud model regardless of how good its data practices are.

How does it compare to a hosted model?

Honestly: it's good, not great. Ternary Bonsai 8B's 75.5 benchmark average is competitive with other 8B models but well behind frontier hosted models like Claude Opus 4.7 or GPT-5, which are bigger and have far more training compute behind them. For tasks where you'd actually notice the gap (long-form reasoning, complex coding, nuanced writing) the hosted models stay clearly ahead.

For tasks where you wouldn't: short questions, document summarisation, drafting routine replies, pulling fields out of free text. The 8B local model is comparable enough that you wouldn't be able to tell the difference blindfolded. That's the practical band where running locally starts to make sense.

A useful frame: hosted models for the work where quality is the constraint, local models for the work where privacy, latency or always-availability is the constraint. They're not in direct competition; they're tools for different parts of the same job.

How would you actually run it?

Four practical paths from easiest to most ambitious.

LM Studio or Ollama on a Mac

If you're on Apple Silicon, Ollama and LM Studio both wrap the MLX runtime that Ternary Bonsai targets. Search for the model in the catalogue (or paste in the Hugging Face URL), download, and chat. Fifteen minutes of setup. No code.

Direct MLX from Python

If you're comfortable with Python, mlx-lm gives you a few-line API: load the model, give it a prompt, stream the response. Works for scripts and personal agents you want to wire into your own tools. Apple's MLX docs are good.

Self-hosted chat UI

Open WebUI or LibreChat point at a local Ollama server and give you a polished chat interface. Useful when you want the local model to feel like the cloud-chat UX you already use.

iPhone via the MLX iOS sample

Apple ships a sample app showing how to run MLX models on-device. Building it requires Xcode and a developer account; it's not the easy path. But seeing a 1.7B-parameter model reply at 27 tokens per second on a phone is worth the setup for the realisation alone.

Where does it fall short?

Three honest limitations worth knowing before you build anything on it.

Apple-first ecosystem. The headline performance numbers depend on MLX, which runs on Apple Silicon and nowhere else. Running on a Windows or Linux machine is possible but the optimised path doesn't exist; you'd be using a less efficient ternary-quantized runtime and seeing worse numbers. If your hardware isn't Apple, the value proposition is much weaker.

Context window. The release notes don't emphasise context length, and ternary models historically have shorter usable contexts than full-precision equivalents. For long-document tasks, the hosted models still win convincingly.

Specialised tasks. The benchmarks are general-purpose; on narrower specialisations (medical, legal, code in a specific framework), the model behaves like a generic 8B and may underperform a specialised hosted model. For specialised work, the local version is a starting point, not a destination.

None of these makes Ternary Bonsai uninteresting. They make it a tool for a specific class of jobs (privacy-sensitive, low-latency, Apple-Silicon-native) rather than a replacement for a frontier model.

Frequently asked questions

Q01Will this run on my Intel-based Mac or PC?

Technically yes (the model is open-source and Apache 2.0 licensed); practically much worse. The performance numbers in the release notes are for Apple Silicon with the MLX runtime. On an Intel Mac or a Windows machine you'd run a generic ternary-quantized inference path and see far slower token rates. If you're on non-Apple hardware, look at other quantized models like Llama 4 8B-Q4 or Qwen3 8B-Q4, which target your platform better.

Q02How does this compare to Llama 4 or Mistral?

Llama 4 8B at standard 4-bit quantization is roughly comparable on benchmarks but needs about 4 GB of RAM instead of 1.75 GB, and doesn't have MLX-native performance on Apple. Ternary Bonsai's advantage is specifically the smaller footprint plus the Apple-tuned inference path. On Linux, Llama 4 may be the better default.

Q03Is the 1.7B version actually useful?

For short questions, yes. For anything that needs structured reasoning over more than a few sentences, the 1.7B variant struggles. The 4B and 8B versions are the practical defaults; the 1.7B exists for genuinely constrained devices where the smaller version is the only one that fits.

Q04Will I notice any difference in quality vs ChatGPT or Claude?

Yes, on hard tasks. ChatGPT 5 or Claude Opus 4.7 are bigger, better-trained, and more capable on long-form work. The local model is comparable on short, well-defined tasks (summarisation, simple Q&A, basic drafting) and clearly behind on anything reasoning-heavy. Use both, send the easy stuff local and the hard stuff to the cloud.

Q05What's the battery cost of running this on an iPhone?

Significant. Inference is compute-heavy and pushes the phone harder than ordinary use. You wouldn't run a sustained chat session for an hour without noticing the battery drop. For occasional one-off questions, the cost is fine; for an always-on assistant, this generation of phones isn't quite ready.