Ternary Bonsai: AI That Runs on Your iPhone
A new family of AI models uses 1.58-bit weights to fit an 8B model in 1.75 GB. It actually runs on a phone. Here's how, and where it falls short.

For the last three years, running a useful AI model locally has meant either spending several hundred pounds on a beefy Mac or making peace with the cloud. Ternary Bonsai is one of the first releases that genuinely lets you run a useful 8-billion-parameter model on a phone. The trick is a quantization scheme that drops each weight from 16 bits down to 1.58 bits, which sounds like it shouldn't work but mostly does.
What does "1.58-bit" actually mean?
Normal AI models store each weight (each parameter) as a 16-bit floating-point number, which gives you about 65,000 possible values per weight. That's the source of the model's expressiveness; it's also why an 8-billion-parameter model needs 16 GB just to load.
Ternary quantization keeps only three possible values per weight: -1, 0, or +1, plus a shared scaling factor across each group of 128 weights. The information theory of choosing one of three states needs log2(3) bits, which works out to roughly 1.58 bits per weight. Hence the name.
The intuition for why this works: most weights in a trained model are close to zero anyway. Forcing them to exactly -1, 0, or +1 (scaled per group) throws away precision that the model wasn't really using. Train carefully and the quality loss is much smaller than you'd expect. Per the Ternary Bonsai release notes, the 8B model averages 75.5 across standard benchmarks (MMLU, HumanEval+, GSM8K and others), within a few points of Qwen3 8B, which is full-precision and runs nowhere near as fast on a phone.
Why does this matter for everyday devices?
Three real shifts.
RAM stops being the wall. A 16 GB Mac historically wouldn't load a serious 8B model with any usable context window. Ternary Bonsai 8B fits in 1.75 GB, which leaves the entire rest of the machine free for whatever else you're doing. The same logic scales up: a Mac that could only run 7B models in regular quantization can now run 30B-class models in ternary form.
The second is throughput. 82 tokens per second on an M4 Pro is well above conversational speed. That's the difference between "I asked the local model a question and waited" and "the local model replied as fast as the cloud would have". The cloud always wins on the largest models, but for the kinds of questions a personal assistant handles, the local version is now fast enough.
The third is privacy. A model running on your device sends nothing to a provider. For tasks that touch personal documents (drafting an email, summarising notes, asking about a contract) the privacy story for local models is fundamentally different from any cloud model regardless of how good its data practices are.
How does it compare to a hosted model?
Honestly: it's good, not great. Ternary Bonsai 8B's 75.5 benchmark average is competitive with other 8B models but well behind frontier hosted models like Claude Opus 4.7 or GPT-5, which are bigger and have far more training compute behind them. For tasks where you'd actually notice the gap (long-form reasoning, complex coding, nuanced writing) the hosted models stay clearly ahead.
For tasks where you wouldn't: short questions, document summarisation, drafting routine replies, pulling fields out of free text. The 8B local model is comparable enough that you wouldn't be able to tell the difference blindfolded. That's the practical band where running locally starts to make sense.
A useful frame: hosted models for the work where quality is the constraint, local models for the work where privacy, latency or always-availability is the constraint. They're not in direct competition; they're tools for different parts of the same job.
How would you actually run it?
Four practical paths from easiest to most ambitious.
LM Studio or Ollama on a Mac
If you're on Apple Silicon, Ollama and LM Studio both wrap the MLX runtime that Ternary Bonsai targets. Search for the model in the catalogue (or paste in the Hugging Face URL), download, and chat. Fifteen minutes of setup. No code.
Direct MLX from Python
If you're comfortable with Python, mlx-lm gives you a few-line API: load the model, give it a prompt, stream the response. Works for scripts and personal agents you want to wire into your own tools. Apple's MLX docs are good.
Self-hosted chat UI
Open WebUI or LibreChat point at a local Ollama server and give you a polished chat interface. Useful when you want the local model to feel like the cloud-chat UX you already use.
iPhone via the MLX iOS sample
Apple ships a sample app showing how to run MLX models on-device. Building it requires Xcode and a developer account; it's not the easy path. But seeing a 1.7B-parameter model reply at 27 tokens per second on a phone is worth the setup for the realisation alone.
Where does it fall short?
Three honest limitations worth knowing before you build anything on it.
Apple-first ecosystem. The headline performance numbers depend on MLX, which runs on Apple Silicon and nowhere else. Running on a Windows or Linux machine is possible but the optimised path doesn't exist; you'd be using a less efficient ternary-quantized runtime and seeing worse numbers. If your hardware isn't Apple, the value proposition is much weaker.
Context window. The release notes don't emphasise context length, and ternary models historically have shorter usable contexts than full-precision equivalents. For long-document tasks, the hosted models still win convincingly.
Specialised tasks. The benchmarks are general-purpose; on narrower specialisations (medical, legal, code in a specific framework), the model behaves like a generic 8B and may underperform a specialised hosted model. For specialised work, the local version is a starting point, not a destination.
None of these makes Ternary Bonsai uninteresting. They make it a tool for a specific class of jobs (privacy-sensitive, low-latency, Apple-Silicon-native) rather than a replacement for a frontier model.
Local AI Image Generators: A Beginner's Guide for 2026
Free AI Tools You Should Be Using in 2026