An 8-billion parameter language model that runs on an iPhone at 40 tokens per second, fits in 1.15 GB of memory, and uses a fraction of the energy of standard models. That is what PrismML announced when it emerged from stealth in late March 2026, releasing the world’s first commercially viable 1-bit large language models under an open-source Apache 2.0 license.
The company’s Bonsai family of models — 8B, 4B, and 1.7B — represents a fundamental mathematical shift in how neural networks are built. Every layer, from embeddings to attention to the language model head, operates at 1-bit precision. There are no 16-bit escape hatches. The result is a model 14 times smaller than its full-precision counterparts while remaining competitive on standard benchmarks.
The Intelligence Density Leap
PrismML introduces a concept it calls “intelligence density” — the benchmark performance a model delivers per gigabyte of its memory footprint. By this measure, the 1-bit Bonsai 8B scores 1.06/GB. The closest 8B-class competitor, Qwen3 8B, scores 0.10/GB, more than a tenfold gap. That is not an incremental improvement. It is a different regime.
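The announcement does not spell out the exact formula behind the metric; a plausible reading, assumed here, is an aggregate benchmark score divided by model size in gigabytes. A minimal sketch under that assumption (the benchmark scores below are hypothetical values chosen so the densities match the announced figures):

```python
def intelligence_density(benchmark_score: float, size_gb: float) -> float:
    """Assumed definition: aggregate benchmark score per gigabyte of footprint."""
    return benchmark_score / size_gb

# Hypothetical aggregate scores, sized so the results reproduce the
# announced densities: Bonsai 8B at 1.15 GB, Qwen3 8B at ~16 GB in 16-bit.
print(intelligence_density(1.22, 1.15))   # ~1.06 per GB
print(intelligence_density(1.60, 16.0))   # 0.10 per GB
```

Under this reading, the tenfold density gap comes almost entirely from the denominator: the models score in the same neighborhood, but one is fourteen times smaller.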
The practical implications are immediate. A standard 16-bit 8B model requires roughly 16 GB of memory — more than any iPhone can host. Bonsai 8B needs just 1.15 GB. It runs at approximately 131 tokens per second on an M4 Pro Mac, 368 tokens per second on an RTX 4090, and roughly 44 tokens per second on an iPhone 17 Pro Max.
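The back-of-envelope arithmetic behind those footprints is straightforward. The sketch below assumes 2 bytes per parameter for 16-bit weights and 1 bit per parameter for Bonsai, with the remainder of the reported 1.15 GB attributable to runtime overhead:

```python
PARAMS = 8e9  # 8-billion-parameter model

fp16_gb = PARAMS * 2 / 1e9    # 2 bytes per weight -> ~16 GB
onebit_gb = PARAMS / 8 / 1e9  # 1 bit per weight   -> ~1 GB of raw weights

print(f"16-bit: {fp16_gb:.0f} GB, 1-bit weights: {onebit_gb:.0f} GB")
print(f"reduction at Bonsai's reported 1.15 GB: {fp16_gb / 1.15:.1f}x")
# 16 / 1.15 is roughly 13.9x, matching the announced ~14x
```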
These are not stripped-down models sacrificing reasoning for size. On a broad benchmark suite covering instruction following, multi-step reasoning, and tool use, Bonsai 8B remains competitive with leading full-precision 8B models such as Llama 3 8B.
Why 1-Bit Matters Beyond Speed
The efficiency story extends well beyond memory savings. Energy consumption drops by roughly 4 to 5 times compared to 16-bit models. On an M4 Pro, Bonsai 8B requires 0.074 mWh per token. On an iPhone 17 Pro Max, just 0.068 mWh per token.
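To put 0.068 mWh per token in perspective, one can estimate how many tokens a single phone charge could generate. The battery capacity below is a hypothetical round figure, not a published spec, and the calculation ignores all other power draw:

```python
BATTERY_MWH = 18_000   # hypothetical ~18 Wh phone battery (assumption)
MWH_PER_TOKEN = 0.068  # reported Bonsai 8B figure on iPhone 17 Pro Max

tokens_per_charge = BATTERY_MWH / MWH_PER_TOKEN
print(f"~{tokens_per_charge:,.0f} tokens per full charge")
# on the order of 260,000 tokens under these assumptions
```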
For agentic workloads — where AI systems must sustain reasoning over many steps — the throughput advantage compounds dramatically. In PrismML’s benchmark of 50 ticket-summary-and-assignment tasks, Bonsai 8B completed all 50 in the same time window in which a standard 16-bit 8B model managed only 6, roughly an eightfold throughput advantage.
But perhaps the most forward-looking detail is what current hardware does not yet exploit. The speed and energy gains come primarily from the reduced memory footprint, not from the 1-bit arithmetic itself. Because 1-bit weights allow inference with simple additions instead of multiplications, hardware designed specifically for 1-bit models could push efficiency gains by potentially another order of magnitude.
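The arithmetic claim is easy to see in a toy dot product: when weights are constrained to ±1, every multiply collapses into a sign-conditional add or subtract. A minimal illustration of the principle (not PrismML’s actual kernel):

```python
def dot_fp(x, w):
    """Standard dot product: one multiply per weight."""
    return sum(xi * wi for xi, wi in zip(x, w))

def dot_1bit(x, signs):
    """1-bit dot product: weights are +1/-1, so each multiply
    reduces to adding or subtracting the activation."""
    return sum(xi if s else -xi for xi, s in zip(x, signs))

x = [0.5, -1.0, 2.0, 0.25]     # activations
w = [1, -1, 1, -1]             # 1-bit weights, written out as +/-1
signs = [wi > 0 for wi in w]   # in practice stored as single bits

assert dot_fp(x, w) == dot_1bit(x, signs)   # both give 3.25
```

Dedicated silicon could exploit exactly this: adders are far cheaper than multipliers in area and energy, which is why the article suggests another order of magnitude may be available once hardware catches up.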
From Caltech Research to Khosla-Backed Startup
PrismML emerged from research developed at Caltech by Professor Babak Hassibi, who serves as CEO and founder. The company is backed by Khosla Ventures, Cerberus Ventures, and compute grants from Google and Caltech. The models were trained on Google v4 TPUs.
“AI’s future will not be defined by who can build the largest data centers,” said Vinod Khosla, founder of Khosla Ventures. “It will be defined by who can deliver the most intelligence per unit of energy and cost. PrismML represents that kind of breakthrough.”
Hassibi framed the work as a new paradigm: “We spent years developing the mathematical theory required to compress a neural network without losing its reasoning capabilities. We see 1-bit not as an endpoint, but as a starting point.”
Ion Stoica, Databricks co-founder and UC Berkeley professor, noted the systems-level impact: “Reducing models to 1-bit representations changes the optimization equation. It enables a new class of AI systems that can both operate efficiently at the edge and scale economically in the cloud.”
What Becomes Possible
When advanced models become small enough to run locally, the design space for AI products changes fundamentally:
- Responsiveness — On-device inference eliminates network latency entirely
- Privacy — Sensitive data never leaves the device or crosses organizational boundaries
- Reliability — Applications work without constant cloud connectivity
- Economics — AI becomes viable in settings where server-side deployment was too expensive
PrismML points to entirely new categories that open up: persistent on-device agents, real-time robotics control, secure enterprise copilots, offline intelligence for remote or constrained environments, and AI-native products built for places where bandwidth, power, or compliance previously made advanced models impractical.
The Bigger Picture: The Efficiency Turn in AI
PrismML’s launch fits into a broader inflection point in AI development. For years, the dominant strategy was to scale up — more parameters, more GPUs, more power. That approach produced remarkable capabilities but created a structural problem: the most capable intelligence became locked inside massive data center clusters.
The efficiency turn — prioritizing intelligence per unit of compute, energy, and cost — is now reshaping the competitive landscape. PrismML’s 1-bit models, Google’s TurboQuant compression research, and the broader trend toward distilled and quantized models all point in the same direction: the next major leaps in AI may come not from building bigger, but from building smarter.
If 1-bit models can deliver competitive reasoning with one-fourteenth the memory and roughly one-fifth the energy on hardware that was never designed for them, the question becomes what happens when the hardware catches up. When chips are built for 1-bit inference — replacing multiplications with additions — the efficiency curve could bend sharply again.
For now, an 8-billion parameter model runs on a phone. The data center is no longer the only place where intelligence lives.
SOURCES
- PrismML — Announcing 1-bit Bonsai: The First Commercially Viable 1-bit LLMs
- PrismML — Press Release: World’s First 1-Bit AI Model
- Switas Consultancy — The Future of AGI: 5 Breakthroughs Defining April 2026