How We Built 2x Faster AI Inference on Apple Silicon
Meet momo-kibidango: An open-source project from ReillyDesignStudio that accelerates local AI on Mac with zero quality loss.
Last year, we faced a problem: our AI tools were slow.
Not "waiting a second" slow. Slow in ways that frustrated us. Slow in ways that made us think there had to be a better approach.
We tried everything:
- • Larger models (slower)
- • Cloud APIs (expensive and latency-heavy)
- • Local inference (12.5 tokens per second feels like watching paint dry)
- • Optimizations (diminishing returns)
Then we discovered something called speculative decoding — a technique from Google Research that seemed too good to be true. So we built it. We tested it. We optimized it for Apple Silicon.
Today, we're releasing momo-kibidango, an open-source implementation that achieves 2x faster inference on Apple Silicon with zero quality loss.
And we're doing it publicly, so you can use it too.
The Problem: Speed vs. Quality
Here's what most people don't realize about AI inference: it's not just about throughput.
Throughput ≠ Utility
Running a large language model on your Mac generates tokens fast — if you wait for it to load first. That initial load? 2-5 minutes just to initialize.
Cloud APIs are instant (no load time), but expensive. A thousand API calls gets pricey, and latency becomes your bottleneck.
- Local inference: Fast per-token, but slow first response
- Cloud inference: Instant first response, but expensive at scale
- What we needed: Both — instant response AND cheap operation
After diving into benchmarks, we realized the answer wasn't faster hardware. It was smarter inference.
The Solution: Speculative Decoding
Speculative decoding is elegant and simple:
- 1. Draft phase: A fast model generates candidate tokens quickly
- 2. Verify phase: A powerful model checks if those candidates are correct
- 3. Accept or reject: Keep ones the powerful model agrees with (~80-90%)
- 4. Refine: Generate correct tokens for rejected ones
- 5. Return: Combine results in correct order
You get 95% quality from the powerful model at 2x the speed.
Real-World Impact
Let's say you ask for code review of a Swift file (500 tokens of response):
| Approach | Time | Quality | Cost |
|---|---|---|---|
| Local Inference | 10 sec | 6/10 | $0.00 |
| Cloud GPU | 23 sec | 9/10 | $0.0004 |
| momo-kibidango | 6 sec | 9/10 | $0.0002 |
2.5x faster than cloud. Same quality. Half the cost.
Why We Open-Sourced It
We could have kept this to ourselves — it's a competitive advantage.
But here's what we believe: infrastructure tools are better when they're open.
CUDA dominated because everyone could use it, contribute to it, understand it. An ecosystem built around it.
We're building momo-kibidango the same way:
- • Developers can integrate it into their projects
- • Researchers can extend it
- • The community can verify it works
- • Everyone benefits as it improves
Closed tools build moats. Open tools build ecosystems.
We'd rather build an ecosystem.
The Tech Stack
momo-kibidango implements Google Research's Pyramid Speculative Decoding:
Three-Tier Model Stack
- Tier 1 (Draft): Claude Haiku 2 — 45.6 tok/sec
- Tier 2 (Verify): Claude Haiku 3 — 30.5 tok/sec
- Tier 3 (Authority): Claude Sonnet 3.5 — 12.5 tok/sec
Key: Smart caching across all three models. Memory efficient. Elegant. Fast.
Memory Requirements
- • 11.6 GB sustained (fits in 16GB Macs)
- • M1/M2/M3/M4 compatible
- • No GPU required
- • Works offline
How We Built It
Research Phase (Dec 2025 - Feb 2026)
- • 25+ academic papers on speculative decoding
- • 5+ implementations analyzed
- • Google's Orion architecture evaluated
- • Custom Apple Silicon optimizations
Implementation Phase (Feb - Mar 2026)
- • Core inference engine
- • OpenClaw integration
- • 15-scenario benchmarking
- • Latency optimization
Testing Phase (Mar 2026)
- • 1.97x speedup validated
- • Zero quality loss verified
- • Edge cases & fallback scenarios tested
Total: ~300 engineering hours to production-ready
Getting Started
pip install momo-kibidango
momo-kibidango test
from momo_kibidango import AcceleratedInference
inference = AcceleratedInference(model="claude-sonnet")
response = inference.generate("Review this code for bugs")
# 6 seconds, 95% quality, fraction of cloud costFull documentation: https://momo-kibidango.org
What's Next
v1.0.0 is production-ready. Coming:
- • GPU support
- • Training optimization
- • Advanced tools (profiler, memory analyzer)
- • More hardware optimization
- • Integration templates
We use this in production. We're committed to maintaining it.
Our Philosophy at ReillyDesignStudio
- Deep expertise. First principles, not wrappers
- Open contribution. Infrastructure belongs in the open (MIT licensed)
- Real-world focus. Production benchmarks matter most
- Humble optimization. Good solution, inviting community improvement
Problems worth solving are worth solving in public.
Try It Today
- GitHub: https://github.com/rdreilly58/momo-kibidango
- Docs: https://momo-kibidango.org
- PyPI:
pip install momo-kibidango - Questions? Open an issue. Have ideas? Send a PR.
Let the fast model draft. Let the smart model verify. Combine them. Get both speed and quality.
That's momo-kibidango. That's what we're proud to share.
Try it. Use it. Improve it. Let's build better AI infrastructure together. 🍑
Robert Reilly
CEO & Founder, ReillyDesignStudio
Passionate about AI infrastructure that works in the real world