Open Source · March 20, 2026 · 8 min read

How We Built 2x Faster AI Inference on Apple Silicon

Meet momo-kibidango: An open-source project from ReillyDesignStudio that accelerates local AI on Mac with zero quality loss.

Last year, we faced a problem: our AI tools were slow.

Not "waiting a second" slow. Slow in ways that frustrated us. Slow in ways that made us think there had to be a better approach.

We tried everything:

  • Larger models (slower)
  • Cloud APIs (expensive and latency-heavy)
  • Local inference (12.5 tokens per second feels like watching paint dry)
  • Optimizations (diminishing returns)

Then we discovered something called speculative decoding — a technique from Google Research that seemed too good to be true. So we built it. We tested it. We optimized it for Apple Silicon.

Today, we're releasing momo-kibidango, an open-source implementation that achieves 2x faster inference on Apple Silicon with zero quality loss.

And we're doing it publicly, so you can use it too.

The Problem: Speed vs. Quality

Here's what most people don't realize about AI inference: it's not just about throughput.

Throughput ≠ Utility

Running a large language model on your Mac generates tokens fast once it's loaded. The catch is that initial load: 2-5 minutes just to initialize.

Cloud APIs start instantly (no load time), but they're expensive. A thousand API calls get pricey, and network latency becomes your bottleneck.

  • Local inference: Fast per-token, but slow first response
  • Cloud inference: Instant first response, but expensive at scale
  • What we needed: Both — instant response AND cheap operation

After diving into benchmarks, we realized the answer wasn't faster hardware. It was smarter inference.
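"Throughput ≠ utility" is easy to make concrete: what matters per request is amortized throughput, which counts the load time as well as the generation time. A minimal sketch, using the post's own illustrative numbers (12.5 tok/sec locally, a roughly 3-minute model load):

```python
def effective_tok_per_sec(tokens: int, rate: float, load_s: float) -> float:
    """Amortized tokens/sec for a single request, counting model load."""
    return tokens / (load_s + tokens / rate)

# A 500-token response at 12.5 tok/sec after a 180 s model load:
print(round(effective_tok_per_sec(500, 12.5, 180.0), 2))  # 2.27 tok/s
```

The per-token rate looks fine in isolation, but the cold start drags the effective rate down by more than 5x, which is why faster hardware alone doesn't fix the experience.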

The Solution: Speculative Decoding

Speculative decoding is elegant and simple:

  1. Draft phase: A fast model generates candidate tokens quickly
  2. Verify phase: A powerful model checks whether those candidates are correct
  3. Accept or reject: Keep the candidates the powerful model agrees with (typically ~80-90%)
  4. Refine: Generate correct tokens to replace the rejected ones
  5. Return: Combine the results in the correct order

You get the powerful model's quality at roughly twice the speed.
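The five steps above can be sketched in a few lines of Python. This is a toy greedy-acceptance version, not the library's actual implementation: `draft_model` and `target_model` are stand-in functions that map a token sequence to the next token.

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    # 1. Draft phase: the fast model proposes k candidate tokens.
    draft_seq = list(prefix)
    candidates = []
    for _ in range(k):
        token = draft_model(draft_seq)
        candidates.append(token)
        draft_seq.append(token)

    # 2-3. Verify phase: the powerful model checks candidates in order
    # and accepts them until the first disagreement.
    accepted = list(prefix)
    for token in candidates:
        if target_model(accepted) == token:
            accepted.append(token)
        else:
            # 4. Refine: substitute the powerful model's own token.
            accepted.append(target_model(accepted))
            break
    else:
        # Every candidate accepted: the verify pass yields one bonus token.
        accepted.append(target_model(accepted))

    # 5. Return the combined sequence.
    return accepted
```

With greedy matching like this, the output is token-for-token identical to what the powerful model alone would have produced, which is where the "zero quality loss" guarantee comes from: the draft model only affects speed, never the result.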

Real-World Impact

Let's say you ask for code review of a Swift file (500 tokens of response):

| Approach        | Time   | Quality | Cost    |
|-----------------|--------|---------|---------|
| Local inference | 10 sec | 6/10    | $0.00   |
| Cloud GPU       | 23 sec | 9/10    | $0.0004 |
| momo-kibidango  | 6 sec  | 9/10    | $0.0002 |

Nearly 4x faster than the cloud GPU and 1.7x faster than plain local inference. Same quality as the cloud. Half the cost.
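The speedup and cost ratios fall straight out of the table above (times in seconds, costs in dollars per request):

```python
# (time_sec, cost_usd) per approach, from the comparison table.
local, cloud, momo = (10, 0.0), (23, 0.0004), (6, 0.0002)

print(f"vs local: {local[0] / momo[0]:.2f}x")        # 1.67x
print(f"vs cloud: {cloud[0] / momo[0]:.2f}x")        # 3.83x
print(f"cloud cost ratio: {cloud[1] / momo[1]:.0f}x")  # 2x
```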

Why We Open-Sourced It

We could have kept this to ourselves — it's a competitive advantage.

But here's what we believe: infrastructure tools are better when they're open.

CUDA dominated because everyone could use it, contribute to it, and understand it. An entire ecosystem grew up around it.

We're building momo-kibidango the same way:

  • Developers can integrate it into their projects
  • Researchers can extend it
  • The community can verify it works
  • Everyone benefits as it improves

Closed tools build moats. Open tools build ecosystems.

We'd rather build an ecosystem.

The Tech Stack

momo-kibidango implements Google Research's Pyramid Speculative Decoding:

Three-Tier Model Stack

  • Tier 1 (Draft): Claude Haiku 2 at 45.6 tok/sec
  • Tier 2 (Verify): Claude Haiku 3 at 30.5 tok/sec
  • Tier 3 (Authority): Claude Sonnet 3.5 at 12.5 tok/sec

Key: Smart caching across all three models. Memory efficient. Elegant. Fast.
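One way to picture the pyramid plus caching, as a hedged sketch rather than the project's real code: the draft tier proposes, the verify tier accepts agreements, the authority tier settles the first disagreement, and each tier memoizes its answer per prefix so re-checking an unchanged prefix is free. `make_cached` and `pyramid_step` are hypothetical names for illustration.

```python
from functools import lru_cache

def make_cached(model_fn):
    """Memoize a model's next-token answer per prefix, so repeated
    verification of an unchanged prefix costs nothing."""
    @lru_cache(maxsize=4096)
    def cached(prefix: tuple):
        return model_fn(prefix)
    return cached

def pyramid_step(draft, verify, authority, prefix: tuple, k: int = 4):
    """One pyramid round: draft proposes up to k tokens, verify accepts
    agreements, and authority arbitrates the first disagreement."""
    seq = list(prefix)
    for _ in range(k):
        candidate = draft(tuple(seq))
        if verify(tuple(seq)) == candidate:
            seq.append(candidate)               # cheap agreement path
        else:
            seq.append(authority(tuple(seq)))   # authority arbitrates
            break
    return seq
```

The design idea is that the expensive authority model only runs on the small fraction of tokens where the cheaper tiers disagree, which is what keeps the whole stack fast without sacrificing the top model's judgment.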

Memory Requirements

  • 11.6 GB sustained (fits in 16 GB Macs)
  • M1/M2/M3/M4 compatible
  • No GPU required
  • Works offline

How We Built It

Research Phase (Dec 2025 - Feb 2026)

  • 25+ academic papers on speculative decoding
  • 5+ implementations analyzed
  • Google's Orion architecture evaluated
  • Custom Apple Silicon optimizations

Implementation Phase (Feb - Mar 2026)

  • Core inference engine
  • OpenClaw integration
  • 15-scenario benchmarking
  • Latency optimization

Testing Phase (Mar 2026)

  • 1.97x speedup validated
  • Zero quality loss verified
  • Edge cases & fallback scenarios tested

Total: ~300 engineering hours to production-ready
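Validating a speedup claim like 1.97x only needs a small harness: time the baseline and the accelerated path over several runs and compare medians. This is a generic sketch, not the project's benchmark suite; the workloads below are placeholders for real inference calls.

```python
import statistics
import time

def bench(fn, *, runs=5):
    """Median wall-clock seconds over several runs (median resists outliers)."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Speedup = baseline time / accelerated time. Placeholder workloads here;
# swap in the plain and accelerated inference calls being compared.
baseline = bench(lambda: sum(range(200_000)))
faster = bench(lambda: sum(range(100_000)))
print(f"speedup: {baseline / faster:.2f}x")
```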

Getting Started

Install and try it:

```shell
pip install momo-kibidango
momo-kibidango test
```

Then, from Python:

```python
from momo_kibidango import AcceleratedInference

inference = AcceleratedInference(model="claude-sonnet")
response = inference.generate("Review this code for bugs")
# ~6 seconds, full-model quality, a fraction of cloud cost
```

Full documentation: https://momo-kibidango.org

What's Next

v1.0.0 is production-ready. Coming:

  • GPU support
  • Training optimization
  • Advanced tools (profiler, memory analyzer)
  • More hardware optimization
  • Integration templates

We use this in production. We're committed to maintaining it.

Our Philosophy at ReillyDesignStudio

  • Deep expertise. First principles, not wrappers
  • Open contribution. Infrastructure belongs in the open (MIT licensed)
  • Real-world focus. Production benchmarks matter most
  • Humble optimization. A good solution today, open to community improvement

Problems worth solving are worth solving in public.

Try It Today

Let the fast model draft. Let the smart model verify. Combine them. Get both speed and quality.

That's momo-kibidango. That's what we're proud to share.

Try it. Use it. Improve it. Let's build better AI infrastructure together. 🍑

Robert Reilly

CEO & Founder, ReillyDesignStudio

Passionate about AI infrastructure that works in the real world