Introducing momo-kiji: CUDA for Apple Neural Engine
We're building the open-source SDK for the Apple Neural Engine, aiming to do for the ANE what CUDA did for GPU computing.
Apple Neural Engine has been around since 2017. It powers:
- Every Mac since M1
- Every modern iPhone and iPad
- Every Apple TV
Yet there's still no official SDK to use it.
Developers who want to harness ANE fall back to CoreML (a black box), reverse-engineer private APIs (undocumented and risky), or hope their models just work (they don't, consistently).
We're fixing that. Today, we're introducing momo-kiji.
The Problem
Imagine if NVIDIA released GPUs with no CUDA SDK. Developers would be frustrated, performance would be mysterious, and most of the GPU's potential would go untapped.
That's the ANE situation today.
The numbers are staggering:
- ANE is 10-100x more efficient than GPUs for inference
- Billions of Apple devices have ANE
- Yet developers have no standard way to use it
Why This Matters
- Efficiency: ANE runs inference at a fraction of the power a GPU would draw
- Privacy: Models run locally; user data never leaves the device
- Accessibility: Advanced ML reaches consumer hardware at scale
- Economics: On-device inference is becoming a billion-dollar market
The Current State
Today, developing for ANE is like the wild west:
- CoreML gives you limited control (it's a black box)
- Power users reverse-engineer private APIs (against the terms of service)
- Researchers publish findings scattered across papers and GitHub repos
- Performance varies wildly (up to 20x, and no one knows why)
This is unacceptable.
Introducing momo-kiji
momo-kiji is an open-source framework that brings clarity, control, and community to ANE development.
We're building what Apple hasn't: a unified, open SDK for the Apple Neural Engine.
What We're Building
- High-Level API: familiar to ML developers
- Open Intermediate Representation: a documented IR standard for ANE
- Compiler Framework: transparent, with optimization passes
- Debugging & Profiling Tools: built in, for understanding performance
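To make the IR and compiler ideas concrete, here is a toy sketch in plain Python of what an open intermediate representation with a documented optimization pass could look like. Everything here — the `Op` structure, the op names, the `fuse_mul_add` pass — is illustrative only, not the actual momo-kiji design:

```python
from dataclasses import dataclass

# Toy IR: a program is a flat list of ops over named tensors.
# Op names and fields are illustrative, not a real momo-kiji spec.
@dataclass(frozen=True)
class Op:
    name: str       # e.g. "mul", "add", "fma"
    inputs: tuple   # names of input tensors
    output: str     # name of the output tensor

def fuse_mul_add(program):
    """Example optimization pass: fold a mul feeding an add into one
    fused multiply-add, mirroring the FMA units common in NPUs."""
    fused, i = [], 0
    while i < len(program):
        cur = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if (cur.name == "mul" and nxt is not None
                and nxt.name == "add" and cur.output in nxt.inputs):
            other = next(t for t in nxt.inputs if t != cur.output)
            fused.append(Op("fma", cur.inputs + (other,), nxt.output))
            i += 2  # consumed both ops
        else:
            fused.append(cur)
            i += 1
    return fused

# y = a * b; z = y + c   →   z = fma(a, b, c)
prog = [Op("mul", ("a", "b"), "y"), Op("add", ("y", "c"), "z")]
optimized = fuse_mul_add(prog)
```

The point isn't this particular pass; it's that with a documented IR, passes like this become ordinary, inspectable code rather than opaque behavior inside CoreML.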
Why Now
- Latest Research: The Orion paper reveals ANE internals in unprecedented detail
- Community Hunger: Developers want better tools and clear documentation
- Hardware Evolution: M5 with distributed accelerators changes the game
- Market Timing: On-device AI is exploding
The Dream
In five years, developers will write ANE kernels in Python, profile them with standard tools, and achieve 10x speedups.
Just like CUDA users do today.
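As a thought experiment, that authoring experience might feel like the sketch below. The `kernel` decorator stands in for a hypothetical momo-kiji compiler; the API shape is invented for illustration, and the stub simply runs the function elementwise on the CPU:

```python
# Hypothetical authoring experience: in a real SDK this decorator would
# compile the function for the ANE. Here it is a pure-Python stub that
# maps the function elementwise over its input arrays on the CPU.
def kernel(fn):
    def launch(*arrays):
        return [fn(*elems) for elems in zip(*arrays)]
    return launch

@kernel
def saxpy(a, x, y):
    # Elementwise a*x + y, the classic hello-world kernel from CUDA.
    return a * x + y

result = saxpy([2.0, 2.0], [1.0, 3.0], [10.0, 20.0])  # [12.0, 26.0]
```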
Getting Involved
This is Phase 1 (research & documentation). We're publishing everything — specs, design proposals, and roadmaps.
Bob Reilly created momo-kiji to democratize Apple Neural Engine development.