← Back to context

Comment by shubham2802

3 months ago

RunAnywhere builds software that makes AI models run fast locally on devices instead of sending requests to the cloud.

Right now, our focus is Apple Silicon.

Today there are two parts:

MetalRT - our proprietary inference engine for Apple Silicon. It speeds up local LLM, speech-to-text, and text-to-speech workloads. We’re expanding model coverage over time, with more modalities and broader support coming next.

RCLI - our open-source CLI that shows this in practice. You can talk to your Mac, query local docs, and trigger actions, all fully on-device.

So the simplest way to think about us is: we’re building the runtime / infrastructure layer for on-device AI, and RCLI is one example of what that enables.

Longer term, we want to bring the same approach to more chips and device types, not just Apple Silicon.

For people asking whether the speedups are real, we’ve published our benchmark methodology and results here: LLM: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e... Speech: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...

From LLM benchmarks it looks like it's better to use open source uzu than RunAnywhere's proprietary inference engine.

[0] https://github.com/trymirai/uzu

  • uzu is a strong engine, it beat us on Llama-3.2-3B (222 vs 184 tok/s) and we reported that honestly in our benchmarks.

    But looking at the full picture across all four models tested:

    Qwen3-0.6B: MetalRT 658, uzu 627

    Qwen3-4B: MetalRT 186, uzu 165

    Llama-3.2-3B: uzu 222, MetalRT 184

    LFM2.5-1.2B: MetalRT 570, uzu 550

    MetalRT wins 3 of 4. The bigger difference is that MetalRT also handles STT and TTS natively, uzu is LLM-only. For a voice pipeline where you need all three modalities running on one engine with shared memory management, that matters.

    That said, uzu is great open-source software and worth checking out if your looking for an OSS LLM-only engine on Apple Silicon.

How does it compare for models of any meaningful size?

These 0.6B-4B models are, frankly, just amusing curiosities. But commonly regarded as too error prone for any non-demo work.

The reason why people are buying Apple Silicon today is because the unified memory allows them to run larger models that are cost prohibitive to run otherwise (usually requiring Nvidia server GPUs). It would be much more interesting to see benchmarks for things like Qwen3.5-122B-A10B, GLM-5, or any dense model is the 20b+ range. Thanks.

  • Agreed. The real value proposition of Apple Silicon for local inference is running models that won't fit on consumer GPUs. I run Qwen 70B 4-bit on an M2 Max 96GB through llama.cpp and it's usable — not fast, but the unified memory means it actually loads. Would be interested to see MetalRT benchmarks at that scale, since the architectural advantages (fused kernels, reduced dispatch overhead) should matter more as models get memory-bandwidth-bound.

  • Fair criticism. Our benchmarks are on small models because MetalRT was built for the voice pipeline use case, where decode latency on 0.6B-4B models is the bottleneck.

    You're right that the bigger opportunity on Apple Silicon is large models that don't fit on consumer GPUs. Expanding MetalRT to 7B, 14B, 32B+ is on the roadmap. The architectural advantages(that MetalRT has) should matter even more at that scale where everything becomes memory-bandwidth-bound.

    We'll publish benchmarks on larger models as we add support. If you have a specific model/size you'd want to see first, that helps us prioritize.