Comment by stingraycharles

3 months ago

I’m a bit confused by what you’re offering. Is it a voice assistant / AI as described on your GitHub? Or is it more general purpose / LLM ?

How does the RAG fit in, a voice-to-RAG seems a bit random as a feature?

I don’t mean to come across as dismissive, I’m genuinely confused as to what you’re offering.

RunAnywhere builds software that makes AI models run fast locally on devices instead of sending requests to the cloud.

Right now, our focus is Apple Silicon.

Today there are two parts:

MetalRT - our proprietary inference engine for Apple Silicon. It speeds up local LLM, speech-to-text, and text-to-speech workloads. We’re expanding model coverage over time, with more modalities and broader support coming next.

RCLI - our open-source CLI that shows this in practice. You can talk to your Mac, query local docs, and trigger actions, all fully on-device.

So the simplest way to think about us is: we’re building the runtime / infrastructure layer for on-device AI, and RCLI is one example of what that enables.

Longer term, we want to bring the same approach to more chips and device types, not just Apple Silicon.

For people asking whether the speedups are real, we’ve published our benchmark methodology and results here: LLM: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e... Speech: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...

  • From LLM benchmarks it looks like it's better to use open source uzu than RunAnywhere's proprietary inference engine.

    [0] https://github.com/trymirai/uzu

    • uzu is a strong engine, it beat us on Llama-3.2-3B (222 vs 184 tok/s) and we reported that honestly in our benchmarks.

      But looking at the full picture across all four models tested:

      Qwen3-0.6B: MetalRT 658, uzu 627

      Qwen3-4B: MetalRT 186, uzu 165

      Llama-3.2-3B: uzu 222, MetalRT 184

      LFM2.5-1.2B: MetalRT 570, uzu 550

      MetalRT wins 3 of 4. The bigger difference is that MetalRT also handles STT and TTS natively, uzu is LLM-only. For a voice pipeline where you need all three modalities running on one engine with shared memory management, that matters.

      That said, uzu is great open-source software and worth checking out if your looking for an OSS LLM-only engine on Apple Silicon.

  • How does it compare for models of any meaningful size?

    These 0.6B-4B models are, frankly, just amusing curiosities. But commonly regarded as too error prone for any non-demo work.

    The reason why people are buying Apple Silicon today is because the unified memory allows them to run larger models that are cost prohibitive to run otherwise (usually requiring Nvidia server GPUs). It would be much more interesting to see benchmarks for things like Qwen3.5-122B-A10B, GLM-5, or any dense model is the 20b+ range. Thanks.

    • Agreed. The real value proposition of Apple Silicon for local inference is running models that won't fit on consumer GPUs. I run Qwen 70B 4-bit on an M2 Max 96GB through llama.cpp and it's usable — not fast, but the unified memory means it actually loads. Would be interested to see MetalRT benchmarks at that scale, since the architectural advantages (fused kernels, reduced dispatch overhead) should matter more as models get memory-bandwidth-bound.

    • Fair criticism. Our benchmarks are on small models because MetalRT was built for the voice pipeline use case, where decode latency on 0.6B-4B models is the bottleneck.

      You're right that the bigger opportunity on Apple Silicon is large models that don't fit on consumer GPUs. Expanding MetalRT to 7B, 14B, 32B+ is on the roadmap. The architectural advantages(that MetalRT has) should matter even more at that scale where everything becomes memory-bandwidth-bound.

      We'll publish benchmarks on larger models as we add support. If you have a specific model/size you'd want to see first, that helps us prioritize.

Fair question, let me clarify.

RunAnywhere is an inference company. We build the runtime layer for on-device AI.

There are two pieces:

MetalRT, a proprietary GPU inference engine for Apple Silicon. It runs LLMs, speech-to-text, and text-to-speech faster than anything else available (benchmarks: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...). This is our core product.

RCLI, an open-source CLI (MIT) that demonstrates what MetalRT enables. It wires STT + LLM + TTS into a real voice pipeline with 43 macOS actions, local RAG, and a TUI. Think of it as the reference application built on top of the engine.

On RAG specifically: voice + document Q&A is a natural pairing for on-device use cases. You have sensitive documents you don't want to upload to the cloud, you ingest them locally, and then ask questions by voice. The retrieval runs at ~4ms over 5K+ chunks, so it feels instant in the voice pipeline. Its not random, it's one of the strongest privacy arguments for running everything locally.

The longer-term vision is bringing MetalRT to more chips and platforms, so any developer can get cloud-competitive inference on-device with minimal integration effort.

I came to the comments here to see if anyone had worked out what it is, so you're not alone.

From the TFA: Document Intelligence (RAG): Ingest docs, ask questions by voice — ~4ms hybrid retrieval.

Seems pretty clear. You can supply documents to the model as input and then verbally ask questions about them.