Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon

3 months ago (github.com)

Hi HN, we're Sanchit and Shubham (YC W26). We built a fast inference engine for Apple Silicon. LLMs, speech-to-text, text-to-speech – MetalRT beats llama.cpp, Apple's MLX, Ollama, and sherpa-onnx on every modality we tested. Custom Metal shaders, no framework overhead.

Also, we've open-sourced RCLI, the fastest end-to-end voice AI pipeline on Apple Silicon. Mic to spoken response, entirely on-device. No cloud, no API keys.

To get started:

  brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
  brew install rcli
  rcli setup   # downloads ~1 GB of models
  rcli         # interactive mode with push-to-talk

Or:

  curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

The numbers (M4 Max, 64 GB, reproducible via `rcli bench`):

LLM decode – 1.67x faster than llama.cpp, 1.19x faster than Apple MLX (same model files):

- Qwen3-0.6B: 658 tok/s (vs mlx-lm 552, llama.cpp 295)
- Qwen3-4B: 186 tok/s (vs mlx-lm 170, llama.cpp 87)
- LFM2.5-1.2B: 570 tok/s (vs mlx-lm 509, llama.cpp 372)
- Time-to-first-token: 6.6 ms
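To read the per-model figures as ratios, here is a minimal sketch that only divides the tok/s numbers quoted above. The headline multipliers may be aggregated differently (that is not spelled out here), so treat this as a per-model view rather than a reproduction of them. The `speedup` helper is just an illustrative name.

```python
# Decode throughput in tokens/second, copied from the figures quoted above.
results = {
    "Qwen3-0.6B":  {"metalrt": 658, "mlx-lm": 552, "llama.cpp": 295},
    "Qwen3-4B":    {"metalrt": 186, "mlx-lm": 170, "llama.cpp": 87},
    "LFM2.5-1.2B": {"metalrt": 570, "mlx-lm": 509, "llama.cpp": 372},
}


def speedup(row, baseline):
    """Return MetalRT throughput divided by the baseline engine's throughput."""
    return row["metalrt"] / row[baseline]


for model, row in results.items():
    print(f"{model}: {speedup(row, 'mlx-lm'):.2f}x vs mlx-lm, "
          f"{speedup(row, 'llama.cpp'):.2f}x vs llama.cpp")
```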

STT – 70 seconds of audio transcribed in *101 ms*. That's 714x real-time. 4.6x faster than mlx-whisper.

TTS – 178 ms synthesis. 2.8x faster than mlx-audio and sherpa-onnx.

We built this because demoing on-device AI is easy but shipping it is brutal. Voice is the hardest test: you're chaining STT, LLM, and TTS sequentially, and if any stage is slow, the user feels it. Most teams fall back to cloud APIs not because local models are bad, but because local inference infrastructure is.

The thing that's hard to solve is latency compounding. In a voice pipeline, you're stacking three models in sequence. If each adds 200ms, you're at 600ms before the user hears a word, and that feels broken. You can't optimize one stage and call it done. Every stage needs to be fast, on one device, with no network round-trip to hide behind.
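The compounding is plain addition, but it shows why speeding up a single stage is not enough. A minimal sketch; the 200 ms and 20 ms stage latencies are illustrative assumptions, not measurements:

```python
def time_to_first_audio_ms(stage_latencies_ms):
    """Stages run one after another, so the user waits for their sum."""
    return sum(stage_latencies_ms.values())


uniform = {"stt": 200, "llm": 200, "tts": 200}   # every stage adds 200 ms
one_fast = {"stt": 20, "llm": 200, "tts": 200}   # only STT was optimized

print(time_to_first_audio_ms(uniform))   # 600
print(time_to_first_audio_ms(one_fast))  # 420
```

Even a 10x faster STT stage still leaves the user waiting 420 ms, because the other two stages dominate.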

We went straight to Metal. Custom GPU compute shaders, all memory pre-allocated at init (zero allocations during inference), and one unified engine for all three modalities instead of stitching separate runtimes together.

MetalRT is the first engine to handle all three modalities natively on Apple Silicon. Full methodology:

LLM benchmarks: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...

Speech benchmarks: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...

How: Most inference engines add layers between you and the GPU: graph schedulers, runtime dispatchers, memory managers. MetalRT skips all of it. Custom Metal compute shaders for quantized matmul, attention, and activation - compiled ahead of time, dispatched directly.

Voice pipeline optimization details: https://www.runanywhere.ai/blog/fastvoice-on-device-voice-ai...

RAG optimizations: https://www.runanywhere.ai/blog/fastvoice-rag-on-device-retr...

RCLI is the open-source voice pipeline (MIT) built on MetalRT: three concurrent threads with lock-free ring buffers, double-buffered TTS, 38 macOS actions by voice, local RAG (~4 ms over 5K+ chunks), 20 hot-swappable models, and a full-screen TUI with per-op latency readouts. Falls back to llama.cpp when MetalRT isn't installed.
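For readers unfamiliar with the ring-buffer idea, here is a rough Python illustration of a single-producer/single-consumer queue of the kind that could sit between two pipeline threads (say, STT feeding the LLM). This is not RCLI's code: the class and function names are hypothetical, and a native implementation would use atomic loads and stores rather than relying on CPython's interpreter lock.

```python
import threading
import time


class SpscRingBuffer:
    """Fixed-capacity ring buffer for exactly one producer and one consumer.

    Only the producer writes `_tail` and only the consumer writes `_head`,
    so neither side takes a mutex.
    """

    def __init__(self, capacity):
        # One slot stays empty so "full" and "empty" are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read
        self._tail = 0  # next slot to write

    def try_push(self, item):
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False  # buffer is full
        self._slots[self._tail] = item
        self._tail = nxt  # publish only after the slot is written
        return True

    def try_pop(self):
        if self._head == self._tail:
            return False, None  # buffer is empty
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        return True, item


def run_demo(n=1000):
    """Push 0..n-1 from one thread, pop them in another, return what arrived."""
    ring = SpscRingBuffer(capacity=8)
    received = []

    def producer():
        for i in range(n):
            while not ring.try_push(i):
                time.sleep(0)  # full: yield and retry

    def consumer():
        while len(received) < n:
            ok, item = ring.try_pop()
            if ok:
                received.append(item)
            else:
                time.sleep(0)  # empty: yield and retry

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


print(run_demo(10) == list(range(10)))  # True: items arrive in FIFO order
```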

Source: https://github.com/RunanywhereAI/RCLI (MIT)

Demo: https://www.youtube.com/watch?v=eTYwkgNoaKg

What would you build if on-device AI were genuinely as fast as cloud?

166 comments


stingraycharles 3 months ago

I’m a bit confused by what you’re offering. Is it a voice assistant / AI as described on your GitHub? Or is it more general purpose / LLM ?

How does the RAG fit in, a voice-to-RAG seems a bit random as a feature?

I don’t mean to come across as dismissive, I’m genuinely confused as to what you’re offering.

shubham2802 3 months ago
RunAnywhere builds software that makes AI models run fast locally on devices instead of sending requests to the cloud.
Right now, our focus is Apple Silicon.
Today there are two parts:
MetalRT - our proprietary inference engine for Apple Silicon. It speeds up local LLM, speech-to-text, and text-to-speech workloads. We’re expanding model coverage over time, with more modalities and broader support coming next.
RCLI - our open-source CLI that shows this in practice. You can talk to your Mac, query local docs, and trigger actions, all fully on-device.
So the simplest way to think about us is: we’re building the runtime / infrastructure layer for on-device AI, and RCLI is one example of what that enables.
Longer term, we want to bring the same approach to more chips and device types, not just Apple Silicon.
For people asking whether the speedups are real, we’ve published our benchmark methodology and results here: LLM: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e... Speech: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...
- mirekrusin 3 months ago
  
  From LLM benchmarks it looks like it's better to use open source uzu than RunAnywhere's proprietary inference engine.
  [0] https://github.com/trymirai/uzu
  
- concats 3 months ago
  
  How does it compare for models of any meaningful size?
  These 0.6B-4B models are, frankly, just amusing curiosities, commonly regarded as too error-prone for any non-demo work.
  The reason people are buying Apple Silicon today is that the unified memory allows them to run larger models that are cost-prohibitive to run otherwise (usually requiring Nvidia server GPUs). It would be much more interesting to see benchmarks for things like Qwen3.5-122B-A10B, GLM-5, or any dense model in the 20B+ range. Thanks.
  
- hudtaylor 3 months ago
  
  [dead]
sanchitmonga22 3 months ago

Fair question, let me clarify.
RunAnywhere is an inference company. We build the runtime layer for on-device AI.
There are two pieces:
MetalRT, a proprietary GPU inference engine for Apple Silicon. It runs LLMs, speech-to-text, and text-to-speech faster than anything else available (benchmarks: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...). This is our core product.
RCLI, an open-source CLI (MIT) that demonstrates what MetalRT enables. It wires STT + LLM + TTS into a real voice pipeline with 43 macOS actions, local RAG, and a TUI. Think of it as the reference application built on top of the engine.
On RAG specifically: voice + document Q&A is a natural pairing for on-device use cases. You have sensitive documents you don't want to upload to the cloud, you ingest them locally, and then ask questions by voice. The retrieval runs at ~4ms over 5K+ chunks, so it feels instant in the voice pipeline. It's not random, it's one of the strongest privacy arguments for running everything locally.
The longer-term vision is bringing MetalRT to more chips and platforms, so any developer can get cloud-competitive inference on-device with minimal integration effort.
drcongo 3 months ago

I came to the comments here to see if anyone had worked out what it is, so you're not alone.
glitchc 3 months ago

From the TFA: Document Intelligence (RAG): Ingest docs, ask questions by voice — ~4ms hybrid retrieval.
Seems pretty clear. You can supply documents to the model as input and then verbally ask questions about them.

vessenes 3 months ago

Just tried it. Really cool, and a fun tech demo with rcli. I filed a bug report; not everything is loading properly when installed via homebrew.

Quick request: unsloth quants; bit per bit usually better. Or more generally UI for huggingface model selections. I understand you won't be able to serve everything, but I want to mix and match!

Also - grounding:

"open safari" (safari opens, voice says: "I opened safari")

"navigate to google.com in safari" (nothing happens, voice says: "I navigated to google.com")

Anyway, really fun.

sanchitmonga22 3 months ago

Thanks for trying it and for filing the bug, we're looking into the homebrew install issue.
On unsloth quants: agreed, they're consistently better bit-for-bit. Adding broader quantization format support (including unsloth's approach) is on the roadmap. Right now MetalRT works with MLX 4-bit files and GGUF Q4_K_M, we want to expand that.
On the grounding issue ("navigate to google.com" not actually navigating): you're right, that's a gap. The "open_url" action exists but the LLM doesn't always route to it correctly, especially with compound commands. Small models (0.6B-1.2B) have limited tool-calling accuracy, upgrading to Qwen3.5 4B via rcli upgrade-llm helps significantly. We're also improving the action routing prompts.
Appreciate the detailed feedback, this is exactly what we need.
blks 3 months ago
> "open safari" (safari opens, voice says: "I opened safari") "navigate to google.com in safari" (nothing happens, voice says: "I navigated to google.com")
So you’re describing a core broken feature. Application breaking at easiest test.
- sanchitmonga22 3 months ago
  
  Fair criticism. The action executed on the LLM side but didn't translate to the correct macOS action, the model hallucinated success instead of routing to the open_url tool.
  This is a known limitation with small LLMs (0.6B-1.2B) doing tool calling. They sometimes confuse "I know what you want" with "I did it." Upgrading to a larger model improves tool-calling accuracy significantly.
  We're also working on verification, having the pipeline confirm the action actually succeeded before reporting back. That's a fair expectation and we should meet it.
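One way the verification step could look, as a hedged sketch rather than RCLI's actual design: the spoken reply is built from the result of the action that really ran, never from the model's own claim of success. `open_url` is the tool name mentioned above; `ActionResult`, `report_action`, and `fake_open_url` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ActionResult:
    ok: bool
    detail: str


def report_action(tool_name: Optional[str], args: dict,
                  tools: Dict[str, Callable[..., ActionResult]]) -> str:
    """Build the reply from what actually executed, not from the LLM's text."""
    if tool_name is None:
        # The model produced no tool call, so nothing can have happened.
        return "I couldn't map that to an action."
    tool = tools.get(tool_name)
    if tool is None:
        return f"I don't have an action named {tool_name}."
    result = tool(**args)
    if not result.ok:
        return f"I tried {tool_name}, but it failed: {result.detail}"
    return f"Done: {result.detail}"


def fake_open_url(url: str) -> ActionResult:
    """Stand-in for a real macOS action; only validates the URL shape."""
    if not url.startswith(("http://", "https://")):
        return ActionResult(False, f"invalid URL {url!r}")
    return ActionResult(True, f"opened {url}")


tools = {"open_url": fake_open_url}
print(report_action("open_url", {"url": "https://google.com"}, tools))
print(report_action(None, {}, tools))
```

With this shape, the "navigate to google.com" case above would end in "I couldn't map that to an action." instead of a false success.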
  
Tacite 3 months ago
How did you try it? You said on github it doesn't work.
- wlesieutre 3 months ago
  
  They said it didn't work installed from homebrew, so I assume they went back and did the curl | bash install option
  
- vessenes 3 months ago
  
  It loads after those errors. Tap space and talk to it.

jonhohle 3 months ago

If I send a Portfile patch, would you consider MacPorts distribution?

halostatue 3 months ago
You're welcome to add me as a co-maintainer on this if you submit it to macports/macports-ports:
{macports.halostatue.ca:austin @halostatue}
I maintain https://github.com/macports/macports-ports/blob/master/sysut... amongst other things regularly.
sanchitmonga22 3 months ago

Absolutely, we'd welcome a Portfile contribution. Happy to review and merge. If halostatue wants to co-maintain, even better.
Feel free to open a PR or issue on the RCLI repo and we'll coordinate.
AmanSwar 3 months ago

yes please

mhamann 3 months ago

Can you help me understand MetalRT a bit more? Based on the name, it sounds like something that's Apple-only (although, Apple basically co-opted the name Metal, which was traditionally more generic). Does or will MetalRT run on more platforms?

What about MetalRT's relationship to llama.cpp, onnx, MLX, transformers, etc? Is MetalRT a replacement for those? Designed to be compatible with a wide variety of model formats? Or are you just providing an abstraction on top of these?

AmanSwar 3 months ago

MetalRT is a Metal-only inference engine (we are building for other hardware too). Think of it like SGLang or vLLM, but for single-batch inference on Apple Silicon. See this blog post: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...

tristor 3 months ago

> What would you build if on-device AI were genuinely as fast as cloud?

I think this has to be the future for AI tools to really be truly useful. The things that are truly powerful are not general purpose models that have to run in the cloud, but specialized models that can run locally and on constrained hardware, so they can be embedded.

I'd love to see this able to be added in-path as an audio passthrough device so you can add on-device native transcriptioning into any application that does audio, such as in video conferencing applications.

sanchitmonga22 3 months ago
This is a great idea. A virtual audio device that sits in the path of any audio stream and provides live transcription, that would be huge for video conferencing, lectures, podcasts.
MetalRT's STT numbers make this feasible: 70 seconds of audio transcribed in 101ms means you could process audio chunks in real-time with massive headroom. The latency would be imperceptible.
We haven't built this yet but it's a compelling use case. CoreAudio supports virtual audio devices (aggregate devices) that could pipe audio through the pipeline. If anyone in this thread has experience building macOS audio HAL plugins and wants to collaborate, we're very open to contributions, RCLI is MIT.
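A rough sketch of the headroom argument, using the rounded 70 s / 101 ms figures from this thread (roughly 690x real-time). It assumes processing time scales linearly with audio length plus a fixed per-call overhead; both the linear scaling and the 10 ms overhead are assumptions for illustration, and short chunks will not match them exactly.

```python
audio_s = 70.0        # clip length quoted above
processing_s = 0.101  # transcription time quoted above

# Seconds of audio handled per second of compute.
rtf = audio_s / processing_s


def chunk_budget(chunk_s, speed_factor, fixed_overhead_s=0.0):
    """Return (estimated processing time, idle time left) for one chunk."""
    processing = chunk_s / speed_factor + fixed_overhead_s
    return processing, chunk_s - processing


proc, idle = chunk_budget(chunk_s=1.0, speed_factor=rtf, fixed_overhead_s=0.010)
print(f"~{rtf:.0f}x real-time; 1 s chunk: "
      f"{proc * 1000:.1f} ms busy, {idle * 1000:.1f} ms idle")
```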
- tristor 3 months ago
  
  Something that could be possible is serving the model as a virtual audio device and then you can use existing tools on macOS like Rogue Amoeba's Loopback to direct audio to split to that virtual device and your other output (you'd configure your Loopback device as the output in your system audio settings).
  I have never written audio drivers on macOS, but maybe something worth exploring to see if I can make this happen. I really appreciate high quality AI transcripts in my meetings, but right now only Webex has good transcriptioning, and a lot of meetings use other services like MS Teams, Zoom, Meet, et al.

DetroitThrow 3 months ago

Wow, this is such a cool tool, and love the blog post. Latency is killer in the STT-LLM-TTS pipeline.

Before I install, is there any telemetry enabled here or is this entirely local by default?

shubham2802 3 months ago

Fully local - no data is collected!!
bigyabai 3 months ago

[flagged]

mips_avatar 3 months ago

Have you tried any really big models on a mac studio? I'm wondering what latency is like for big qwens if there's enough memory.

sanchitmonga22 3 months ago
Not yet with MetalRT, right now we support models up to ~4B parameters (Qwen3 4B, Llama 3.2 3B, LFM2.5 1.2B). These are optimized for the voice pipeline use case where decode speed and latency matter more than model size.
Expanding to larger models (7B, 14B, 32B) on machines with more unified memory is on the roadmap. The Mac Studio with 192GB would be an interesting target, a 32B model at 4-bit would fit comfortably and MetalRT's architectural advantages (fused kernels, minimal dispatch overhead) should scale well.
What model / use case are you thinking about? That helps us prioritize.
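As a back-of-the-envelope check on the "fits comfortably" estimate: weight memory is roughly parameter count times bits per weight, divided by 8. This sketch assumes exactly 4 bits per weight and ignores quantization scales, the KV cache, and activations, all of which add to the real footprint.

```python
def weight_bytes(params, bits_per_weight):
    """Approximate size of the model weights alone, in bytes."""
    return params * bits_per_weight / 8


GB = 1e9  # decimal gigabytes

for label, params in [("4B", 4e9), ("14B", 14e9), ("32B", 32e9)]:
    print(f"{label}: ~{weight_bytes(params, 4) / GB:.0f} GB of weights at 4-bit")
```

Under these assumptions a 32B model needs about 16 GB for weights, a small fraction of 192 GB of unified memory.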
- mips_avatar 3 months ago
  
  Well it’s just more that I’ve noticed in the agents I’ve built that qwen doesn’t get reliable until around 27b so unless you want to rl small qwen I don’t think I would get much useful help out of it.
  
asimovDev 3 months ago

I am running 80b Qwen coder next 4bit quant MLX version on a 96GB M3 MacBook and it responds quickly, almost immediately. I can fit the model + 128k context comfortably into the memory

rushingcreek 3 months ago

Very cool, congrats! I'm curious how you were able to achieve this given Apple's many undocumented APIs. Does it use private Neural Engine APIs or fully public Metal APIs?

Either way, this is a tremendous achievement and it's extremely relevant in the OpenClaw world where I might not want to have sensitive information leave my computer.

sanchitmonga22 3 months ago

Fully public Metal APIs, no private frameworks, no Neural Engine, no undocumented entitlements.
MetalRT is built on the public Metal API. The performance comes from how we use the GPU, not from accessing anything Apple doesn't document.
We specifically chose to stay on public APIs so that MetalRT works on any Apple Silicon Mac without special entitlements or SIP workarounds. This also means it's App Store compatible for future macOS/iOS distribution.
The results speak for themselves: 1.1-1.19x faster than Apple's own MLX on identical model files, 4.6x faster on STT, 2.8x faster on TTS. Full methodology published here: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...
Appreciate the kind words, the "OpenClaw world" framing is exactly why we built this.

IlikeMadison 3 months ago

You just got your real names flagged and blacklisted on this other website that provides HR and VC partnerships. I'm not sure what you did exactly but your company and people associated with it seem really shady.

shekhar101 3 months ago

Tried this and really liking it so far. Question - is there diarization support in the TUI app or any of the models MetalRT supports? Any plans to add it if not already supported?

shubham2802 3 months ago

Yes, we do have plans to support it.

Tacite 3 months ago

Doesn't work. " zsh: segmentation fault rcli"

esafak 3 months ago
You could share your setup details, on GH if not here, to make it actionable.
- Tacite 3 months ago
  
  I did on Github. This looks vibecoded? EDIT: Dev is using Claude Code as stated in their github updates.
  

Reebz 3 months ago

Do you have plans to port your proprietary library MetalRT to mobile devices? These performance gains would be a boon for privacy-centric mobile applications.

sanchitmonga22 3 months ago

Yes, mobile is our primary offering and it is on the roadmap. The same Metal GPU pipeline that powers MetalRT on macOS maps directly to iOS (same Apple Silicon, same Metal API).
shubham2802 3 months ago

Yes.

shubham2802 3 months ago

It also tries to do some memory management: remembering previous context, plus an auto-compact feature.

Additionally, personality feature - try it out!! Super fun :)

mnafees 3 months ago

Seems like you are leaking an ElevenLabs API key in your web demo. The OpenAI completions endpoint also has the API key in the request header but that seems to already be revoked and is returning a 401.

shubham2802 3 months ago
I am pretty sure we don't have balance. It's a bait :)
- neya 3 months ago
  
  Sorry, but, this is not really a confidence inspiring response. Accepting the mistake and fixing the leak altogether would have been the better way to handle this. This is a developer forum, we all make mistakes. Framing it as bait just sounds like bad PR management.
  How can we trust your product if you can't fulfil basic security 101? Not being harsh but this kind of lax response for a serious mistake is not acceptable to me. Imagine I recommend you to my company and you end up leaking out our credentials and respond with something like this.
  I might be picky here about this, but long term trust starts with accountability.
  All the best on your product launch and cheers.
  

tiku 3 months ago

Personally I'm so disappointed about the state of local AI. Only old models run "decent" but decent is way too slow to be usable.

sanchitmonga22 3 months ago

This is exactly the problem we're trying to solve. The models themselves have gotten surprisingly capable at small sizes, Qwen3.5 4B with 262K context, LFM2 1.2B for fast tool calling, but the inference infrastructure hasn't kept up.
When people say "local AI is too slow," they usually mean the engine is too slow, not the model. A 4B model at 186 tok/s (MetalRT on M4 Max) feels genuinely responsive for interactive chat. The same model at 87 tok/s (llama.cpp) feels sluggish. Same weights, same quality, 2x the speed, that's a usability cliff.
We think the gap between cloud and on-device inference is an infrastructure problem, not a model problem. That's what we're working on.
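To put the 186 vs 87 tok/s difference in wall-clock terms, a small sketch. The 150-token reply length is an assumption for illustration, and time-to-first-token is left at zero for both engines since only decode speed is being compared.

```python
def seconds_to_generate(num_tokens, tokens_per_second, ttft_s=0.0):
    """Time to first token plus steady-state decode time."""
    return ttft_s + num_tokens / tokens_per_second


reply_tokens = 150  # assumed length of a short chat reply

for engine, tps in [("MetalRT", 186), ("llama.cpp", 87)]:
    print(f"{engine}: {seconds_to_generate(reply_tokens, tps):.2f} s "
          f"for {reply_tokens} tokens")
```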

kevo1ution 3 months ago

congrats on the launch. i'm curious how you think about on device AI vs providers that are in the cloud. where do you see the capabilities of models that can run on the phone, and are there inherent advantages to this?

alfanick 3 months ago

I'm not looking for STT->AI->TTS, I'm looking for truly good voice-to-text experience* on Linux (and others). Siri/iOS-Dictation is truly good when it comes to understanding the speech. Something this level on Linux (and others) would be great, yeah always listening, maybe sending the data somewhere, but give me UX - hidden latency, optimizing for first chars recognized - a good (virtual) input device.

coder543 3 months ago
> Siri/iOS-Dictation is truly good when it comes to understanding the speech.
What...? It is terrible, even compared to Whisper Tiny, which was released years ago under an Apache 2.0 license so Apple could have adopted it instantly and integrated it into their devices. The bigger Whisper models are far better, and Parakeet TDT V2 (English) / V3 (Multilingual) are quite impressive and very fast.
I have no idea what would make someone say that iOS dictation is good at understanding speech... it is so bad.
For a company that talks so much about accessibility, it is baffling to me that Apple continues to ship such poor quality speech to text with their devices.
- derefr 3 months ago
  
  Maybe they have exactly the accent iOS dictation was trained to recognize.
- solarkraft 3 months ago
  
  Its quality isn’t great, but it is damn fast and that matters a lot! Whisper doesn’t even work live without hacks.
  
- fragmede 3 months ago
  
  Terrible? It's fine. What's your accent that it's terrible? It even pulls last names from my address book and spells them right.
  
swindmill 3 months ago
Have you tried https://handy.computer ?
- alfanick 3 months ago
  
  Not bad, almost checks all the marks I want. A) Good quality, locally run model, and surprisingly fast and working on my CPU. B) It transcribes after the session is finished (aka stopped push-to-talk, or after stopping the listening). C) Ha nice, post-processing. D) Still not solved, truly realtime transcription with latency hiding - start typing as soon as you recognize sounds (or after some logical pause, i.e. at the end of sentence). E) Written in Rust, with web-browser config ui. F) Global shortcuts are super finicky, doesn't recognize my default "Mic" button, fair enough, let me remap to some unused F24... Doesn't recognize F24 due to missing keycode.
  It's there, doesn't feel native though. Good integration, not great though (Linux Mint/Cinnamon).
sanchitmonga22 3 months ago

Understood, you want dictation, not a chatbot. That's a valid and different use case.
RCLI is Apple Silicon only today because MetalRT is built on Metal. For Linux, the closest thing to what you're describing would be building a virtual input device on top of Whisper or Parakeet (which RCLI supports as STT backends). Parakeet TDT 0.6B has ~1.9% WER, that's very close to production dictation quality.
The missing piece on Linux isn't the model, it's the integration: a daemon that captures mic audio, runs STT with hidden latency (streaming partial results), and injects text as keyboard input. sherpa-onnx (https://github.com/k2-fsa/sherpa-onnx) supports Linux and has streaming STT, it might be the best starting point for what you're after.
We're focused on Apple Silicon for now but broader platform support is on the roadmap.
dajonker 3 months ago

I use voxtype on my Linux machine with parakeet. Super fast and regularly even gets the tech lingo correct. You can configure prompts and keywords to help with that as well.
fragmede 3 months ago
> I'm not looking for STT->AI->TTS, I'm looking for truly good voice-to-text experience
Umm, ah, wait no, uhh yes you are. Unless, hang on, you are possessed with greater umm speech capabilities than most, wait nevermind start over. Unless you never make a mistake while talking, you want AI to take out the "three, wait no four" and just leave the output with "four" from what you actually spoke. Depending on your use case.
- nostrebored 3 months ago
  
  It’s the TTS layer that is weird. I’m in the same boat — speech out is just a much worse modality than text when possible.
  

jonplackett 3 months ago

Really thought this was called Meta IRT and assumed it was just Facebook spyware.

RationPhantoms 3 months ago

This doesn't work on any of the methods I've tried.

shubham2802 3 months ago
Please open an issue if it's not working. I believe you should be able to install it via: curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/in... | bash
- harvenstar 3 months ago
  
  Cool project - been looking for something like this. Just opened a PR with a couple of new macOS actions (empty_trash + toggle_do_not_disturb). Happy to contribute more and have a quick chat if you're open to it.
  

solarkraft 3 months ago

> Powered by MetalRT, a proprietary GPU inference engine

Too bad.

woadwarrior01 3 months ago

> Apple M3 or later required. MetalRT uses Metal 3.1 GPU features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is coming soon. On M1/M2, RCLI automatically falls back to the open-source llama.cpp engine.

So, no support for M5 Neural Accelerators, eh? (Requires Metal 4) ¯\_(ツ)_/¯

sanchitmonga22 3 months ago
Ha, not yet. Metal 4 is interesting and we're keeping an eye on it.
MetalRT currently targets Metal 3.1 GPU compute because that's where we get the most control over the decode pipeline. Neural Engine / ANE is powerful for fixed-shape inference (vision, classification) but autoregressive LLM decode, where you're generating one token at a time with dynamic KV cache, doesn't map as cleanly to ANE today.
That said, if Metal 4 opens up new capabilities that help with sequential token generation or gives better programmable access to the neural accelerator, we'll absolutely look at it. The M5 will be a fun chip to benchmark on.
- woadwarrior01 3 months ago
  
  > Neural Engine / ANE is powerful for fixed-shape inference (vision, classification) but autoregressive LLM decode, where you're generating one token at a time with dynamic KV cache, doesn't map as cleanly to ANE today.
  What does the ANE have to do with this?
  Neural Engine (ANE) and the M5 Neural Accelerator (NAX) are not the same thing. NAX can accelerate LLM prefill quite dramatically, although autoregressive decoding remains memory bandwidth bound.
  I suspect the biggest blocker for Metal 4 adoption is the macOS Tahoe 26 requirement.
  

computerex 3 months ago

Amazing, this is what I am trying to do with https://github.com/computerex/dlgo

sanchitmonga22 3 months ago

Cool, just checked out dlgo. Looks like you're targeting Go bindings for on-device inference? Different approach but same conviction that this should run locally. Happy to compare notes if you want to chat about Metal optimization or pipeline architecture.

brainless 3 months ago

I am interested in MetalRT. I am an indie builder, focused mostly on building products with LLM assistance that run locally. Like: https://github.com/brainless/dwata

I would be interested if MetalRT can be used by other products, if you have some plans for open source products?

sanchitmonga22 3 months ago

Yes, that's the plan. MetalRT will ship as part of the RunAnywhere SDK so other developers can integrate it into their own apps. We're working on making that available. If you want to be in the early access group, drop me a line at founder@runanywhere.ai or open an issue on the RCLI repo. Happy to look at your project.

jaimex2 3 months ago

I don't have a Mac

j45 3 months ago

"Apple M3 or later required. MetalRT uses Metal 3.1 GPU features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is coming soon. On M1/M2, RCLI automatically falls back to the open-source llama.cpp engine."

Tacite 3 months ago
Funny you mention that because on their github they just pushed an update to say that it didn't work on M3 and M4.
- j45 3 months ago
  
  The quote was from the Github page.
- shubham2802 3 months ago
  
  Sorry about that, but this is what the GitHub page says: Apple M3 or later required. MetalRT uses Metal 3.1 GPU features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is coming soon. On M1/M2, RCLI automatically falls back to the open-source llama.cpp engine.

ifh-hn 3 months ago

Faster AI inference on Apple silicon... So not run anywhere then...

sanchitmonga22 3 months ago

Please check our main repo: https://github.com/RunanywhereAI/runanywhere-sdks/
We do run anywhere, hence the name RunAnywhere. MetalRT is the fastest inference engine we made for Apple silicon, and we'll be covering other edge devices as well. All edge devices are about to hit warp speed!

saritekin 3 months ago

finally! it was needed!

jawns 3 months ago

Based on the demo video, the TTS sounds like it's 10 years out of date. I would not enjoy interacting with it.

sanchitmonga22 3 months ago

The default TTS voice (Piper) is a lightweight model optimized for speed over quality. It's fast but yeah, it doesn't sound great.
If you install Kokoro TTS (rcli models > TTS section), the voice quality is dramatically better, it's a neural TTS model with 28 different voices. MetalRT synthesizes Kokoro at 178ms for short responses, so you don't pay a speed penalty for the upgrade.
We should probably make Kokoro the default or at least make the upgrade path more obvious in the first-run experience. Fair feedback.
AmanSwar 3 months ago
It's Kokoro TTS, not ours; we have a range of options.
- shubham2802 3 months ago
  
  We just need a few days to get our catalog of models out!!

focusgroup0 3 months ago

The fact that Apple didn't ship this in years after Siri acquisition is an indictment of its Product leadership

sanchitmonga22 3 months ago

Apple has the silicon, the frameworks (MLX, CoreML), and the models. The gap is putting it all together into a fast, unified on-device pipeline. That's what we're focused on, and honestly, we think Apple will eventually ship something similar natively. Until then, we're trying to show what's possible today on their hardware.
liuliu 3 months ago
This is not different from mlx-lm other than it uses a closed-source inference engine.
- sanchitmonga22 3 months ago
  
  Respectfully, the benchmarks show it is different.
  MetalRT and mlx-lm use the exact same model files, identical 4-bit MLX weights. That makes it a pure engine-to-engine comparison:
  LLM decode: MetalRT is 1.10-1.19x faster across all models tested
  STT: 70s audio in 101ms vs 463ms (4.6x faster)
  TTS: 178ms vs 493ms (2.8x faster)
  mlx-lm is a general-purpose array computation framework that also supports inference. MetalRT is purpose-built for inference only. That focus is where the performance gap comes from.
  You can reproduce these numbers yourself: rcli bench runs the same benchmarks we published. Full methodology: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...
  Yes, MetalRT is closed-source. We're transparent about that. The performance difference is the reason it exists.
- AmanSwar 3 months ago
  
  [dead]

john_strinlai 3 months ago

i knew i recognized this name from somewhere.

they are a company that registers domains similar to their main one, and then uses those domains to spam people they scrape off of github without affecting their main domain reputation.

edit: here is the post https://news.ycombinator.com/item?id=47163885

----

edit2: it appears that RunAnywhere is getting damage-control help by dang or tom.

this comment, at this time, has 23 upvotes yet is below 2 grey comments (i.e. <=0 upvotes) that were posted at roughly the same time (1 before, 1 after) -- strong evidence of artificial ordering by the moderators. gross.

Imustaskforhelp 3 months ago
Yup. The craziest aspect was that they had bought the domain intentionally, just 1 month prior to that whole fiasco.
Maybe it's just the two of us (n=2) who remember this fiasco, but I don't think so. I don't really understand how this got so many upvotes in such a short time frame, especially given its history of not doing good things, to say the very least... I am especially skeptical of it.
Thoughts?
Edit: I looked deeper into Sanchit's Hacker News account and found that 3 days ago they posted the same thing, as far as I can tell (the only difference being that it used the runanywhere.ai domain rather than github.com/runanywhere, but this could very well be because on Hacker News you can't post the same link twice in a short period of time, so they are definitely skirting that rule by posting the GitHub link).
Another point: that post (https://news.ycombinator.com/item?id=47283498) has been stuck at 5 points until now (at time of writing).
So this just got a lot crazier, which is actually wild.
- john_strinlai 3 months ago
  
  i unfortunately dont know enough about vote patterns on hn, or what is expected/normal voting behavior.
  what i do know is that their name is etched into my mind under the category of "shady, never do business with them".
  

Imustaskforhelp 3 months ago

I am just gonna link the stats of this Hacker News post[0] and let the public decide the rest. For context, this is the same company that was mentioned in a blow-up post 12 days ago which got 600 upvotes, and they didn't respond back then[1]. (I have found it rare for posts to have such a 2x factor within minutes of posting, that's just my personal observation. Usually one gets it after an hour or two or three.)

I was curious, so I did some more research into the company and found more shady stuff going on, like intentionally buying new domains a month prior to send that spam so that their main domain's mail reputation wouldn't suffer. You can read my comment here[2]

Just to be on the safe side here, @dang (yes, pinging doesn't work, but still), can you give us some average stats on the people who upvoted this, and run an internal investigation into whether botting was done? I can be wrong about it and I don't ever mean to harm any company, but I can't in good faith understand this.

Some stats I would want are: average karma, words written, and creation date of the accounts that upvoted this post. I'd also like to know the conclusion of the internal investigation, if one takes place.

[There is a bit of a conflict of interest with this being a YC product, but I trust the Hacker News moderators and dang to do what's right.]

I am just skeptical, that's all, and this is my opinion. I just want to provide some historical context into this company and I hope that I am not extrapolating too much.

It's just really strange to me, that's all.

[0]: https://news.ycombinator.com/reply?id=47165788

dang 3 months ago
The upvotes on the current post are fine - the reason you saw the submission rise in rank is that startup launch posts by YC startups get special placement on the front page (this is in the FAQ: https://news.ycombinator.com/newsfaq.html). Not every such post does, but some do.
In other words, your perception wasn't wrong, but the interpretation was off. I've put "Launch HN" and "YC W26" back in the title to make that clearer - I edited them out earlier, which was my mistake.
As for the booster comments, those are pretty common on launch threads and often pretty innocent - most people who aren't active HN users have no idea that it's against the rules. We do our best to communicate about that, but it's not a cardinal sin—there are far worse offenses.
- john_strinlai 3 months ago
  
  hi dang. while you are here -- are comments artificially ordered on this post?
  (https://news.ycombinator.com/item?id=47326455). whats up dang?
  edit 3 (~1 hour later): you've responded to a handful of other comments and ignored this one as it becomes more and more evident that someone has artificially ordered the comments to ensure that critical comments are at the bottom of the page. it has shattered my perception of show/launch posts to know that you manually curate the comments to form a specific narrative. i really (naively) thought you guys were much more neutral about that sort of thing.
  
- Imustaskforhelp 3 months ago
  
  Thanks dang, but can you please explain the two accounts that each wrote a very small comment, one being completely new and the other being 7 months old and only ever active in this thread?
  Clearly I am not the only one here, as john_strinlai seems to have reached somewhat the same conclusion as me.
  Dang, I know you care about this community, so can you please say more about what you think of this in particular as well?
  I understand that YC companies get preferential treatment. Fine by me. But this feels like something larger to me.
  I have written up everything I could find in this thread, from the same post being submitted 3 days ago with a runanywhere.ai link to it now being changed to a GitHub link to skirt the HN rule that the same link can't be posted twice in a short period of time.
  This feels somewhat intentional, just like the spam issue. I hope you understand what I mean.
  (If you also feel suspicious, can you please do a basic analysis/investigation with all of these suspicious points in mind and publish the results in an anonymous way if possible?)
  I wish you a nice day and am waiting for your thoughts on all of this.
  

pzo 3 months ago

FWIW only RCLI is MIT-licensed; their engine, MetalRT, is commercial. I'm not sure about the license of their models, but I guess they are also not MIT. So IMHO this repo is misleading.

Not sure why they decided to reinvent the wheel and write yet another ML engine (MetalRT), and a proprietary one at that. I would most likely bet on CoreML, since it has support for the ANE (Apple's NPU), or on MLX.

Other popular repos for such tasks I would recommend:

https://github.com/FluidInference/FluidAudio

https://github.com/DePasqualeOrg/mlx-swift-audio

https://github.com/Blaizzy/mlx-audio

https://github.com/k2-fsa/sherpa-onnx

shubham2802 3 months ago

Updating the readme asap - but thanks for the feedback. Also, please check out a few things: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t... https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...
sanchitmonga22 3 months ago

Fair feedback on the README clarity, we've updated it to make the licensing distinction between RCLI (MIT) and MetalRT (proprietary) more prominent. That should have been clearer from day one.
On why we built MetalRT instead of using CoreML or MLX:
CoreML is optimized for classification and vision models, not autoregressive text generation. ANE is powerful for fixed-shape workloads but doesn't handle the dynamic shapes in LLM decode well.
MLX is much closer to what we need, and we respect what Apple has built. But MLX is a general-purpose array framework: it carries abstractions for developer ergonomics and portability that add overhead. MetalRT is purpose-built for inference only, and the numbers reflect that: 1.1-1.2x faster on LLMs (same model files) and 4.6x faster on STT.
We also needed one unified engine for LLM + STT + TTS rather than stitching three separate runtimes together. That doesn't exist in any of the alternatives listed.
The libraries you mentioned (FluidAudio, mlx-swift-audio, sherpa-onnx) are good projects. RCLI actually uses sherpa-onnx as its fallback engine when MetalRT isn't installed. They solve different problems at different layers of the stack.
antipaul 3 months ago
Nice list.
What about for on-device RAG use cases?
- sanchitmonga22 3 months ago
  
  RCLI includes local RAG out of the box. You can ingest PDFs, DOCX, and plain text, then query by voice or text:
    rcli rag ingest ~/Documents/notes
    rcli ask --rag ~/Library/RCLI/index "summarize the project plan"
  It uses hybrid retrieval (vector + BM25 with Reciprocal Rank Fusion) and runs at ~4ms over 5K+ chunks. Embeddings are computed locally with Snowflake Arctic, so nothing leaves your machine.
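
For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the general technique, not RCLI's actual code. The function name, the chunk IDs, and the constant k=60 (a commonly used default) are all assumptions for illustration. Each chunk's fused score is the sum of 1 / (k + rank) over every ranked list it appears in.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical result lists from the two retrievers.
vector_hits = ["c3", "c1", "c7"]  # order from vector search
bm25_hits = ["c1", "c9", "c3"]    # order from BM25

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Chunks that appear in both lists (c1 and c3 here) rise to the top, because RRF rewards agreement between retrievers using only ranks, so the vector and BM25 scores never need to be put on a common scale.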
AmanSwar 3 months ago

[dead]

7kmph 3 months ago

this is the company that cold emailed many people via email on GitHub.

david_shaw 3 months ago

I think the title should read "RunAnywhere," not "RunAnwhere."

Imustaskforhelp 3 months ago
Dang has changed the title, and it seems he may have made a minor error doing it. Must have been a typo on his side while changing it, and that's okay! I think that Dang will fix it sooner rather than later.
Edit: just reloaded, it's fixed now.
- dang 3 months ago
  
  tomhow fixed it. I had looked at it multiple times and not noticed!

samuel_grupa_ai 3 months ago

[flagged]

dsalzman 3 months ago

[flagged]

iharnoor 3 months ago

[flagged]

Imustaskforhelp 3 months ago

This is a 7 month old account which has only responded to this particular comment.
And sorry to say but I don't think that Lets go!! is a valid comment, this makes me even more suspicious.
Especially given the history and suspicions I already had.

josuediaz 3 months ago

[flagged]

john_strinlai 3 months ago
josuediaz registered 4 minutes ago
iharnoor 1 karma, 1 comment, in this thread.
two posts pointing out their extremely unethical spam behavior both shot down to the very bottom of the post. apparently suspicious voting behavior.
what the hell is going on?
- Imustaskforhelp 3 months ago
  
  Yeah I am wondering the same thing.
  I was gonna comment about this guy and iharnoor, which is a 7 month old account that literally only said "lets go" here.
  This makes me even more suspicious, john, especially iharnoor.
  I wasn't responding because I was making an archive link of all of this, so that even deleted messages have some basis of confirmation.

sidv1711_ 3 months ago

Let's goo!!

brian-armstrong 3 months ago

What kind of self-disrespecting dev is using MacOS in TYOOL 2026?

JSR_FDED 3 months ago

The ones who like using local LLMs
The ones who like top-notch hardware
The ones who build stuff and don’t make a religious issue out of everything
ReaderOfRunes 3 months ago

Unfortunately it's the only laptop some companies provide their developers