Comment by sanchitmonga22

3 months ago

Fair criticism. Our benchmarks are on small models because MetalRT was built for the voice pipeline use case, where decode latency on 0.6B-4B models is the bottleneck.

You're right that the bigger opportunity on Apple Silicon is large models that don't fit on consumer GPUs. Expanding MetalRT to 7B, 14B, and 32B+ models is on the roadmap. MetalRT's architectural advantages should matter even more at that scale, where everything becomes memory-bandwidth-bound.
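
To make the memory-bandwidth point concrete, here is a rough back-of-envelope sketch, not a MetalRT measurement. It assumes batch-1 decode on a dense model has to read every weight once per token, so bandwidth divided by weight size gives a ceiling on tokens/s. The 400 GB/s bandwidth and 4-bit (0.5 bytes/param) quantization are illustrative assumptions, and the function name is made up for this example.

```python
def decode_tokens_per_sec_ceiling(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on batch-1 decode speed if every weight is read once per token."""
    # 1e9 params * bytes per param / 1e9 bytes per GB = size in GB
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb


# Illustrative assumptions, not tied to a specific chip or to MetalRT.
ASSUMED_BANDWIDTH_GB_S = 400.0
ASSUMED_BYTES_PER_PARAM = 0.5  # roughly 4-bit quantized weights

for size in (0.6, 4, 7, 14, 32):
    ceiling = decode_tokens_per_sec_ceiling(
        size, ASSUMED_BYTES_PER_PARAM, ASSUMED_BANDWIDTH_GB_S
    )
    print(f"{size:>5}B params: <= {ceiling:7.1f} tokens/s")
```

This ignores KV cache reads, activations, and compute, so real throughput will be lower. The point is only that the ceiling drops in proportion to model size, which is why how efficiently a runtime uses bandwidth matters more as models get bigger.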

We'll publish benchmarks on larger models as we add support. If there's a specific model or size you'd want to see first, let us know; that helps us prioritize.