Comment by pickettd
2 days ago
I also want to add that I really appreciate the benchmarks.
When I was doing RAG with llama.cpp through React Native early last year, I got pretty acceptable tok/sec results up through 7-8B quantized models (on phones like the S24+ and iPhone 15 Pro). MLC definitely delivered higher tok/sec, but it is really tough to beat the community support and model availability of the GGUF ecosystem.