Comment by polotics
18 hours ago
ggerganov is my hero, and it's a good thing this got posted, because I saw in the comments that --flash-attn --cache-reuse 256 could help with my setup (M3 36GB + RPC to M1 16GB). Figuring out which params to set, and to what values, is a lot of trial and error; Gemini does help a bit to clarify what params like top-k are going to do in practice. Still, the whole load-balancing with RPC is something I think I'm going to have to read the llama.cpp source to really understand (oops, I almost wrote "grok", damn you Elon). Anyway, ollama is still not doing distributed load, and yeah, I guess using it is a stepping stone...
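For anyone trying a similar two-machine setup, here is a rough sketch of how the flags mentioned above might be combined into one llama-server invocation. It only assembles and prints the command, it does not launch anything. The model path, the IP address and the port are placeholders, and the helper name build_llama_server_cmd is made up for illustration. It assumes the second machine is already running llama.cpp's rpc-server and that the local build accepts --flash-attn without a value, which may differ between llama.cpp versions.

```python
import shlex


def build_llama_server_cmd(model_path, rpc_hosts, cache_reuse=256, flash_attn=True):
    """Assemble a llama-server argument list (hypothetical helper, launches nothing)."""
    cmd = ["llama-server", "-m", model_path]
    if rpc_hosts:
        # --rpc takes a comma-separated list of host:port RPC backends.
        cmd += ["--rpc", ",".join(rpc_hosts)]
    if flash_attn:
        # Written as in the comment; some builds may expect an explicit value here.
        cmd.append("--flash-attn")
    cmd += ["--cache-reuse", str(cache_reuse)]
    return cmd


# Placeholder model path and address for the remote M1 running rpc-server.
cmd = build_llama_server_cmd("model.gguf", ["192.168.1.20:50052"])
print(shlex.join(cmd))
```

Keeping the flags in a small function like this makes the trial and error a bit less painful, since one value can be changed per run and the exact command logged next to the timing results.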