ggerganov is my hero, and...
it's a good thing this got posted, so I saw in the comments that --flash-attn --cache-reuse 256 could help with my setup (M3 36GB + RPC to M1 16GB). Figuring out which params to set, and at what values, is a lot of trial and error; Gemini does help a bit in clarifying what params like top-k will do in practice. Still, the whole load-balancing with RPC is something I think I'm going to have to read the llama.cpp source to really understand (oops, I almost wrote grok, damn you Elon). Anyway, Ollama is still not doing distributed load, so yeah, I guess using it is a stepping stone...
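For anyone wanting to try the same thing, here's a minimal sketch of the two-machine setup, assuming a recent llama.cpp build (the model filename, IP address, and port below are placeholders; check your version's --help, since flag spellings change between releases):

    # on the M1 16GB: start the RPC worker, binding to all interfaces (example port)
    ./rpc-server -H 0.0.0.0 -p 50052

    # on the M3 36GB: run the server, offloading part of the model to the M1 over RPC
    ./llama-server -m ./gpt-oss-20b.gguf \
        --rpc 192.168.1.20:50052 \
        --flash-attn --cache-reuse 256 \
        -ngl 99

As I understand it, --cache-reuse 256 lets the server reuse KV-cache chunks of at least 256 tokens across requests instead of reprocessing the whole prompt, which is where the speedup on repeated prompts comes from.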
This is the comment people should read. GG is amazing.
Ollama forked to get it working for day 1 compatibility. They need to get their system back in line with mainline because of that choice. That's kinda how open source works.
The uproar over this (mostly on Reddit and X) seems unwarranted. New models regularly have compatibility issues for much longer than this.
The named anchor in this URL doesn't work in Safari. Safari correctly scrolls down to the comment in question, but then some JavaScript on the page throws you back up to the top again.
I noticed it the other way: llama.cpp failed to load the Ollama-downloaded gpt-oss 20b model. Thought it was odd given all the others I tried worked fine.
Figured it had to be Ollama doing Ollama things, seems that was indeed the case.
GG clearly mentioned that Ollama did not contribute anything upstream.
ggerganov is a treasure. The man deserves a medal.