ggerganov is my hero, and...
it's a good thing this got posted, so I saw in the comments that --flash-attn --cache-reuse 256 could help with my setup (M3 36GB + RPC to M1 16GB). Figuring out which params to set, and at what values, is a lot of trial and error; Gemini does help a bit in clarifying what params like top-k will do in practice. Still, the whole load-balancing with RPC is something I think I'm going to have to read the llama.cpp source to really understand (oops, I almost wrote grok, damn you Elon). Anyway, Ollama is still not doing distributed load, so yeah, I guess using it is a stepping stone...
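For anyone wanting to try the same thing, here's a minimal sketch of the two-machine setup, assuming a recent llama.cpp build (the model filename, IP address, and port below are placeholders; check your version's --help, since flag spellings change between releases):

    # on the M1 16GB: start the RPC worker, binding to all interfaces (example port)
    ./rpc-server -H 0.0.0.0 -p 50052

    # on the M3 36GB: run the server, offloading part of the model to the M1 over RPC
    ./llama-server -m ./gpt-oss-20b.gguf \
        --rpc 192.168.1.20:50052 \
        --flash-attn --cache-reuse 256 \
        -ngl 99

As I understand it, --cache-reuse 256 lets the server reuse KV-cache chunks of at least 256 tokens across requests instead of reprocessing the whole prompt, which is where the speedup on repeated prompts comes from.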
This is the comment people should read. GG is amazing.
Ollama forked to get it working for day 1 compatibility. They need to get their system back in line with mainline because of that choice. That's kinda how open source works.
The uproar over this (mostly on Reddit and X) seems unwarranted. New models regularly have compatibility issues for much longer than this.
The named anchor in this URL doesn't work in Safari. Safari correctly scrolls down to the comment in question, but then some JavaScript on the page throws you back up to the top again.
I noticed it the other way: llama.cpp failed to load the Ollama-downloaded gpt-oss 20b model. Thought it was odd given all the others I tried worked fine.
Figured it had to be Ollama doing Ollama things, seems that was indeed the case.
GG clearly mentioned that Ollama did not contribute anything upstream.
ggerganov is a treasure. The man deserves a medal.