Comment by rjzzleep
13 hours ago
Georgi responded to some of the issues ollama has in the attached thread[1]:
> Looking at ollama's modifications in ggml, they have too much branching in their MXFP4 kernels and the attention sinks implementation is really inefficient. Along with other inefficiencies, I expect the performance is going to be quite bad in ollama.
ollama responded to that
> Ollama has worked to correctly implement MXFP4, and for launch we've worked to validate correctness against the reference implementations against OpenAI's own.
> Will share more later, but here is some testing from the public (@ivanfioravanti) not done by us - and not paid or
leading to another response
> I am sure you worked hard and did your best.
> But, this ollama TG graph makes no sense - speed cannot increase at larger context. Do you by any chance limit the context to 8k tokens?
> Why is 16k total processing time less than 8k?
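To spell out why that last question is hard to explain away: under the usual KV-cache decoding assumption, each new token attends over everything already in the cache, so per-token cost grows with position and total processing time can only increase with context length. A toy sketch (the function name and constants are made up, purely illustrative, not Ollama's or ggml's code):

    # Toy cost model: each generated token pays a fixed cost plus a cost
    # proportional to how many tokens are already sitting in the KV cache.
    # Units are arbitrary; the point is the monotonic growth.
    def total_decode_time(n_tokens: int,
                          fixed_cost: float = 1.0,
                          cost_per_cached_token: float = 0.001) -> float:
        return sum(fixed_cost + cost_per_cached_token * pos
                   for pos in range(n_tokens))

    print(total_decode_time(8_000))   # smaller
    print(total_decode_time(16_000))  # strictly larger -- never less than the 8k run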
Whether or not Ollama's claim is right, I find this "we used your thing, but we know better, we'll share details later" behaviour a bit weird.
ollama has always had a weird attitude towards upstream, and then they wonder why many in the community don't like them
> they wonder why many in the community don't like them
Do they? They probably care more about their "partners".
As GP said: