
Comment by rjzzleep

13 hours ago

Georgi responded to some of the issues with Ollama's implementation in the linked thread[1]:

> Looking at ollama's modifications in ggml, they have too much branching in their MXFP4 kernels and the attention sinks implementation is really inefficient. Along with other inefficiencies, I expect the performance is going to be quite bad in ollama.
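
For context on what such a kernel does, here is a minimal MXFP4 dequantization sketch in Python (my own illustration, not ggml's or Ollama's code), assuming the OCP MXFP4 layout: 32-element blocks with one shared power-of-two E8M0 scale and 4-bit E2M1 elements. The table lookup is what keeps the per-element work branch-free, which is roughly the property the "too much branching" complaint is about.

    # Illustrative MXFP4 dequantization sketch (not ggml's or Ollama's actual kernel).
    # OCP MXFP4 layout: blocks of 32 elements, one shared power-of-two E8M0 scale,
    # each element a 4-bit E2M1 float. The nibble packing order below is an assumption.

    # All 16 E2M1 code points, indexed by the 4-bit code (sign, exponent, mantissa).
    E2M1_LUT = [
        0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,          # codes 0b0000..0b0111
        -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,  # codes 0b1000..0b1111
    ]

    def dequant_mxfp4_block(scale_e8m0: int, packed: bytes) -> list[float]:
        """Dequantize one 32-element MXFP4 block (16 packed bytes)."""
        scale = 2.0 ** (scale_e8m0 - 127)   # E8M0 scale: biased power-of-two exponent
        out = []
        for byte in packed:                  # table lookup keeps the inner loop branch-free
            out.append(E2M1_LUT[byte & 0x0F] * scale)          # low nibble
            out.append(E2M1_LUT[(byte >> 4) & 0x0F] * scale)   # high nibble
        return out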

ollama responded to that

> Ollama has worked to correctly implement MXFP4, and for launch we've worked to validate correctness against the reference implementations against OpenAI's own.
>
> Will share more later, but here is some testing from the public (@ivanfioravanti) not done by us - and not paid or

leading to another response:

> I am sure you worked hard and did your best.
>
> But, this ollama TG graph makes no sense - speed cannot increase at larger context. Do you by any chance limit the context to 8k tokens?
>
> Why is 16k total processing time less than 8k?
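
The basis of that objection is simple: during token generation each new token attends to the entire KV cache, so per-token work and memory traffic grow with context length, and tokens/sec should fall monotonically as context grows. A crude, purely illustrative back-of-envelope (all numbers hypothetical):

    # Back-of-envelope: why TG speed should drop as context grows.
    # Per generated token, attention reads the whole KV cache once, so memory
    # traffic per token grows roughly linearly with the number of cached tokens.

    def tg_tokens_per_sec(ctx_tokens: int,
                          weight_bytes: float = 12e9,         # hypothetical ~12 GB of weights read per token
                          kv_bytes_per_token: float = 2.4e5,  # hypothetical KV bytes per cached token
                          mem_bw: float = 900e9) -> float:    # hypothetical 900 GB/s memory bandwidth
        """Crude bandwidth-bound estimate of decode speed at a given context size."""
        bytes_per_token = weight_bytes + kv_bytes_per_token * ctx_tokens
        return mem_bw / bytes_per_token

    for ctx in (1024, 8192, 16384):
        print(ctx, round(tg_tokens_per_sec(ctx), 1))
    # Monotonically decreasing: larger context -> more KV traffic -> fewer tokens/sec.
    # A TG curve that gets faster at larger context suggests something else is going
    # on, e.g. the effective context being capped.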

Whether or not Ollama's claim is right, I find this "we used your thing, but we know better, we'll share details later" behaviour a bit weird.

[1] https://x.com/ggerganov/status/1953088008816619637

ollama has always had a weird attitude towards upstream, and then they wonder why many in the community don't like them

  • > they wonder why many in the community don't like them

    Do they? They probably care more about their "partners".

    As GP said:

      By reimplementing this layer, Ollama gets to enjoy a kind of LTS status that their partners rely on