Comment by wkat4242

6 months ago

That's a weird thing about Ollama, yes.

It took them very long to support KV cache quantisation too (which drastically reduces the amount of VRAM needed for context!), even though the underlying llama.cpp had offered it for ages. And they had it handed to them on a platter: someone had developed everything and submitted a patch.
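To give a rough idea of why that matters, here's a back-of-the-envelope sketch (not Ollama or llama.cpp code, just an illustrative calculation assuming roughly 8B-class model dimensions; treating a q8_0-style cache as ~1 byte per element and ignoring block scales):

    # Rough KV cache size estimate: 2x (keys + values), one entry per layer per token.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
        return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

    n_layers, n_kv_heads, head_dim = 32, 8, 128   # assumed 8B-class model
    n_ctx = 32_768                                # 32k-token context

    f16 = kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, 2)  # fp16 cache
    q8  = kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, 1)  # ~q8_0 cache

    print(f"f16 KV cache: {f16 / 2**30:.1f} GiB")  # ~4.0 GiB
    print(f"q8  KV cache: {q8 / 2**30:.1f} GiB")   # ~2.0 GiB

So at long context lengths, quantising the KV cache roughly halves (or better, with 4-bit) the VRAM spent on context, which is exactly the part that eats low-VRAM cards alive.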

The developer of that patch was even about to give up: he had to keep rebasing it against upstream while being ignored, with no idea whether it would ever be merged.

They just seem to be really hesitant to offer new features.

Eventually it was merged, and it made a huge difference for people with low-VRAM cards.