Comment by the_mitsuhiko

6 months ago

Ollama needs competition. I'm not sure what drives the people who maintain it, but some of their actions suggest ulterior motives that don't have their users' benefit in mind.

However, such projects require a lot of time and effort, and it's not clear whether this project could be forked and kept alive.

The most recent example off the top of my head is their horrendous aliasing of DeepSeek R1 on their model hub, which misleads users into thinking they are running the full model when, in reality, anything but the 671b alias is one of the distilled models. This has already led to lots of people claiming they are running R1 locally when they are not.

  • The whole DeepSeek-R1 situation gets extra confusing because:

    - The distilled models are also provided by DeepSeek;

    - There are also dynamic quants of (non-distilled) R1 - see [0]. Those, as I understand it, are more "real R1" than the distilled models, and you can get the file size as low as ~140GB with the 1.58-bit quant.

    I actually managed to get the 1.58-bit dynamic quant running on my personal PC, with 32GB RAM, at about 0.11 tokens per second. That is, roughly six tokens per minute. That was with llama.cpp via LM Studio; using Vulkan for GPU offload (up to 4 layers for my RTX 4070 Ti with 12GB VRAM :/) actually slowed things down relative to running purely on the CPU, but either way, it's too slow to be useful with such specs.

    --

    [0] - https://unsloth.ai/blog/deepseekr1-dynamic
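
    For anyone curious what that kind of partial offload looks like outside LM Studio, here's a minimal sketch using llama-cpp-python - the GGUF path, the layer count, and the prompt below are placeholders rather than the exact setup described above:

      # Minimal sketch: load a local GGUF with only a few layers offloaded to the GPU.
      # The model path is a placeholder for wherever the downloaded dynamic quant lives.
      from llama_cpp import Llama

      llm = Llama(
          model_path="./DeepSeek-R1-quant.gguf",  # placeholder path
          n_gpu_layers=4,   # offload only as many layers as fit in VRAM; 0 = pure CPU
          n_ctx=2048,
      )

      out = llm("Summarize mixture-of-experts routing in two sentences.", max_tokens=128)
      print(out["choices"][0]["text"])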

    • > it's too slow to be useful with such specs.

      Only if you insist on realtime output: if you're OK with posting your question to the model and letting it run overnight (or, for some shorter questions, over your lunch break) it's great. I believe that this use case can fit local-AI especially well.

  • I'm not sure that's fair, given that the distilled models are almost as good. Do you really think Deepseek's web interface is giving you access to 671b? They're going to be running distilled models there too.

    • It's simple enough to test the tokenizer to determine the base model in use (DeepSeek V3, or a Llama 3/Qwen 2.5 distill).

      Using the text "സ്മാർട്ട്", Qwen 2.5 tokenizes as 10 tokens, Llama 3 as 13, and DeepSeek V3 as 8.

      Using DeepSeek's chat frontend, both DeepSeek V3 and R1 return the following response (SSE events edited for brevity):

        {"content":"സ","type":"text"},"chunk_token_usage":1
        {"content":"്മ","type":"text"},"chunk_token_usage":2
        {"content":"ാ","type":"text"},"chunk_token_usage":1
        {"content":"ർ","type":"text"},"chunk_token_usage":1
        {"content":"ട","type":"text"},"chunk_token_usage":1
        {"content":"്ട","type":"text"},"chunk_token_usage":1
        {"content":"്","type":"text"},"chunk_token_usage":1
      

      which totals to 8, as expected for DeepSeek V3's tokenizer.
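
      If you want to reproduce those per-model counts yourself, a rough sketch with the Hugging Face tokenizers would look like this (the repo ids are my guesses at representative checkpoints, and at least the Llama one is gated):

        # Count how many tokens each tokenizer produces for the same string.
        from transformers import AutoTokenizer

        text = "സ്മാർട്ട്"
        repos = {
            "Qwen 2.5": "Qwen/Qwen2.5-7B-Instruct",
            "Llama 3": "meta-llama/Meta-Llama-3-8B-Instruct",  # gated repo
            "DeepSeek V3": "deepseek-ai/DeepSeek-V3",
        }
        for name, repo in repos.items():
            tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
            n = len(tok.encode(text, add_special_tokens=False))
            print(f"{name}: {n} tokens")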

    • Given that the 671B model is reportedly MoE-based, it definitely could be powering the web interface and API. MoE slashes the per-inference compute cost - and when serving the model for multiple users you only have to host a single copy of the model params in memory, so the bulk doesn't hurt you as much.
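
      As a rough back-of-the-envelope illustration (using DeepSeek's published figure of roughly 37B activated parameters per token for the 671B model; treat the numbers as approximate):

        # With MoE, per-token compute scales with the *activated* parameters,
        # not the total parameter count, which is why serving 671B is plausible.
        total_params = 671e9   # total parameters
        active_params = 37e9   # approx. activated parameters per token
        print(f"Per-token compute is roughly {active_params / total_params:.1%} "
              f"of what a dense 671B model would need.")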

Can you please explain why you think they may be operating in bad faith?

  • Not parent, but same feeling.

    The first time I got the feeling was because of how they store things on disk and try to get all models rehosted in their own closed library.

    The second time was when I realized it's not obvious at all what their motives are, or that it's a for-profit venture.

    The third time was when trying to discuss things in their Discord: the moderators there constantly shut down conversations citing "Misinformation" and rewrite your messages. You can ask an honest question, have it deleted, and get blocked for a day.

    Just today I asked why the R1 models they're shipping, which are the distilled ones, don't have "distilled" in the name, and why there's no way to tell which tag corresponds to which model. The answer was "if you don't like how things are done on Ollama, you can run your own object registry", which doesn't exactly inspire confidence.

    Another thing I noticed after a while is that there are a bunch of people with zero knowledge of terminals who want to run Ollama, even though Ollama is a project for developers (since you do need to know how to use a terminal). Just making the messaging clearer would help a lot in this regard, but somehow the Ollama team thinks that's gatekeeping and that it's better to teach people basic terminal operations.

    • For what it's worth, HuggingFace provides documentation on how to run any GGUF model inside Ollama[0]. You're not locked into their closed library, nor do you have to wait for them to add new models.

      Granted, they could be a lot more helpful in providing information on how you do this. But this feature exists, at least.

      [0] https://huggingface.co/docs/hub/en/ollama
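
      As I recall, the linked docs boil down to referencing models as hf.co/{username}/{repository}; a minimal sketch of the same thing via the ollama Python client might look like the following (the repo id is just an example, and this assumes a local Ollama server that accepts the same hf.co references as the CLI):

        # Pull a GGUF straight from the Hugging Face Hub and chat with it.
        import ollama

        model = "hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF"  # example repo id
        ollama.pull(model)  # the Ollama server downloads the GGUF from the Hub
        reply = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": "Say hello in one sentence."}],
        )
        print(reply["message"]["content"])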

    • The Ollama team's response (verbatim) when asked what they think of the comments about Ollama in this HN submission: "Who cares? It's the internet... everybody has an opinion... and they're usually bad". Not exactly the response you'd expect from people who should ideally learn from what others think (correct or not) about their project.

Ollama doesn't really need competition. Llama.cpp just needs a few usability updates around the GGUF format so that you can specify a Hugging Face repository, like you already can in vLLM.
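
For comparison, this is roughly what that looks like in vLLM today - you hand it a Hugging Face repo id directly and it resolves and downloads the weights itself (the repo id below is an arbitrary example):

    # vLLM accepts a Hugging Face repo id as the model argument.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # example HF repo id
    params = SamplingParams(max_tokens=64, temperature=0.7)
    outputs = llm.generate(["Explain what a GGUF file is in one sentence."], params)
    print(outputs[0].outputs[0].text)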

I totally agree that ollama needs competition. They have been doing very sketchy things lately. I wish llama.cpp had an alternative wrapper client like ollama.

agreed. but what's wrong with Jan? does ollama utilize resources/run models more efficiently under the hood? (sorry for the naivete)