Comment by Deathmax

1 year ago

The most recent one off the top of my head is their horrendous aliasing of DeepSeek R1 on their model hub, which misleads users into thinking they are running the full model when in fact every alias except 671b is one of the distilled models. This has already led to lots of people claiming that they are running R1 locally when they are not.

9 comments

TeMPOraL 1 year ago

The whole DeepSeek-R1 situation gets extra confusing because:

- The distilled models are also provided by DeepSeek;

- There are also dynamic quants of (non-distilled) R1; see [0]. Those, as I understand it, are more "real R1" than the distilled models, and you can get as low as ~140GB file size with the 1.58-bit quant.

I actually managed to get the 1.58-bit dynamic quant running on my personal PC, with 32GB RAM, at about 0.11 tokens per second. That is, roughly six tokens per minute. That was with llama.cpp via LM Studio; using Vulkan for GPU offload (up to 4 layers for my RTX 4070 Ti with 12GB VRAM :/) actually slowed things down relative to running purely on the CPU, but either way, it's too slow to be useful with such specs.

[0] - https://unsloth.ai/blog/deepseekr1-dynamic
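As a rough sanity check on the figures above, here is the arithmetic as a Python sketch. It assumes a uniform 1.58 bits per weight across all 671B parameters, which a dynamic quant does not actually do (some layers are kept at higher precision), so read the size as a lower bound rather than the real file size. The function names are made up for illustration.

```python
# Back-of-envelope check. Uniform bit width is a simplifying assumption;
# the real dynamic quant mixes precisions, so its file is somewhat larger.
PARAMS = 671e9            # total parameters in full (non-distilled) R1
BITS_PER_WEIGHT = 1.58    # nominal width of the smallest dynamic quant
TOKENS_PER_SECOND = 0.11  # throughput reported in the comment above


def quant_size_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes at a uniform bit width."""
    return params * bits_per_weight / 8 / 1e9


def tokens_per_minute(tokens_per_second: float) -> float:
    """Convert a per-second generation rate to a per-minute rate."""
    return tokens_per_second * 60


size_gb = quant_size_gb(PARAMS, BITS_PER_WEIGHT)
rate = tokens_per_minute(TOKENS_PER_SECOND)
print(f"{size_gb:.1f} GB lower bound, {rate:.1f} tokens/min")
```

The lower bound lands a little under the ~140GB quoted for the file, which is consistent with some layers being stored at more than 1.58 bits.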

zozbot234 1 year ago

> it's too slow to be useful with such specs.
Only if you insist on realtime output: if you're OK with posting your question to the model and letting it run overnight (or, for some shorter questions, over your lunch break) it's great. I believe that this use case can fit local-AI especially well.
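To put a number on "overnight", here is a small sketch of the token budget at the 0.11 tokens per second reported upthread. The 8-hour night and 1-hour lunch break are assumed durations, not figures from the thread, and the helper name is hypothetical.

```python
# Token budget for a batch-style run at a fixed generation rate.
# The durations used below are assumptions chosen for illustration.
def tokens_in(hours: float, tokens_per_second: float = 0.11) -> int:
    """Tokens generated in `hours` at a constant `tokens_per_second`."""
    return round(hours * 3600 * tokens_per_second)


overnight = tokens_in(8)  # assumed 8-hour night
lunch = tokens_in(1)      # assumed 1-hour lunch break
print(f"overnight: {overnight} tokens, lunch break: {lunch} tokens")
```

Keep in mind that a reasoning model spends part of that budget on its thinking output before it reaches the answer.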

adastra22 1 year ago

I'm not sure that's fair, given that the distilled models are almost as good. Do you really think DeepSeek's web interface is giving you access to 671b? They're going to be running distilled models there too.

Deathmax 1 year ago
It's simple enough to test the tokenizer to determine the base model in use (DeepSeek V3, or a Llama 3/Qwen 2.5 distill).
Using the text "സ്മാർട്ട്", Qwen 2.5 tokenizes as 10 tokens, Llama 3 as 13, and DeepSeek V3 as 8.
Using DeepSeek's chat frontend, both DeepSeek V3 and R1 return the following response (SSE events edited for brevity):
{"content":"സ","type":"text"},"chunk_token_usage":1
{"content":"്മ","type":"text"},"chunk_token_usage":2
{"content":"ാ","type":"text"},"chunk_token_usage":1
{"content":"ർ","type":"text"},"chunk_token_usage":1
{"content":"ട","type":"text"},"chunk_token_usage":1
{"content":"്ട","type":"text"},"chunk_token_usage":1
{"content":"്","type":"text"},"chunk_token_usage":1
which totals 8, as expected for DeepSeek V3's tokenizer.
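The tally can be sketched in a few lines of Python. The events are transcribed by hand from the edited excerpt above as (content, chunk_token_usage) pairs, since the excerpt is not valid JSON as shown, and the expected counts per tokenizer are the ones quoted in this comment. The function name is hypothetical.

```python
# Chunks transcribed from the SSE excerpt: (content, chunk_token_usage).
CHUNKS = [
    ("സ", 1),
    ("്മ", 2),
    ("ാ", 1),
    ("ർ", 1),
    ("ട", 1),
    ("്ട", 1),
    ("്", 1),
]

# Token counts for the test string, as stated in the comment.
EXPECTED_COUNTS = {
    "Qwen 2.5": 10,
    "Llama 3": 13,
    "DeepSeek V3": 8,
}


def matching_tokenizers(total: int) -> list[str]:
    """Return the tokenizers whose expected count equals the observed total."""
    return [name for name, count in EXPECTED_COUNTS.items() if count == total]


total = sum(usage for _, usage in CHUNKS)
candidates = matching_tokenizers(total)
print(total, candidates)
```

A single probe string only separates tokenizers whose counts differ on it, so this identifies the tokenizer family and nothing more about the model behind it.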
- adastra22 1 year ago
  
  I’m not sure I understand what this comment is responding to. Wouldn’t a distilled DeepSeek still use the same tokenizer? I’m not claiming they are using Llama in their backend. I’m just saying they are likely using a lower-parameter model too.
  
zozbot234 1 year ago
Given that the 671B model is reportedly MoE-based, it definitely could be powering the web interface and API. MoE slashes the per-inference compute cost - and when serving the model for multiple users you only have to host a single copy of the model params in memory, so the bulk doesn't hurt you as much.
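A rough sketch of why MoE changes the serving math. The 37B activated-parameters figure comes from DeepSeek's published V3 description, not from this thread, and the "2 FLOPs per active parameter per token" rule is a common approximation that ignores attention over the context. The names below are hypothetical.

```python
# Compare per-token compute of an MoE model with a dense model of equal size.
TOTAL_PARAMS = 671e9   # parameters that must be held in memory
ACTIVE_PARAMS = 37e9   # parameters used per token (assumption: V3 report figure)


def flops_per_token(params_used: float) -> float:
    """Approximate forward-pass FLOPs per token (rule of thumb: 2 per parameter)."""
    return 2 * params_used


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
moe_flops = flops_per_token(ACTIVE_PARAMS)
dense_flops = flops_per_token(TOTAL_PARAMS)
print(f"active fraction: {active_fraction:.1%}")
print(f"compute saving vs dense: {dense_flops / moe_flops:.1f}x")
```

Memory still scales with the total parameter count, which is the "bulk" mentioned above; batching many users over one resident copy is what spreads that cost.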
- adastra22 1 year ago
  
  They can still run a lot more users on the same number of GPUs (and they don't have a lot) using distilled models.