Comment by adastra22

6 months ago

I'm not sure that's fair, given that the distilled models are almost as good. Do you really think DeepSeek's web interface is giving you access to the 671B model? They're going to be running distilled models there too.

6 comments

Deathmax 6 months ago

It's simple enough to test the tokenizer to determine the base model in use (DeepSeek V3, or a Llama 3/Qwen 2.5 distill).

Using the text "സ്മാർട്ട്", Qwen 2.5 tokenizes it as 10 tokens, Llama 3 as 13, and DeepSeek V3 as 8.

Using DeepSeek's chat frontend, both DeepSeek V3 and R1 return the following response (SSE events edited for brevity):

  {"content":"സ","type":"text"},"chunk_token_usage":1
  {"content":"്മ","type":"text"},"chunk_token_usage":2
  {"content":"ാ","type":"text"},"chunk_token_usage":1
  {"content":"ർ","type":"text"},"chunk_token_usage":1
  {"content":"ട","type":"text"},"chunk_token_usage":1
  {"content":"്ട","type":"text"},"chunk_token_usage":1
  {"content":"്","type":"text"},"chunk_token_usage":1

which totals 8, as expected for DeepSeek V3's tokenizer.
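
For anyone who wants to redo the arithmetic, here is a minimal sketch of the check. The per-tokenizer counts and the `chunk_token_usage` values are copied from this comment; the names `EXPECTED_TOKENS`, `CHUNK_TOKEN_USAGE`, and `matching_tokenizers` are made up for illustration and are not part of any DeepSeek API.

```python
# Token counts for the test string, as quoted in the comment above.
EXPECTED_TOKENS = {"DeepSeek V3": 8, "Qwen 2.5": 10, "Llama 3": 13}

# chunk_token_usage values from the seven SSE events above, in order.
CHUNK_TOKEN_USAGE = [1, 2, 1, 1, 1, 1, 1]


def matching_tokenizers(total, expected):
    """Return the tokenizer names whose expected count equals the observed total."""
    return [name for name, count in expected.items() if count == total]


total_tokens = sum(CHUNK_TOKEN_USAGE)
candidates = matching_tokenizers(total_tokens, EXPECTED_TOKENS)
print(total_tokens, candidates)
```

Because the three tokenizers give three different counts for this string, a single total is enough to tell them apart. It does not say anything about the parameter count of the model behind the tokenizer.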

adastra22 6 months ago
I’m not sure I understand what this comment is responding to. Wouldn’t a distilled DeepSeek still use the same tokenizer? I’m not claiming they are using Llama in their backend. I’m just saying they are likely using a lower-parameter model too.
- zozbot234 6 months ago
  
  The small models that have been published as part of the DeepSeek release are not a "distilled DeepSeek"; they're fine-tuned varieties of Llama and Qwen. DeepSeek may have smaller models internally that are not Llama- or Qwen-based, but if so they haven't released them.
  

zozbot234 6 months ago

Given that the 671B model is reportedly MoE-based, it definitely could be powering the web interface and API. MoE slashes the per-token compute cost, since only a fraction of the parameters is active for any given token, and when serving the model for multiple users you only have to host a single copy of the model params in memory, so the bulk doesn't hurt you as much.
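
A back-of-the-envelope sketch of that trade-off, under loudly stated assumptions: roughly 37B parameters activated per token (the figure reported for DeepSeek V3, not something measured here), about 2 FLOPs per active parameter per token for a forward pass, and 8-bit weights. All names below are hypothetical helpers.

```python
TOTAL_PARAMS = 671e9     # total parameter count discussed in this thread
ACTIVE_PARAMS = 37e9     # assumption: reported activated parameters per token
BYTES_PER_PARAM = 1.0    # assumption: 8-bit weights


def flops_per_token(active_params):
    """Rough forward-pass cost: about 2 FLOPs per active parameter per token."""
    return 2.0 * active_params


def weight_memory_gb(total_params, bytes_per_param):
    """Memory needed to hold one copy of the weights, in GB (1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


moe_flops = flops_per_token(ACTIVE_PARAMS)
dense_flops = flops_per_token(TOTAL_PARAMS)    # hypothetical dense model of equal size
compute_ratio = dense_flops / moe_flops        # how much cheaper MoE is per token
weights_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM)

print(f"compute saving vs dense: ~{compute_ratio:.1f}x")
print(f"weights held in memory: ~{weights_gb:.0f} GB")
```

The compute per token scales with the active parameters, while the memory footprint scales with the total. That footprint is paid once per replica and shared by every user batched onto it, which is the point about the bulk not hurting as much.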

adastra22 6 months ago

They can still serve a lot more users on the same number of GPUs (and they don't have a lot) using distilled models.