Comment by Deathmax
6 months ago
It's simple enough to test the tokenizer to determine the base model in use (DeepSeek V3, or a Llama 3/Qwen 2.5 distill).
Using the text "സ്മാർട്ട്", Qwen 2.5 tokenizes as 10 tokens, Llama 3 as 13, and DeepSeek V3 as 8.
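A minimal sketch of that check, assuming the Hugging Face transformers library and these particular public tokenizer repos (the repo IDs are my own guesses at representative checkpoints, not something mandated by the test):

from transformers import AutoTokenizer

text = "സ്മാർട്ട്"

# Illustrative repo IDs; any checkpoint that ships the same tokenizer works.
tokenizers = {
    "Qwen 2.5": "Qwen/Qwen2.5-7B",
    "Llama 3": "meta-llama/Meta-Llama-3-8B",
    "DeepSeek V3": "deepseek-ai/DeepSeek-V3",
}

for name, repo in tokenizers.items():
    tok = AutoTokenizer.from_pretrained(repo)
    ids = tok.encode(text, add_special_tokens=False)
    print(name, len(ids))  # expect roughly 10 / 13 / 8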
Using DeepSeek's chat frontend, both DeepSeek V3 and R1 return the following response (SSE events edited for brevity):
{"content":"സ","type":"text"},"chunk_token_usage":1
{"content":"്മ","type":"text"},"chunk_token_usage":2
{"content":"ാ","type":"text"},"chunk_token_usage":1
{"content":"ർ","type":"text"},"chunk_token_usage":1
{"content":"ട","type":"text"},"chunk_token_usage":1
{"content":"്ട","type":"text"},"chunk_token_usage":1
{"content":"്","type":"text"},"chunk_token_usage":1
which sums to 8, matching the count expected from DeepSeek V3's tokenizer.
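Sanity-checking that arithmetic, reusing only the content and chunk_token_usage values from the events above:

chunks = [
    ("സ", 1),
    ("്മ", 2),
    ("ാ", 1),
    ("ർ", 1),
    ("ട", 1),
    ("്ട", 1),
    ("്", 1),
]
# The streamed deltas reassemble the original string, and the usage fields sum to 8.
print("".join(c for c, _ in chunks))  # സ്മാർട്ട്
print(sum(n for _, n in chunks))      # 8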
I’m not sure I understand what this comment is responding to. Wouldn’t a distilled DeepSeek still use the same tokenizer? I’m not claiming they are using Llama in their backend; I’m just saying they are likely using a lower-parameter model too.
The small models that have been published as part of the DeepSeek release are not a "distilled DeepSeek"; they're fine-tuned varieties of Llama and Qwen. DeepSeek may have smaller models internally that are not Llama- or Qwen-based, but if so, they haven't released them.
Thank you. I’m still learning as I’m sure everyone else is, and that’s a distinction I wasn’t aware of. (I assumed “distilled” meant a compressed parameter size, not necessarily the use of another model in its construction.)