Comment by asjir

1 day ago

I'd be more concerned that the model is only 70B (DeepSeek R1 has 671B), which makes catching up with SOTA harder to begin with.

SOTA performance is relative to model size. If it performs better than other models in the 70B range (e.g. Llama 3.3), then it could be quite useful. Not everyone has the VRAM to run the full-fat DeepSeek R1.
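
For a rough sense of that gap, here's a back-of-envelope sketch (the bytes-per-param math and the ~20% overhead factor are assumptions on my part; real requirements depend on quantization, context length, and the runtime):

```python
# Rough VRAM estimate: parameters * bytes-per-param at a given quantization,
# plus an assumed ~20% overhead for KV cache / activations. Illustrative only.

def est_vram_gb(params_billions: float, bits_per_param: float, overhead: float = 1.2) -> float:
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total * overhead / 1e9

for name, size_b in [("70B model", 70), ("full DeepSeek R1", 671)]:
    for bits in (16, 8, 4):
        print(f"{name:>17} @ {bits:>2}-bit: ~{est_vram_gb(size_b, bits):.0f} GB")
```

Even at 4-bit the full 671B model lands around ~400 GB by this estimate, versus roughly ~40 GB for a 70B model, which is the difference between a multi-GPU server and a single high-end consumer setup.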

  • Also, isn't DeepSeek's a Mixture of Experts, meaning not all params ever get activated in a single forward pass? (rough sketch of what that implies below)

    70B feels like the best balance between being runnable locally and being good enough for regular use.

    maybe not SOTA, but a great first step.
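
    On the MoE point above, a quick sketch: the ~671B total / ~37B active figures are the commonly cited DeepSeek-V3/R1 numbers, and the 2-FLOPs-per-active-param rule of thumb is only an approximation, but it shows why compute per token and memory footprint diverge for MoE models.

    ```python
    # MoE routing activates only a few experts per token, so compute per token
    # tracks the ACTIVE parameter count, while VRAM must still hold the TOTAL
    # parameter count. Figures are approximate, commonly cited numbers.

    TOTAL_PARAMS_B = 671   # all experts must be resident in memory
    ACTIVE_PARAMS_B = 37   # routed per forward pass

    def flops_per_token(active_params_b: float) -> float:
        # Rule of thumb: ~2 FLOPs per active parameter per generated token.
        return 2 * active_params_b * 1e9

    print(f"R1 compute per token    ~ {flops_per_token(ACTIVE_PARAMS_B):.1e} FLOPs")
    print(f"dense 70B per token     ~ {flops_per_token(70):.1e} FLOPs")
    print(f"...but VRAM still has to hold all {TOTAL_PARAMS_B}B params")
    ```

    So per-token compute for R1 is actually in the same ballpark as (or below) a dense 70B model; it's the memory footprint that puts it out of reach locally.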