Comment by bxtt
2 months ago
CoT is a widely known technique; what was genuinely novel was the degree to which it was embedded through RL training toward an optimal reward trajectory. DeepSeek took it further because their compute constraints forced them to find memory, bandwidth, and parallelism optimizations at every level (GRPO, which drops the separate critic model and so cuts memory; DualPipe, which overlaps computation and communication in pipeline parallelism; kernel bypasses via PTX-level optimization; etc.), and then to lean on MoE for sparse activation plus further distillation. They still operate under the scaling laws relating parameters and tokens, but high-quality data helps circumvent those limits. I'm not surprised they utilized synthetic generation from OpenAI or copied the premise of CoT, but where they deserve the most credit is their infrastructure- and software-level optimizations.
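(For readers unfamiliar with GRPO, here's a minimal sketch of the group-relative advantage idea, using my own names and assumptions rather than DeepSeek's actual code. The memory saving comes from scoring each sampled completion against its own group's rewards, so no separate critic/value network has to be trained or held in memory.)

```python
# Illustrative sketch only: group-relative advantage estimation as used in
# GRPO-style RL. Assumed shapes and names are mine, not DeepSeek's.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each completion's reward against its own group.

    rewards: shape (num_prompts, group_size) -- one row per prompt,
             one column per sampled completion for that prompt.
    Returns advantages with zero mean and unit variance per prompt,
    computed without any learned value function (critic).
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy usage: 2 prompts, 4 sampled completions each.
rewards = np.array([[0.1, 0.9, 0.4, 0.6],
                    [1.0, 0.2, 0.2, 0.6]])
print(group_relative_advantages(rewards))
```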
With that said, I don't think the benchmarks we currently have are strong enough, and the next frontier models are yet to come. At this point U.S. LLM research firms surely understand their lack of infra/hardware optimization (they mostly threw compute at the problem) and will begin paying closer attention to it. Their RL and pretraining will become even stronger, while the newly freed resources can go toward sub-optimizations that have traditionally been avoided because of their computational overhead.