Comment by ALLTaken
2 months ago
This is really not true, my friend. I would love to help you if I had some more time, but let me look for a tutorial.
The FineWeb dataset is already super good, so you can train 7B or 32B parameter models at home.
The >600B parameter model isn't really using all the data effectively, but with a Mac Studio farm you can also train that one at home (if you have enough money to buy at least 100 of them).
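If you want to poke at FineWeb before committing any real compute, here's a minimal sketch for streaming a sample of it. The dataset id "HuggingFaceFW/fineweb" and the "sample-10BT" subset name are my assumptions (not taken from the links below), so adjust them to whatever subset you actually want:

    # Sketch: stream a FineWeb sample without downloading the whole dataset.
    # The repo id and subset name are assumptions, not from the links below.
    from datasets import load_dataset

    fineweb = load_dataset(
        "HuggingFaceFW/fineweb",   # assumed repo id on the HuggingFace Hub
        name="sample-10BT",        # assumed small sample subset
        split="train",
        streaming=True,            # iterate lazily instead of materializing on disk
    )

    # Peek at the first few documents; each record carries a "text" field.
    for i, example in enumerate(fineweb):
        print(example["text"][:200])
        if i == 2:
            break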
Here's the easy way: https://github.com/FareedKhan-dev/train-deepseek-r1
More details: https://www.bentoml.com/blog/the-complete-guide-to-deepseek-...
Here's how DeepSeek-R1-Zero was built, basically from zero to hero, including the weights, the FULL training data, and everything you need to get it running locally or on servers: https://medium.com/@GenerationAI/how-deepseek-r1-zero-was-re...
For $30 USD you can also train a small DeepSeek at home!
https://github.com/Jiayi-Pan/TinyZero
https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero (the model)
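If you just want the R1-Zero weights from that repo on your own disk (not training anything), here's a minimal sketch using huggingface_hub. Fair warning: the full checkpoint is hundreds of GB, so the allow_patterns filter below is a dry run that only grabs the small config/readme files; drop it to fetch everything:

    # Sketch: pull files from the deepseek-ai/DeepSeek-R1-Zero repo linked above.
    # allow_patterns skips the giant weight shards; remove it to download them all.
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(
        repo_id="deepseek-ai/DeepSeek-R1-Zero",
        allow_patterns=["*.json", "*.md"],  # configs and docs only, for a dry run
        local_dir="DeepSeek-R1-Zero",
    )
    print("Files downloaded to:", local_path)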
Ok, but those sources and methods will not reproducibly build the artifact that is the weights of DeepSeek R1 671B, which you claimed are "opensource", because you can't see what they actually used to build it.
DeepSeek didn't publish the exact dataset required to create it. How can something be considered "opensource" when you have zero visibility into "the source" used to create it?
That extended definition of "opensource" is useless, as almost anything that isn't unique in the universe could then be declared "opensource".