Comment by aesthesia

8 days ago

If you really want to see fully open training pipelines for modern LLMs, Olmo and, to a lesser extent, Nemotron are what you should look at.

https://github.com/allenai/OLMo

https://github.com/NVIDIA-NeMo/Nemotron

5 comments


achrono 8 days ago

After my own fairly exhaustive survey, I can only say '+1'. It's also worth noting that OLMo has had one independent reproduction (albeit not an open one): https://www.amd.com/en/developer/resources/technical-article...

I often wonder why OLMo and Nemotron aren't more popular -- they are the gold standard, roughly the "frontier" of a year ago. With more support behind them, a truly open-source AI system that legitimately challenges OpenAI & Anthropic might not be far away!

yowlingcat 7 days ago

It might change soon. Nemotron 120b was never flashy, but it was always well regarded in the community and had material strengths at long context. The 550b next-gen version is out now and still very fresh. It is too early to tell, but I suspect it will eventually have quite a strong impact. NVIDIA's open-weight models are really good. They're not flashy, but they're always well put together, well documented, well licensed, and in general make for truly great bases for customization, whether it's Nemotron or Cosmos.
Cosmos 2 in particular has already taken the image diffusion world by storm via a finetune (Anima), essentially dethroning the previous budget king, SDXL. I wonder if the newest Nemotron could have the same impact for open-weight LLMs?

spijdar 8 days ago

I'm not deeply familiar with either, though I know Olmo better. My impression is that Nemotron is newer -- why is it less applicable? Is it not totally open like Olmo?

lambda 8 days ago
Olmo releases their full datasets.
Nemotron only releases portions of some of their datasets, like the source code dataset that they pretrain on.
For example, from https://docs.nvidia.com/nemotron/latest/nemotron/super3/pret... :
"Open-source data coverage: The released datasets cover an estimated 8–10T tokens (~40–50% of the internal 25T blend). Missing categories include code (~14% of blend), nemotron-cc-code (~2%), crawl++ (~2%), and academic text (~2%). Users should supplement with their own data for these categories and adjust train_iters accordingly."
K2 Think V2 is another fully open model like Olmo, with full datasets released.
Note that the Nemotron models are generally stronger than Olmo and K2 Think V2 (according to Artificial Analysis benchmarks), and there is a lot of overlap in their datasets: many are built from the same sources with different filtering, and Olmo and K2 Think V2 have both used some Nemotron datasets.
But yeah, Nemotron is a modern and fairly capable LLM. Even the 122b is more capable than DeepSeek R1 (a 671b model) on most benchmarks, and there's also the recently released 550b Ultra.
It does have a fully open training recipe, with just some data missing from its datasets. If you want a fully open pipeline, it's a good place to start; you just need to find reasonably high-quality data to fill in the missing categories and get up to the full token count.
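
To make the "adjust train_iters accordingly" part of that quote concrete, here is a rough back-of-the-envelope sketch. The 25T and 8–10T token figures come from the quoted docs; the batch size and sequence length are made-up placeholders, and I'm assuming train_iters counts optimizer steps, so that tokens = train_iters × global batch size × sequence length. Check that against the actual recipe config before relying on it. Also note that the sketch divides token counts directly, so its coverage fractions need not match the docs' own percentage estimate.

```python
# Figures taken from the quoted Nemotron docs excerpt.
BLEND_TOKENS = 25 * 10**12                     # internal blend: 25T tokens
RELEASED_TOKENS = (8 * 10**12, 10 * 10**12)    # released: estimated 8-10T tokens

# Hypothetical training shape, NOT from the docs; substitute your own config.
GLOBAL_BATCH_SIZE = 1024   # sequences per optimizer step (assumed)
SEQ_LENGTH = 8192          # tokens per sequence (assumed)


def coverage(released_tokens: int, blend_tokens: int) -> float:
    """Fraction of the full blend covered by the released data."""
    return released_tokens / blend_tokens


def tokens_to_supplement(released_tokens: int, blend_tokens: int) -> int:
    """Tokens you would need to source yourself to reach the full blend size."""
    return max(blend_tokens - released_tokens, 0)


def train_iters_for(total_tokens: int, global_batch_size: int, seq_length: int) -> int:
    """Whole optimizer steps needed to consume total_tokens.

    Assumes one step consumes global_batch_size * seq_length tokens.
    """
    return total_tokens // (global_batch_size * seq_length)


if __name__ == "__main__":
    for released in RELEASED_TOKENS:
        frac = coverage(released, BLEND_TOKENS)
        missing = tokens_to_supplement(released, BLEND_TOKENS)
        iters = train_iters_for(released, GLOBAL_BATCH_SIZE, SEQ_LENGTH)
        print(f"released={released:.1e} coverage={frac:.0%} "
              f"missing={missing:.1e} train_iters={iters}")
    full = train_iters_for(BLEND_TOKENS, GLOBAL_BATCH_SIZE, SEQ_LENGTH)
    print(f"full blend train_iters={full}")
```

If you train only on the released data, you'd use the smaller train_iters; if you fill the gap with your own data, you'd size the run against whatever total token count you end up with.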

gnerd00 7 days ago

Great to see OLMo mentioned here; interesting project.