Comment by marcd35
10 hours ago
While Qwen2.5 was pre-trained on 18 trillion tokens, Qwen3 uses nearly twice that amount, with approximately 36 trillion tokens covering 119 languages and dialects.
Thanks for the info, but I don't think it answers the question. I mean, you could train a 20-node network on 36 trillion tokens. Wouldn't make much sense, but you could. So I was asking more about the number of nodes / parameters or GB of file size.
In addition, there seem to be many different versions of Qwen3. E.g. here's the list from the Ollama library: https://ollama.com/library/qwen3/tags
This is the Max series of models with unreleased weights, so it's probably larger than the largest released one. Also, when referring to models, use Hugging Face or ModelScope (wherever the model is published); Ollama is a really poor source of model info. They have some bad naming (like confusing people about the DeepSeek R1 models), renaming, and other issues with model names, and they default to q4 quants, which is a good sweet spot but really degrades performance compared to the raw weights.
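Since the original question was about parameter count vs. GB of file size: a rough back-of-envelope conversion is bytes ≈ parameters × bits-per-weight / 8, which is also why the quant level matters as much as the parameter count when comparing downloads. A minimal Python sketch, with made-up example numbers (the 32B figure and ~4.5 effective bits for a q4 GGUF quant are my assumptions, not from this thread):

```python
# Rough estimate of on-disk model size from parameter count and
# bits per weight (the precision/quantization level).

def approx_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB: parameters * bits, to bytes, to GB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9

# Example: a hypothetical 32B-parameter model at different precisions.
for label, bits in [("fp16", 16), ("q8", 8), ("q4 (~Ollama default)", 4.5)]:
    print(f"{label:>22}: ~{approx_size_gb(32e9, bits):.0f} GB")
# fp16 ~64 GB, q8 ~32 GB, q4 ~18 GB for the same parameter count.
```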