Comment by fc417fc802
12 days ago
Training methods and architectures keep getting more efficient by leaps and bounds, and scaling up was well into the realm of diminishing returns last I checked. The necessity of exceeding 100B parameters seems questionable. Just because you can get some benefit by piling on ever more data doesn't necessarily mean you have to.
Also keep in mind we aren't talking about a small company wanting to do competitive R&D on a frontier model. We're talking about a world superpower that operates nuclear reactors and built something the size of the Three Gorges Dam deciding that a thing is strategically necessary. If they were willing to spend the money, I am absolutely certain that they could pull it off.
I disagree with the original position that "you could train a state of the art model on a cluster of 12+ year old boxes". Regardless of the country's resources, the best training methods can't make up for the vast difference in compute and scale. The best 70B or 100B models aren't close to GPT, Gemini, or Claude, and there's certainly no chance the best 100B models could have been trained with the compute reasonably available from a single source 10 years ago.
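To make the compute gap concrete, here's a rough back-of-envelope sketch (not from the original comment; the parameter count, token count, the ~6·N·D FLOPs rule of thumb, the old-GPU throughput, and the utilization figure are all assumptions I'm plugging in for illustration):

```python
# Back-of-envelope estimate of the compute gap. All figures are rough
# assumptions, not measurements: training cost of roughly 6 * params * tokens
# FLOPs, a 100B-parameter model trained on ~2T tokens, and a circa-2013 GPU
# (e.g. a Tesla K40-class card) sustaining ~4.3 TFLOPS FP32 at ~30% utilization.

params = 100e9                       # assumed model size: 100B parameters
tokens = 2e12                        # assumed training data: ~2T tokens
train_flops = 6 * params * tokens    # ~1.2e24 FLOPs total

peak_flops_old_gpu = 4.3e12          # assumed FP32 peak of a 2013-era GPU
utilization = 0.30                   # assumed sustained fraction of peak
effective_flops = peak_flops_old_gpu * utilization

seconds = train_flops / effective_flops
gpu_years = seconds / (3600 * 24 * 365)

print(f"total training compute: {train_flops:.2e} FLOPs")
print(f"single old GPU: ~{gpu_years:,.0f} GPU-years")
print(f"old GPUs needed to finish in 1 year: ~{gpu_years:,.0f}")
```

Under these assumptions the raw FLOPs alone work out to on the order of tens of thousands of 2013-era GPUs running flat out for a year, and that's before memory capacity or interconnect even enter the picture.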