Comment by pradn
2 years ago
No, Google is on a more level playing field than you think. It certainly can't train on enterprise data, and of course not on private user data like emails. Cross-division data sharing is tough as well, because regulators don't like it for anti-monopoly reasons. OpenAI can scrape YouTube all it wants, but DeepMind may not be able to just train against all of YouTube just like that.
We might soon get to a point where every player is using pretty much all the low-cost data there is. Everyone will use all the public internet data available, augmented by as many private datasets as they can afford.
The improvements we can expect to see in the next few years look like a Drake equation.
LLM performance delta = data quality x data quantity x transformer architecture tweaks x compute cost x talent x time.
The ceiling for the cost parameters in this equation is determined by the expected market opportunity, at the margin: how much more of the market can you capture if you have the better tech?
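As a toy illustration of why the multiplicative framing matters (all factor values here are hypothetical, chosen only to show the mechanics): in a product, any single weak factor drags the whole result down, unlike an additive model where strengths can paper over weaknesses.

```python
import math

# Hypothetical relative scores, 1.0 = industry baseline.
# These numbers are made up purely for illustration.
factors = {
    "data_quality": 1.2,
    "data_quantity": 1.5,
    "architecture_tweaks": 1.1,
    "compute_cost": 0.9,   # below baseline: compute is relatively expensive
    "talent": 1.3,
    "time": 1.0,
}

# Multiplicative model: the overall delta is the product of all factors.
performance_delta = math.prod(factors.values())
print(f"relative performance delta: {performance_delta:.2f}")  # ~2.32
```

Note that dropping any one factor to near zero (say, no access to quality data) collapses the whole product, no matter how strong the other terms are.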
> DeepMind may not be able to just train against all of YouTube just like that
What? Why?
> data quality x data quantity x transformer architecture tweaks x compute cost x talent x time.
Google arguably has the most data (its search index), the best data (already ranked and curated, along with datasets like books), the cheapest compute (they literally run their own cloud offering and are one of the biggest purchasers of H100s), and the oldest and most mature ML team.