Comment by zoogeny
1 year ago
My take is that this has more to do with the coming years than the current climate.
I think it is just a consequence of the cost of getting to the next level of AI. The estimates for training a GPT-5 level foundational model are on the order of $1 billion, and it isn't going to get cheaper from there. So even if your model is a bit better than the free models available today, unless you are spending that $1 billion+ now, you are going to look weak in 6 months to a year. And by then the GPT-6+ training costs will be even higher, so you can't just wait and play catch-up. You are probably right as well that there is a fear of a competitor built on an open source model getting close enough in capability to generate bad publicity.
I imagine character.ai (like Inflection) did the calculations and realized there was no clear path to recouping that magnitude of investment from their current product lines. And when they brainstormed ways to increase returns, they found that none of the paths strictly required a proprietary foundational model. Just my speculation, of course.
What does "GPT-5" and "GPT-6" even mean? I gently suggest they aren't currently meaningful, it's not like CPU Ghz frequency steppings. If anything it's more akin to chip fab processes, e.g. 10nm, 5nm, 3nm. Each reduction in feature size requires new physical technology, chip architecture, and a black box bag of tricks to eke out better performance.
Where is the data for a billion-dollar training run going to come from? These companies are already training on most of the valuable information that is available.
While training will surely be expensive, I think it's even more expensive and challenging to organize and harness the brainpower to figure out and execute the next meaningful step forward.
I think the comparison with fab nodes is apt. We do not know how much of a performance gain each step will bring, but we do know it is going to be very expensive.
Data availability for LLMs is becoming trickier. There are at least two avenues being explored: A) synthetic data (generated in controlled ways) and B) video data, in particular multi-modal embeddings across image/audio/text sequences. This could support several orders of magnitude more compute.
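To make avenue (B) slightly more concrete, here is a minimal sketch of a CLIP-style contrastive objective that aligns image and text embeddings in a shared space; the encoders, dimensions, and temperature are placeholder assumptions for illustration, not any lab's actual recipe:

    # Minimal sketch (assumptions only): symmetric contrastive loss that pulls
    # paired image/text embeddings together and pushes mismatched pairs apart.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_emb, text_emb, temperature=0.07):
        image_emb = F.normalize(image_emb, dim=-1)        # unit-length image vectors
        text_emb = F.normalize(text_emb, dim=-1)          # unit-length text vectors
        logits = image_emb @ text_emb.t() / temperature   # (B, B) cosine similarities
        targets = torch.arange(logits.size(0))            # i-th image pairs with i-th text
        loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
        loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
        return (loss_i2t + loss_t2i) / 2

    # Toy usage: random tensors stand in for encoder outputs over a batch of 8 pairs.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(contrastive_loss(img, txt).item())

The point is only that aligning modalities like this lets video and audio corpora feed the same training pipeline as text, which is one way the usable data pool could grow.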
> What does "GPT-5" and "GPT-6" even mean?
That is shorthand for "the next two generations of LLMs created by OpenAI". It is not meant as a forward-looking statement on how those models will be branded in the consumer market. Nor is it a prophecy that OpenAI will maintain its premier position, since Anthropic or even a new entrant to the field might be the company to achieve that next step change.
> I think it's even more expensive and challenging to organize and harness the brainpower
Then you should invest with that in mind. What I find interesting is that Microsoft (with its acquisition of the research arm of Inflection) and Google (with its acquisition of the research arm of character.ai) seem to treat foundational models and products as distinct categories. It is that distinction I am interested in.
There is no doubt some huge value in productizing these LLMs. However, it appears the productization of LLMs and the advancement of the foundational models themselves are being decoupled by the market. That is, it seems the market is segregating risk: product companies can raise money to build products; "platform" companies (e.g. Microsoft, Google) can raise money to build foundational models. What seems less popular, based on these recent moves, is companies raising money to build foundational models for the purposes of specific products.
> Where is the data for a billion-dollar training run going to come from? These companies are already training on most of the valuable information that is available.
This is a lack of imagination.
All books ever written; all movies released on DVD; all music released on CD; all TV programs; all radio programs; all WhatsApp messages; all of YouTube; blueprints from architecture and mechanical engineering.
Copyright and logistics are definitely an issue, but there is more data out there.
I actually did imagine many of these data sources (and some of your ideas are new to me :), but I question how much additional useful capability they would give an LLM when responding to user queries. Is more data always better? Or does some level of curation result in the most useful model?
At some point I expect putting in too much data from semi-random or very old sources will have a detrimental effect on output quality.
In the extreme case, you could feed it /dev/urandom. Haha, only kidding, but I'm sure you get my idea.
Now I'm wondering what a model trained on the past 45 years of Usenet would be like. Or on the full history of public messages on IRC networks like EFNet or Freenode (afaik they are not fully logged). It is an interesting topic, but I'm still curious and uncertain what effect adding several multiples of data from often lower-fidelity sources (e.g. WhatsApp messages) would have on the capability of the final model. It's hard to see how such sources would be helpful.