Comment by furyofantares

3 years ago

Is there any source for this, aside from it being oft repeated by internet speculators? Ilya has said the textual data situation is still quite good.

If they're running into any limits in that respect, my bet would be that the limit is only on what is easily accessible to them without negotiating access, and that they could go another order of magnitude or two just with incremental effort to strike deals, e.g. with newspaper archives, national libraries, and the like. (I haven't looked at other languages, but GPT3's Norwegian corpus, since I don't know of any numbers for GPT4, could easily be scaled by at least two orders of magnitude with access to the Norwegian national library collection alone.)

Depends on the quality. A ten-trillion-parameter model should require roughly 10 trillion tokens to train. Put another way, that's roughly 10k Wikipedias, 67 million books, or 3-4 GitHubs.
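A rough back-of-envelope check of those equivalences. The per-source token counts below are assumed round numbers chosen to illustrate the scale, not measured corpus sizes:

```python
# Back-of-envelope check of the "10T tokens = 10k Wikipedias = 67M books = 3-4 GitHubs"
# equivalences. All per-source figures are assumptions, not measured values.

TARGET_TOKENS = 10e12            # 10 trillion training tokens

# Assumed sizes (rough round numbers consistent with the ratios above):
TOKENS_PER_WIKIPEDIA = 1e9       # ~1B tokens per encyclopedia-sized corpus
TOKENS_PER_BOOK = 150_000        # ~100k words per book at ~1.5 tokens per word
TOKENS_PER_GITHUB = 3e12         # ~3T tokens of public code, very rough guess

print(f"Wikipedias needed: {TARGET_TOKENS / TOKENS_PER_WIKIPEDIA:,.0f}")  # ~10,000
print(f"Books needed:      {TARGET_TOKENS / TOKENS_PER_BOOK:,.0f}")       # ~67,000,000
print(f"GitHubs needed:    {TARGET_TOKENS / TOKENS_PER_GITHUB:.1f}")      # ~3.3
```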

It’s been established that LLMs are sensitive to corpus selection, which is part of why we see anecdotal variation in quality across different LLM releases.

While we could increase the corpus by adding social media comments, self-published books, and other similar text, this may negatively impact the final model's quality and utility.

Yeah, I need a source on this. GPT3's corpus is, what, a few hundred TB? Absolutely nowhere near the total amount of tokens we could collect, e.g. from YouTube/podcasts.
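For scale, a minimal sketch converting corpus size on disk into an approximate token count, assuming roughly 4 bytes of UTF-8 English text per token; the real ratio depends on the tokenizer and the language mix, so treat these as order-of-magnitude figures only:

```python
# Rough conversion from plain-text corpus size to token count,
# assuming ~4 bytes of UTF-8 English text per token (an assumption,
# not a property of any particular tokenizer).

BYTES_PER_TOKEN = 4

def approx_tokens(size_in_tb: float) -> float:
    """Approximate token count for a plain-text corpus of the given size in TB."""
    return size_in_tb * 1e12 / BYTES_PER_TOKEN

for tb in (0.5, 10, 100, 300):
    print(f"{tb:6.1f} TB of text ~ {approx_tokens(tb):.2e} tokens")
# 0.5 TB  ~ 1.25e11 tokens (~125 billion)
# 300 TB  ~ 7.50e13 tokens (~75 trillion)
```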