Comment by lumost

3 years ago

We are also starting to run out of high-quality corpus to train on at such model scales. While video offers another large set of data, we'll have to look at further RL approaches in the next few years to continue scaling datasets.

Is there any source for this, aside from it being oft repeated by internet speculators? Ilya has said the textual data situation is still quite good.

  • If they're running into any limits in that respect, my bet would be that the limit is only on what is easily accessible to them without negotiating access, and that they can easily go another order of magnitude or two just with more incremental effort to strike deals, e.g. with newspaper archives, national libraries and the like. (I haven't looked at other languages, but GPT3's Norwegian corpus - I don't know of any numbers for GPT4 - could easily be scaled by at least two orders of magnitude with access to the Norwegian national library collection alone.)

  • Depends on the quality. A ten-trillion-parameter model should require roughly 10 trillion tokens to train. Put another way, that is roughly 10k Wikipedias, 67 million books, or 3-4 GitHubs (rough arithmetic sketched after this list).

    It’s been established that LLMs are sensitive to corpus selection, which is part of why we see anecdotal variance in quality across different LLM releases.

    While we could increase the corpus by adding social media comments, self-published books, and other similar text, this may negatively impact final model quality/utility.

  • Yeah, I need a source on this. GPT3's corpus is, what, a few hundred TB? Absolutely nowhere near the total amount of tokens we could collect, e.g. from YouTube/podcasts.
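
A minimal back-of-the-envelope sketch of the "10 trillion tokens" comparison above. The per-unit token counts are back-solved from the figures quoted in that comment and are illustrative assumptions, not measured corpus sizes.

```python
# Rough arithmetic behind the "10 trillion tokens" comparison above.
# The per-unit sizes are back-solved from the comment's own figures and
# are illustrative assumptions, not measured corpus sizes.

tokens_needed = 10e12  # ~10T tokens assumed for a ~10T-parameter model (ratio from the comment)

# Implied size of each reference corpus, in tokens:
tokens_per_wikipedia = tokens_needed / 10_000   # "roughly 10k Wikipedias" -> ~1e9 tokens each
tokens_per_book      = tokens_needed / 67e6     # "67 million books"       -> ~1.5e5 tokens each
tokens_per_github    = tokens_needed / 3.5      # "roughly 3-4 GitHubs"    -> ~2.9e12 tokens each

print(f"per Wikipedia: {tokens_per_wikipedia:.1e} tokens")
print(f"per book:      {tokens_per_book:.1e} tokens")
print(f"per GitHub:    {tokens_per_github:.1e} tokens")
```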