Comment by lumost

3 years ago

Depends on the quality. A ten-trillion-parameter model should require roughly 10 trillion tokens to train. Put another way, that's roughly 10,000 Wikipedias, 67 million books, or 3-4 GitHubs' worth of text.
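
For concreteness, here's a back-of-envelope sketch of those ratios. The per-source token counts are my own assumptions (roughly 1B tokens for English Wikipedia, ~150k tokens per book, ~3T tokens for GitHub), picked to be consistent with the figures above rather than measured:

```python
# Back-of-envelope check of the corpus comparisons above.
# Assumed (not measured) per-source token counts:
#   English Wikipedia ~ 1e9 tokens, average book ~ 150e3 tokens,
#   public GitHub text ~ 3e12 tokens.
TOKENS_NEEDED = 10e12        # ~1 training token per parameter for a 10T-param model

WIKIPEDIA_TOKENS = 1e9
BOOK_TOKENS = 150e3
GITHUB_TOKENS = 3e12

print(f"Wikipedias: {TOKENS_NEEDED / WIKIPEDIA_TOKENS:,.0f}")  # ~10,000
print(f"Books:      {TOKENS_NEEDED / BOOK_TOKENS:,.0f}")       # ~67 million
print(f"GitHubs:    {TOKENS_NEEDED / GITHUB_TOKENS:.1f}")      # ~3.3
```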

It’s been established that LLMs are sensitive to corpus selection, which is part of why we see anecdotal variance in quality across different LLM releases.

While we could expand the corpus by adding social media comments, self-published books, and other similar text, doing so may negatively impact final model quality and utility.