Comment by metadat

1 year ago

I actually did imagine many of these data sources (and some of your ideas are new to me :), but I question how much additional useful capability they would give an LLM when it comes to responding to user queries. Is more data always better? Or what level of curation results in the most useful model?

At some point I expect putting in too much data from semi-random or very old sources will have a detrimental effect on output quality.

In the extreme case, you could feed it /dev/urandom. Haha, only kidding, but I'm sure you get my idea.

Now I'm wondering what a model trained on the past 45 years of Usenet would be like. Or on the entire history of public messages on IRC servers like EFNet or Freenode (afaik they are not fully logged). It is an interesting topic, but I'm still curious and uncertain what effect adding several multiples more data from often lower-fidelity sources (e.g. WhatsApp messages) would have on the capability of the final model. It's hard to see how such sources would be helpful.