Comment by mensetmanusman

2 years ago

Maybe it is too much? If you just train LLMs on the entire Internet, the result will be mostly garbage.

I have heard claims that many popular LLMs, possibly including GPT-4, are trained on sources like Reddit. So maybe it's not quite garbage in, garbage out if you mix in lots of other data. Google also has untold troves of data that is not widely available on the web, including all the books from its decades-long book-indexing project.