Comment by mjburgess
2 months ago
I don't see how the "different data" aspect is evidenced. If the "modality" of the data is the same, we're choosing a highly specific subset of all possible data -- and, in practice, one radically narrower than just that. Any sufficiently capable LLM is going to have to be trained on a corpus not so dissimilar from the full set of electronic texts that make up the standard corpora used for LLM training.
The idea that a dataset is "different" merely because it's some subset of this maximal corpus is a distinction without a difference. What isn't being proposed is, say, that training only on the works of science fiction yields a zero-info translatable embedding space projectable onto all the works of horror, and the like (or, say, that English sci-fi can be bridged to Japanese sci-fi by way of an English-Japanese horror corpus).
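To make "zero-info translatable" concrete: the strong claim would be that embeddings trained on two disjoint corpora are just rotations of one shared geometry, recoverable by a single linear map. A minimal sketch of how one might test that, using synthetic data (the corpora, dimensions, and noise level here are all hypothetical, not anything measured):

```python
# Illustrative sketch: if two embedding spaces share one "platonic"
# geometry, a single orthogonal map should align them almost perfectly.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)

# Pretend these are embeddings of the same 500 anchor words learned
# from one corpus (e.g. sci-fi).
scifi = rng.normal(size=(500, 64))

# Under the strong hypothesis, the other corpus's embeddings are just
# a rotation of the same arrangement, plus small noise.
R_true, _ = np.linalg.qr(rng.normal(size=(64, 64)))
horror = scifi @ R_true + 0.01 * rng.normal(size=(500, 64))

# Orthogonal Procrustes finds the best rotation aligning the spaces.
R_hat, _ = orthogonal_procrustes(scifi, horror)
residual = np.linalg.norm(scifi @ R_hat - horror) / np.linalg.norm(horror)
print(residual)  # small only if a shared geometry actually exists
```

The point of the comment stands either way: with real LLMs the two "different" training sets overlap so heavily that a low residual here would evidence shared data, not a shared platonic form.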
The very objective of creating LLMs with useful capabilities entails an extremely similar dataset starting point. We do not have so many petabytes of training data that there is any meaningful sense in which OpenAI uses "only this discrete subspace" and Perplexity "yet another". All useful LLMs sample roughly randomly across the maximal corpus we have to hand.
Thus the hype around there being a platonic form of how word tokens ought to be arranged seems wholly unevidenced. Reality has a "natural arrangement" -- but this does not show that our highly lossy encoding of it in English has anything like a unique or natural correspondence. It has a circumstantial correspondence with "all recorded electronic texts", which are the basis for training all generally useful LLMs.