The specific sequence of tokens that comprises Knuth's problem together with an answer to it is not in the training data. A naive probability distribution based on counting token sequences present in the training data would assign it probability 0. The trained network represents an extremely non-naive approach to estimating the ground-truth distribution (the distribution corresponding to what a human brain might have produced).
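To make the "naive" baseline concrete, here is a minimal sketch (toy corpus and function name are hypothetical) of a count-based bigram model: any continuation never seen in the training data gets probability exactly 0, which is the failure mode a trained network avoids by generalizing.

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ate".split()  # toy training data

    # Count how often each token follows each previous token.
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def naive_prob(prev, nxt):
        # Pure counting: unseen continuations get probability 0.
        total = sum(counts[prev].values())
        return counts[prev][nxt] / total if total else 0.0

    print(naive_prob("cat", "sat"))   # seen bigram  -> nonzero
    print(naive_prob("cat", "flew"))  # unseen bigram -> 0.0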
>the distribution that corresponds to what a human brain might have produced...
But the human brain (or any other intelligent brain) does not work by generating a probability distribution over the next word. Even beings that do not have language can think and act intelligently.
LLMs also don't work by generating probability distributions over the next word. Your explanation can't account for why they can generate words at all, let alone sentences.