Comment by Sohcahtoa82

11 days ago

I'm fully open to being corrected. Just telling me I'm wrong without elaborating does absolutely nothing to foster understanding and learning.

If you still think there's something left to explain, I recommend you read your other responses. Being restricted to the training data is not a property of Markov output. You'd have to be very, very badly confused to think that it was. (And it should be noted that a Markov chain itself doesn't contain any training data, as is also true of an LLM.)

More generally, since an LLM is a Markov chain, it doesn't make sense to try to answer the question "what's the difference between an LLM and a Markov chain?" Here, the question is "what's the difference between a tiny LLM and a Markov chain?", and assuming "tiny" refers to window size, and the Markov chain has a similarly tiny window size, they are the same thing.

  • An LLM is not a Markov chain of the input tokens, because it has internal computational state (the KV cache and residuals).

    An LLM is a Markov process if you include its entire state, but that's a pretty degenerate definition.

    • > An LLM is a Markov process if you include its entire state, but that's a pretty degenerate definition.

      Not any more degenerate than a multi-word bag-of-words Markov chain; it's exactly the same concept: you input a context of words/tokens and get a new word/token. The things you mention there are just optimizations around that abstraction.


  • He said LLMs are creative, yet people have been telling me that LLMs cannot solve problems that are not in their training data. I want this to be clarified or elaborated on.

    • Make up a fanciful problem and ask it to solve it. For example, https://chatgpt.com/s/t_691f6c260d38819193de0374f090925a is unlikely to be found in the training data - I just made it up. Another example of wizards and witches and warriors and summoning... https://chatgpt.com/share/691f6cfe-cfc8-8011-b8ca-70e2c22d36... - I doubt that was in the training data either.

      Make up puzzles of your own and see if it is able to solve them or not.

      The blanket claim of "cannot solve problems that are not in its training data" seems to be something that can be disproven by making up a puzzle from your own human creativity and seeing if it can solve it - or for that matter, how it attempts to solve it.

      It appears that there is some ability for it to reason about new things. I believe that much of this "an LLM can't do X" or "an LLM is parroting tokens it was trained on" comes from trying to claim that everything an LLM creates was created before by a human, and that any use of an LLM is therefore stealing from some human and thus unethical.

      ( ... and maybe if my block world or wizards and warriors and witches puzzle was in the training data somewhere, I'm unconsciously copying something somewhere else and my own use of it is unethical. )


  • 1) being restricted to exact matches in input is definition of Markov Chains

    2) LLMs are not Markov Chains

    • A Markov chain [1] is a discrete-time stochastic process, in which the value of each variable depends only on the value of the immediately preceding variable, and not any variables in the past.

      LLMs are most definitely (discrete-time) Markov chains in this sense: the variables take their values in the space of context windows, and the distribution of the new context window depends only on the previous one.

      A Markov chain is a Markov chain, no matter how you implement it in a computer, whether as a lookup table, or an ordinary C function, or a one-layer neural net or a transformer.

      LLMs and Markov text generators are technically both Markov chains, so some of the same math applies to both. But that's where the similarities end: e.g. the state space of an LLM is the set of possible context windows, whereas the state space of a Markov text generator is usually the set of N-tuples of words.

      And since the question here is how tiny LLMs differ from Markov text generators, the differences certainly matter here.

      [1] https://en.wikipedia.org/wiki/Discrete-time_Markov_chain
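      The window-as-state framing can be checked directly: if the next-token distribution is a function of the last W tokens only, then any two histories sharing the same final window must give identical distributions. A minimal sketch, with a made-up deterministic stand-in for the model:

```python
W = 3  # context window size: the Markov state is the last W tokens

def next_token_dist(history):
    """Stand-in for an LLM forward pass; by construction it sees only
    the current window (the last W tokens), never anything earlier."""
    window = tuple(history[-W:])
    # Any deterministic function of `window` would do; here each candidate
    # token is weighted by how often it already occurs in the window.
    counts = {t: window.count(t) + 1 for t in ("a", "b", "c")}
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

long_history = ["x"] * 100 + ["a", "b", "c"]
short_history = ["a", "b", "c"]   # different past, same last-W tokens
assert next_token_dist(long_history) == next_token_dist(short_history)
```

      The assertion is exactly the Markov property with the window taken as the state: the distant past is invisible once you condition on the current window.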


    • > 1) being restricted to exact matches in input is definition of Markov Chains

      Here's wikipedia:

      > a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

      A Markov chain is a finite state machine in which transitions between states may have probabilities other than 0 or 1. In this model, there is no input; the transitions occur according to their probability as time passes.

      > 2) LLMs are not Markov Chains

      As far as the concept of "Markov chains" has been used in the development of linguistics, they are seen as a tool for text generation. A Markov chain for this purpose is a hash table: the key is a sequence of tokens (in the state-based definition, this sequence is the current state), and the value is a probability distribution over a set of tokens.

      To rephrase this slightly, a Markov chain is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then for the following token you should choose t_1 with probability p_1, t_2 with probability p_2, etc...".

      Then, to tie this back into the state-based definition, we say that when we choose token t_k, we emit that token into the output, and we also dequeue the first token from our representation of the state and enqueue t_k at the back. This brings us into a new state where we can generate another token.
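      The hash table and the dequeue/enqueue state update described above can be sketched in a few lines of Python (a minimal illustration; the toy corpus and order-2 window are made up):

```python
from collections import defaultdict, Counter
import random

def build_chain(tokens, n=2):
    """Map each n-gram (the key, i.e. the state) to counts of the tokens
    that followed it in the corpus."""
    chain = defaultdict(Counter)
    for i in range(len(tokens) - n):
        chain[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return chain

def generate(chain, state, length, seed=0):
    """Sample a token, then dequeue the oldest token from the state and
    enqueue the sampled one, bringing us into a new state."""
    rng = random.Random(seed)
    out = list(state)
    for _ in range(length):
        dist = chain.get(tuple(state))
        if not dist:                      # unseen state: nothing to sample
            break
        toks, weights = zip(*dist.items())
        nxt = rng.choices(toks, weights=weights)[0]
        out.append(nxt)
        state = list(state[1:]) + [nxt]   # shift the window one token
    return out

corpus = "the cat sat on the mat the cat ate the rat".split()
chain = build_chain(corpus, n=2)
print(" ".join(generate(chain, ["the", "cat"], 6)))
```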

      A large language model is seen slightly differently. It is a function. The independent variable is a sequence of tokens, and the dependent variable is a probability distribution over a set of tokens. Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

      Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

      You might notice that these two tables contain the same information organized in the same way. The transformation from an LLM to a Markov chain is the identity transformation. The only difference is in what you say you're going to do with it.
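      One way to see this is that the sampling loop is identical no matter what sits behind the context-to-distribution interface; only the function being called differs. A hedged sketch (the table and the stand-in "model" here are both made up for illustration):

```python
import random

def sample_text(next_dist, context, steps, seed=0):
    """Generic autoregressive loop over any context -> distribution function."""
    rng = random.Random(seed)
    out = list(context)
    for _ in range(steps):
        dist = next_dist(tuple(out[-2:]))   # the last-2-token window is the state
        if not dist:
            break
        toks, probs = zip(*dist.items())
        out.append(rng.choices(toks, weights=probs)[0])
    return out

# Backend 1: a literal lookup table -- the classic Markov text generator.
TABLE = {
    ("the", "cat"): {"sat": 0.5, "ate": 0.5},
    ("cat", "sat"): {"down": 1.0},
    ("cat", "ate"): {"fish": 1.0},
}

def table_lookup(state):
    return TABLE.get(state, {})

# Backend 2: any function that *computes* the same distribution -- this is
# where an LLM forward pass would slot in.  A trivial stand-in:
def computed(state):
    return TABLE.get(state, {})   # imagine a model forward pass here instead

print(sample_text(table_lookup, ["the", "cat"], 3))
print(sample_text(computed, ["the", "cat"], 3))
```

      The loop cannot tell which backend it was given; that is the "identity transformation" in code form.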
