Comment by thatjoeoverthr
11 days ago
Others have mentioned the large context window. This matters.
But also important is embeddings.
Tokens in a classic Markov chain are discrete surrogate keys. “Love” and “love”, for example, are two different tokens, as are “rage” and “fury”.
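To make this concrete, here's a toy bigram chain over opaque string keys; the corpus and counts are made up for illustration:

```python
from collections import defaultdict

# Toy bigram Markov chain: tokens are opaque string keys.
# "Love" and "love" land in entirely separate buckets.
counts = defaultdict(lambda: defaultdict(int))

corpus = ["Love", "conquers", "all", ".", "I", "love", "tea", "."]
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# The chain has learned nothing connecting the two spellings:
# counts["Love"] and counts["love"] are unrelated distributions.
print(dict(counts["Love"]), dict(counts["love"]))
```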
In a modern model, we start with an embedding model, and build a LUT mapping token identities to vectors.
This does two things for you.
First, it solves the problem above: “different” tokens can be conceptually similar. Embedded in a shared space, they can be compared and contrasted along many dimensions, and the model becomes less sensitive to wording.
Second, because the incoming context is now a tensor, it can be used with differentiable models, backpropagation, and so forth.
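A minimal sketch of such a LUT, with hand-picked toy vectors (a real model learns these values):

```python
import numpy as np

# Toy embedding LUT: token id -> dense vector. Values are illustrative only.
vocab = {"Love": 0, "love": 1, "rage": 2, "fury": 3, "teapot": 4}
emb = np.array([
    [0.90, 0.10, 0.00],  # "Love"
    [0.88, 0.12, 0.00],  # "love"  -- near "Love" despite a different token id
    [0.00, 0.90, 0.30],  # "rage"
    [0.00, 0.85, 0.35],  # "fury"  -- near "rage"
    [0.20, 0.20, 0.90],  # "teapot"
])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def v(word):
    return emb[vocab[word]]

print(cosine(v("Love"), v("love")))    # high: same concept, different tokens
print(cosine(v("rage"), v("fury")))    # high: synonyms cluster
print(cosine(v("Love"), v("teapot")))  # low: unrelated
```

The comparison that was impossible over raw token ids becomes a simple dot product over vectors.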
I did something with this lately, actually, using a trained BERT model as a reranker for Markov chain emissions. It’s rough but manages multiturn conversation on a consumer GPU.
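The shape of that loop might look something like this; `toy_score` is a placeholder heuristic standing in for the trained BERT reranker, whose details aren't given here:

```python
import random

def markov_candidates(chain, context_token, k=5):
    """Sample k candidate next tokens from the chain's emission counts."""
    nxt = chain.get(context_token, {})
    if not nxt:
        return []
    tokens, weights = zip(*nxt.items())
    return random.choices(tokens, weights=weights, k=k)

def rerank(candidates, score_fn, context):
    """Keep the candidate the scorer likes best given the full context."""
    return max(candidates, key=lambda tok: score_fn(context, tok))

# Placeholder scorer: penalize tokens already in the context.
# A BERT reranker would score semantic fit instead.
def toy_score(context, token):
    return -context.count(token)

chain = {"the": {"cat": 3, "the": 1, "dog": 2}}
context = ["the", "dog", "and", "the"]
cands = markov_candidates(chain, context[-1], k=5)
best = rerank(cands, toy_score, context)
```

The Markov chain proposes cheaply; the heavier model only has to rank a handful of candidates, which is what keeps it feasible on consumer hardware.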
The case sensitivity or case insensitivity of a token is an implementation detail. I also haven’t seen evidence that the Markov function definitionally can’t use a lookup table of synonyms when predicting the next token.