Comment by thatjoeoverthr
11 days ago
Others have mentioned the large context window. This matters.
But also important is embeddings.
Tokens in a classic Markov chain are discrete surrogate keys. “Love” and “love”, for example, are two different tokens, as are “rage” and “fury”.
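To make this concrete, here's a toy bigram chain over opaque string keys; the corpus and counts are made up for illustration:

```python
from collections import defaultdict

# Toy bigram Markov chain: tokens are opaque string keys.
# "Love" and "love" land in entirely separate buckets.
counts = defaultdict(lambda: defaultdict(int))

corpus = ["Love", "conquers", "all", ".", "I", "love", "tea", "."]
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# The chain has learned nothing connecting the two spellings:
# counts["Love"] and counts["love"] are unrelated distributions.
print(dict(counts["Love"]), dict(counts["love"]))
```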
In a modern model, we start with an embedding model, and build a LUT mapping token identities to vectors.
This does two things for you.
First, it solves the problem above: “different” tokens can be conceptually similar. Embedded in a shared space, they can be compared and contrasted along many dimensions, and the model becomes less sensitive to wording.
Second, because the incoming context is now a tensor, it can be used with differentiable models, backpropagation, and so forth.
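A minimal sketch of such a LUT, with hand-picked toy vectors (a real model learns these values):

```python
import numpy as np

# Toy embedding LUT: token id -> dense vector. Values are illustrative only.
vocab = {"Love": 0, "love": 1, "rage": 2, "fury": 3, "teapot": 4}
emb = np.array([
    [0.90, 0.10, 0.00],  # "Love"
    [0.88, 0.12, 0.00],  # "love"  -- near "Love" despite a different token id
    [0.00, 0.90, 0.30],  # "rage"
    [0.00, 0.85, 0.35],  # "fury"  -- near "rage"
    [0.20, 0.20, 0.90],  # "teapot"
])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def v(word):
    return emb[vocab[word]]

print(cosine(v("Love"), v("love")))    # high: same concept, different tokens
print(cosine(v("rage"), v("fury")))    # high: synonyms cluster
print(cosine(v("Love"), v("teapot")))  # low: unrelated
```

The comparison that was impossible over raw token ids becomes a simple dot product over vectors.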
I did something with this lately, actually, using a trained BERT model as a reranker for Markov chain emissions. It’s rough but manages multiturn conversation on a consumer GPU.
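The shape of that loop might look something like this; `toy_score` is a placeholder heuristic standing in for the trained BERT reranker, whose details aren't given here:

```python
import random

def markov_candidates(chain, context_token, k=5):
    """Sample k candidate next tokens from the chain's emission counts."""
    nxt = chain.get(context_token, {})
    if not nxt:
        return []
    tokens, weights = zip(*nxt.items())
    return random.choices(tokens, weights=weights, k=k)

def rerank(candidates, score_fn, context):
    """Keep the candidate the scorer likes best given the full context."""
    return max(candidates, key=lambda tok: score_fn(context, tok))

# Placeholder scorer: penalize tokens already in the context.
# A BERT reranker would score semantic fit instead.
def toy_score(context, token):
    return -context.count(token)

chain = {"the": {"cat": 3, "the": 1, "dog": 2}}
context = ["the", "dog", "and", "the"]
cands = markov_candidates(chain, context[-1], k=5)
best = rerank(cands, toy_score, context)
```

The Markov chain proposes cheaply; the heavier model only has to rank a handful of candidates, which is what keeps it feasible on consumer hardware.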
The case sensitivity or case insensitivity of a token is an implementation detail. I also haven’t seen evidence that the Markov function definitionally can’t use a lookup table of synonyms when predicting the next token.