Comment by microtonal
4 days ago
Markov models usually only predict the next token given the two preceding tokens (trigram model) because the data becomes so exceptionally sparse beyond that point that it is impossible to make reliable probability estimates (despite back-off, smoothing, etc.).
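Roughly, the sparsity problem looks like this (toy corpus and names purely for illustration): even with a tiny vocabulary, most possible two-word contexts never occur in the data, so the maximum-likelihood estimate of the next token is simply undefined for them without back-off or smoothing:

```python
from collections import Counter, defaultdict

# Toy corpus; any tokenized text works here.
corpus = "the cat sat on the mat the dog sat on the rug".split()

# Count trigram contexts: P(next | w1, w2) is estimated from these counts.
context_counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    context_counts[(w1, w2)][w3] += 1

vocab = set(corpus)
print(f"observed {len(context_counts)} of {len(vocab) ** 2} possible bigram contexts")

def predict(w1, w2):
    """MLE next-token prediction; undefined for any context never seen in training."""
    nexts = context_counts.get((w1, w2))
    if not nexts:
        return None  # unseen context: no estimate without smoothing
    return nexts.most_common(1)[0][0]

print(predict("sat", "on"))   # -> 'the'
print(predict("dog", "ran"))  # -> None (unseen context)
```

On a real corpus the ratio of observed to possible contexts gets dramatically worse as you condition on more than two preceding words, which is exactly the sparsity being described.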
I recommend reading Bengio et al.'s 2003 paper, which describes this issue in more detail and introduces distributional representations (embeddings) in a neural language model to avoid this sparsity.
While we use transformers and subword tokenization (SentencePiece and the like) now, this paper aptly describes the motivation underpinning modern models.
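For anyone who doesn't want to dig through the paper: the core idea, as I read it, is to replace discrete n-gram counts with dense word vectors fed into a small feed-forward net, so that similar contexts share statistical strength. A rough, untrained numpy sketch; all the sizes and names here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, h, n_ctx = 7, 16, 32, 2         # vocab size, embedding dim, hidden dim, context length

# Dense embedding vectors replace discrete n-gram counts.
E = rng.normal(size=(V, d))           # embedding table: one vector per word
W1 = rng.normal(size=(n_ctx * d, h))  # concatenated context -> hidden layer
W2 = rng.normal(size=(h, V))          # hidden layer -> score for every word in the vocabulary

def next_word_probs(context_ids):
    """P(next word | context) from a tiny feed-forward language model (untrained)."""
    x = E[context_ids].reshape(-1)       # look up and concatenate the context embeddings
    hidden = np.tanh(x @ W1)
    logits = hidden @ W2
    exp = np.exp(logits - logits.max())  # softmax over the whole vocabulary
    return exp / exp.sum()

probs = next_word_probs([2, 3])            # hypothetical ids of a two-word context
print(probs.shape, round(probs.sum(), 6))  # (7,) 1.0 -- shows the shape of the model, not trained behaviour
```

Because nearby contexts map to nearby vectors, the model can assign probability to word sequences it has never counted, which is the paper's answer to the sparsity above.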
> Markov models usually only predict the next token given the two preceding tokens (trigram model) because the data becomes so exceptionally sparse beyond that point
Of course; that's because a Markov chain models probability along a single dimension, with the chain length running along that one dimension, while LLMs and NNs use multiple dimensions (they are meshed, not chained).
I really want to know what the result would look like with a few more dimensions, giving a Markov-mesh-type structure rather than a chain structure.
Thanks for the reference, and I stand corrected. And yes, I had looked at it a long time ago and will give it another read. But I think it is saying that neural language models are a means of approximating a statistical property of a collection of text. That property is what we today think of as "completion"? That is, glorified autocomplete, and not "distributed representations" of the world. Would you agree?
> distributed representations
Distributional representations, not distributed.
https://en.wikipedia.org/wiki/Distributional_semantics#Distr...
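To make the distinction concrete: a distributional representation in that sense can be as simple as a row of a co-occurrence matrix built purely by counting, whereas a distributed representation is a dense learned vector (an embedding). A toy sketch, with an arbitrary corpus and window size:

```python
from collections import defaultdict, Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2  # symmetric context window

# Distributional representation: each word is described by the counts of
# the words it co-occurs with (no learning involved, just counting).
cooc = defaultdict(Counter)
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            cooc[w][corpus[j]] += 1

print(dict(cooc["cat"]))   # {'the': 1, 'sat': 1, 'on': 1}
print(dict(cooc["dog"]))   # overlapping neighbours -> a similar count vector
```

Words with overlapping neighbours end up with similar count vectors; an embedding learned by a neural model is the dense, "distributed" counterpart of that.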
This is the problem. I am arguing there are no distributed representations (cf. Harnad's original symbol grounding problem paper, Hinton, and others). There are "distributional representations" by definition (cf. the Wikipedia entry).