Comment by refulgentis
1 year ago
> But you don't have to take my word for it - take an open LLM and ask it to generate integers between 7824 and 9954.
Been excited to try this all day, finally got around to it: Llama 3.1 8B did it. It's my app built on llama.cpp, no shenanigans, temp 0, top p 100, 4-bit quantization, model name in screenshot [^1].
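For anyone who wants to try this without my app, here's a minimal sketch using the llama-cpp-python bindings. The model filename is a placeholder for whatever 4-bit GGUF you have locally, and I'm reading "top p 100" as top_p = 1.0:

```python
# Minimal sketch, assuming llama-cpp-python is installed (pip install llama-cpp-python).
# The model path is a placeholder; any 4-bit Llama 3.1 8B GGUF should do.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf", n_ctx=8192)

out = llm(
    "List the integers between 7824 and 8948, one per line.",
    max_tokens=8000,   # room for ~1100 short lines
    temperature=0.0,   # "temp 0" from the comment
    top_p=1.0,         # "top p 100" read as 100% = 1.0
)
print(out["choices"][0]["text"])
```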
I did 7824 to 8948; it protested more for 9954, which made me reconsider whether I'd want to read that many to double check :). And I figured x + 1124 is isomorphic to the original case of you trying it on OpenAI and wondering whether the output was really produced by inference.
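If you'd rather not eyeball a thousand numbers by hand, a quick check is easy to script. This sketch assumes the output has one integer per line and that the endpoints are inclusive; adjust for whatever format the model actually produced:

```python
# Parse the model output and compare against the expected run of integers.
def check_sequence(text: str, lo: int, hi: int) -> bool:
    nums = [int(tok) for tok in text.split() if tok.strip("-").isdigit()]
    return nums == list(range(lo, hi + 1))

# e.g. check_sequence(out["choices"][0]["text"], 7824, 8948)
```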
My prior was that of course it would do this; it's a sequence. I understand the need for token healing in some cases, as you correctly note: e.g. notation in an equation could block the "correct" digit token. I don't see any reason why it'd mess up a sequential list of integers, though (see the tokenizer sketch below).
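To make the token-boundary point concrete, here's how a BPE tokenizer chunks multi-digit numbers. I'm using tiktoken's cl100k_base purely for illustration; Llama 3.1 uses a different vocabulary, so the exact splits and IDs won't match:

```python
# Show how each number splits into token IDs and decoded pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for n in (7824, 7825, 9954):
    ids = enc.encode(str(n))
    print(n, ids, [enc.decode([i]) for i in ids])
```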
In general, as long as it's on topic: I find the handwaving people do about tokenization being a problem a bit silly. I'd definitely caution against using the post you linked as a citation; it reads like a rote repetition of the idea that tokenization causes problems, an idea that spreads like a game of telephone.
It's also a perfect example of the weakness of the genre: just because it sees [5077, 5068, 5938] instead of "strawberry" doesn't mean it can't infer 5077 = st = 0 rs, 5068 = raw = 1 r, 5938 = berry = 2 rs. In fact, it infers things from broken-up subsequences all the time; that's how it works! If single-character tokenization gave free math / counting reliability, we'd very quickly switch to it.
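As a sketch of that point (again with tiktoken's cl100k_base, so the IDs won't match the ones above, which come from some other vocabulary): the per-token letter counts are fixed facts of the vocabulary, exactly the kind of thing a model could in principle learn:

```python
# Decode each token of "strawberry" and count the r's per piece; the pieces
# concatenate back to the original string, so the per-piece counts sum to 3.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
pieces = [enc.decode([i]) for i in ids]
counts = [p.count("r") for p in pieces]
print(list(zip(ids, pieces, counts)), "total r's:", sum(counts))
```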
(Not saying you're advocating for the argument or that you're misinformed; just speaking colloquially, like I would with a friend over a beer.)