Comment by jfim

3 days ago

Counting letters is tricky for LLMs because they operate on tokens, not letters. From the perspective of an LLM, if you ask it "this is a sentence, count the letters in it" it doesn't see a stream of characters like we do, it sees [851, 382, 261, 21872, 11, 3605, 290, 18151, 306, 480].
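To make that concrete, here's a toy subword tokenizer (the vocabulary and IDs are made up for illustration; real tokenizers like BPE work similarly but with tens of thousands of learned subwords). The point is that the output is just a list of integers, and nothing in that list exposes the characters inside each piece:

```python
# Toy greedy longest-match subword tokenizer.
# Vocabulary and IDs are hypothetical, chosen just for this demo.
vocab = {"count": 0, "ing": 1, "the": 2, "letter": 3, "s": 4, " ": 5}

def tokenize(text):
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible match starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("counting the letters"))  # [0, 1, 5, 2, 5, 3, 4]
```

A model trained on these IDs only ever sees sequences like `[0, 1, 5, ...]`; the mapping from ID 3 back to the letters l-e-t-t-e-r is not part of its input.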

So what? It knows the number of letters in each token, and can sum them together.

  • How does it know the letters in the token?

    It doesn't.

    There's literally no mapping anywhere of the letters in a token.

    • There is a mapping. An internal, fully learned mapping that's derived from seeing misspellings and words spelled out letter by letter. Some models make it an explicit part of the training with subword regularization, but many don't.

      It's hard to access that mapping though.

      A typical LLM can semi-reliably spell common words out letter by letter, but it can't immediately say how many of each letter a single word contains.

      But spelling the word out first and THEN counting the letters? That works just fine.

    • If it did frequency analysis, I would consider it to have PhD-level intelligence, not just PhD-level knowledge (like a dictionary).
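The spell-then-count trick mentioned above can be sketched as two explicit steps, which is roughly what a model does when prompted to spell first and only then count (the function name and example words are mine, chosen for illustration):

```python
from collections import Counter

def spell_then_count(word, target):
    # Step 1: "spell out" the word letter by letter --
    # the step a model can do semi-reliably on its own.
    letters = list(word)
    # Step 2: count occurrences of the target letter
    # in the spelled-out sequence, which is now trivial.
    return Counter(letters)[target]

print(spell_then_count("strawberry", "r"))  # 3
```

Asking for the count directly forces the model to jump from token IDs straight to an answer; asking it to spell first materializes the characters as tokens it can then count over.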