
Comment by viraptor

8 hours ago

Why bother testing though? I was hoping this topic had finally died recently, but no. Someone's still interested in testing LLMs on something they're explicitly not designed for and that nobody uses them for in practice. I really hope one day OpenAI will just add a "when asked about character-level changes, insights and encodings, generate and run a program to answer it" rule to their system so we can never hear about it again...

One reason for testing this is that it might indicate how accurately models can explain natural language grammar, especially for agglutinative and fusional languages, which form words by stringing morphemes together. When I tested ChatGPT a couple of years ago, it sometimes made mistakes identifying the components of specific Russian and Japanese words. I haven’t run similar tests lately, but it would be nice to know how much language learners can depend on LLM explanations about the word-level grammars of the languages they are studying.

Later: I asked three LLMs to draft such a test. Gemini’s [1] looks like a good start. When I have time, I’ll try to make it harder, double-check the answers myself, and then run it on some older and newer models.

[1] https://g.co/gemini/share/5eefc9aed193
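
Sketch of the kind of check I have in mind (the gold segmentations and scoring below are illustrative assumptions on my part, not part of Gemini's draft):

  # Toy harness: compare a model's morpheme breakdown against hand-checked answers.
  # The gold segmentations here are examples only and would need to be verified
  # by someone who knows each language.
  gold = {
      "unbelievably": ["un", "believ", "ably"],
      "misunderstanding": ["mis", "understand", "ing"],
  }

  def score(model_answers):
      """Fraction of words whose segmentation exactly matches the gold answer."""
      correct = sum(model_answers.get(word) == parts for word, parts in gold.items())
      return correct / len(gold)

  # model_answers would come from prompting an LLM and parsing its reply, e.g.:
  print(score({"unbelievably": ["un", "believ", "ably"],
               "misunderstanding": ["mis", "understanding"]}))  # 0.5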

  • What you are testing for is fundamentally different from character-level text manipulation.

    A major optimization in modern LLMs is tokenization. This optimization is based on the assumption that we do not care about character-level details, so we can combine adjacent characters into tokens, then train and run the main AI model on shorter sequences drawn from a much larger dictionary of tokens. Given this architecture, it is impressive that AIs can perform character-level operations at all. They essentially need to reverse engineer the tokenization process.
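
    For illustration, you can inspect these splits directly with the Hugging Face tokenizers (this sketch uses GPT-2's tokenizer purely because it is freely downloadable; Gemma's splits, shown below, differ):

      # Sketch: see how a subword (BPE) tokenizer breaks words into tokens.
      # "gpt2" is just a readily available example; other tokenizers split differently.
      from transformers import AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")
      for word in ["unbelievably", "misunderstanding", "strawberry"]:
          print(word, "->", tok.tokenize(word))
      # The splits follow corpus statistics, not morpheme or character boundaries,
      # which is why character-level questions force the model to reverse engineer them.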

    However, morphemes are semantically meaningful, so a quality tokenizer will tokenize at the morpheme level instead of the word level [0]. This is of particularly obvious importance in Japanese, as the lack of spaces between words means that the naive "tokenize on whitespace" approach is simply not possible.

    We can explore the tokenizer of various models here: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...

    Looking at the words in your example, we see the tokenization of the Gemma model (closely related to Gemini) is:

      un-belie-vably
      dec-entral-ization
      bio-degradable
      mis-understanding
      anti-dis-establishment-arian-ism
      пере-писы-ваться
      pere-pis-y-vat-'-s-ya
      до-сто-примеча-тельность
      do-stop-rime-chat-el-'-nost-'
      пре-по-дава-тель-ница
      бе-зо-т-вет-ственности
      bezotvetstvennosti
      же-лез-нодоро-жный
      z-hele-zn-odoro-zh-ny-y
      食べ-させ-られた-くな-かった
      tab-es-aser-are-tak-unak-atta
      図書館
      tos-ho-kan
      情報-技術
      j-ō-h-ō- gij-utsu
      国際-関係
      kok-us-ai- kan-kei
      面白-くな-さ-そうだ
    

    Further, the training data that is likely to be relevant to this type of query probably isolates the individual morphemes while talking about a bunch of words that use them; so it is a much shorter path for the AI to associate these close-but-not-quite morpheme tokens with the actual sequence of tokens that corresponds to what we think of as a morpheme.

    [0] Morpheme-level tokenization is itself a non-trivial problem. However, it had been pretty well solved long before the current generation of AI.

    • Tokenizers are typically optimized for efficiency, not morpheme separation. Even in the examples above the splits aren't morphemes - proper morpheme separation would be un-believ-ably and дост-о-при-меч-а-тельн-ость.

      Regardless of this, Gemini is still one of the best models when it comes to Slavic word formation and manipulation; it can express novel (non-existent) words pretty well and doesn't seem to be confused by wrong separation. This seems to be the result of extensive multilingual training, because e.g. GPT models other than the discontinued 4.5-preview, as well as many Chinese models, have issues with basic coherency in languages that heavily rely on word formation, despite using similar tokenizers.

    • Thanks for the explanation. Very interesting.

      I notice that that particular tokenization deviates from the morphemic divisions in several cases, including ‘dec-entral-ization’, ‘食べ-させ-られた-くな-かった’, and ‘面白-くな-さ-そうだ.’ ‘dec’ and ‘entral’ are not morphemes, nor is ‘くな.’

Why test for something? I find it fascinating when something starts being good at a task it is "explicitly not designed for" (which I don't necessarily agree with - it's more of a side effect of their architecture).

I also don't agree that nobody is using this - there are real-life use cases today, such as people trying to find the meaning of misspelled words.

On a side note, I remember testing Claude 3.7 with the classic "R's in the word strawberry" question through their chat interface, and given that it's really good at tool calls, it actually created a website to a) count the letters with JavaScript and b) visualize them on a page. Other models I tested for the blog post were also giving me Python code for solving the issue. This is definitely already a thing and it works well for some isolated problems.
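
Delegated to code, the question is of course trivial - something along these lines (my own minimal version, not what any particular model generated):

  # Exact count once the problem is handed to a program instead of the token predictor.
  word = "strawberry"
  print(word.lower().count("r"))  # 3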

  • > such as people trying to find meaning of misspelled words.

    That has worked just fine for quite a while. There's apparently enough misspelling in the training data that we don't need precise spelling for it. You can literally write drunken gibberish and it will work.

Character level LLMs are used for detecting insults and toxic chat in video games and the like.

  • Yes, for small messages and a relatively small-scope dictionary, character level will work. But that's very different from what's tested here.

  • Can you give an example of a video game explicitly using character-level LLMs? There were prototypes of char-rnns back in the day for chat moderation but it has significant compute overhead.

  • I figure an LLM would be way better at classifying insults than regexing against a bad word list. Why would character level be desirable?

I remember people making the exact same argument about asking LLMs math questions back when they couldn't figure out the answer to 18 times 7. "They are text token predictors, they don't understand numbers, can we put this nonsense to rest."

The whole point of LLMs is that they do more than we suspected they could. And there is value in making them capable of handling a wider selection of tasks. When an LLM started to count the number of "r"s in "strawberry", OpenAI was taking a victory lap.

  • They're better at maths now, but you still shouldn't ask them maths questions. Same as spelling - whether they improve or not doesn't matter if you want a specific, precise answer; it's the wrong tool, and the better it does, the bigger the trap when it fails unexpectedly.

  • > When an LLM started to count the number of "r"s in "strawberry", OpenAI was taking a victory lap.

    Were they? Or did they feel icky about spending way too much post-training time on such a specific and uninteresting skill?

I made a response to this counterpoint in a blog post I wrote about a similar question posed to LLMs (how many b's are in blueberry): https://news.ycombinator.com/item?id=44878290

> Yes, asking an LLM how many b’s are in blueberry is an adversarial question in the sense that the questioner is expecting the LLM to fail. But it’s not an unfair question, and it’s objectively silly to claim that LLMs such as GPT-5 can operate at a PhD level, but can’t correctly count the number of letters in a word.

It's a subject that the Hacker News bubble and the real world treat differently.

  • It’s like defending a test showing hammers are terrible at driving screws by saying many people are unclear on how to use tools.

    It remains unsurprising that a technology that lumps characters together is not great at processing below its resolution.

    Now, if there are use cases other than synthetic tests where this capability is important, maybe there’s something interesting. But just pointing out that one can’t actually climb the trees pictured on the map is not that interesting.

    • And yet... now many of them can do it. I think it's premature to say "this technology is for X" when what it was originally invented for was translation, and every capability it has developed since then has been an immense surprise.

Wouldn't an LLM that just tokenized by character be good at it?

  • I asked this in another thread, and it would only be better with unlimited compute and memory.

    Because without those, the LLM has to encode way more parameters and ends up with much smaller context windows.

    In a theoretical world it would be better, but it might not be much better.
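
    A back-of-the-envelope sketch of the trade-off (the 4-characters-per-token figure is a common rule of thumb for English BPE tokenizers, not a measurement):

      # Character-level input means many more positions for the same text,
      # and self-attention cost grows roughly with the square of sequence length.
      prompt_chars = 8000
      chars_per_token = 4                           # rough average for English BPE
      bpe_positions = prompt_chars / chars_per_token
      char_positions = prompt_chars
      print(char_positions / bpe_positions)         # ~4x more positions
      print((char_positions / bpe_positions) ** 2)  # ~16x more attention work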