Comment by gizmo686
13 hours ago
What you are testing for is fundamentally different from character-level text manipulation.
A major optimization in modern LLMs is tokenization. This optimization rests on the assumption that we do not care about character-level details, so adjacent characters can be combined into tokens, and the main model is then trained and run on shorter sequences drawn from a much larger vocabulary of tokens. Given this architecture, it is impressive that AIs can perform character-level operations at all: they essentially need to reverse-engineer the tokenization process.
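To make that concrete, here is a minimal sketch using the Hugging Face transformers library (an assumption on my part; GPT-2's tokenizer is used only because it is small and ungated, so the exact splits will differ from Gemini's):

    # Sketch: a subword tokenizer hides character boundaries from the model.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    word = "unbelievably"
    tokens = tokenizer.tokenize(word)
    ids = tokenizer.convert_tokens_to_ids(tokens)

    print(tokens)  # a few subword pieces, not 12 separate letters
    print(ids)     # the model only ever sees these integer ids

To answer something like "how many e's are in this word?", the model has to recover each token's spelling from training data; it never sees the characters directly.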
However, morphemes are semantically meaningful, so a quality tokenizer will tokenize at the morpheme level rather than the word level [0]. This is of particularly obvious importance in Japanese, where the lack of spaces between words means that the naive "tokenize on whitespace" approach is simply not possible.
We can explore the tokenizers of various models here: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...
Looking at the words in your example, we see that the Gemma model (closely related to Gemini) tokenizes them as:
un-belie-vably
dec-entral-ization
bio-degradable
mis-understanding
anti-dis-establishment-arian-ism
пере-писы-ваться
pere-pis-y-vat-'-s-ya
до-сто-примеча-тельность
do-stop-rime-chat-el-'-nost-'
пре-по-дава-тель-ница
бе-зо-т-вет-ственности
bezotvetstvennosti
же-лез-нодоро-жный
z-hele-zn-odoro-zh-ny-y
食べ-させ-られた-くな-かった
tab-es-aser-are-tak-unak-atta
図書館
tos-ho-kan
情報-技術
j-ō-h-ō- gij-utsu
国際-関係
kok-us-ai- kan-kei
面白-くな-さ-そうだ
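(If you want to reproduce splits like these locally rather than in the playground, something along these lines should work, assuming you have accepted the Gemma license on Hugging Face; "google/gemma-2b" is just an example checkpoint and may not match the playground's exact model, so the splits can differ slightly:)

    # Sketch: inspect a model's tokenization of the words above.
    from transformers import AutoTokenizer

    # "google/gemma-2b" is an example id; the checkpoint is gated on Hugging Face.
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

    words = ["unbelievably", "decentralization", "переписываться", "食べさせられたくなかった"]
    for w in words:
        print(w, "->", tokenizer.tokenize(w))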
Further, the training data likely to be relevant to this type of query probably isolates the individual morphemes while discussing a bunch of words that use them, so it is a much shorter path for the AI to associate these close-but-not-quite-morpheme tokens with the actual sequence of tokens that corresponds to what we think of as a morpheme.
[0] Morpheme-level tokenization is itself a non-trivial problem. However, it was solved reasonably well long before the current generation of AI.
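For Japanese, for example, classical morphological analyzers such as MeCab (which long predate LLMs) already do this well. A minimal sketch, assuming the fugashi wrapper and the unidic-lite dictionary are installed (pip install fugashi unidic-lite):

    # Sketch: morpheme-level segmentation of Japanese with a pre-LLM tool.
    from fugashi import Tagger

    tagger = Tagger()  # picks up the UniDic dictionary installed alongside it
    text = "食べさせられたくなかった"
    print([word.surface for word in tagger(text)])  # one surface form per morpheme, per UniDic's analysis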
Tokenizers are typically optimized for efficiency, not morpheme separation. Even in the examples above, the splits are not morphemes; proper morpheme separation would be un-believ-ably and дост-о-при-меч-а-тельн-ость.
Regardless of this, Gemini is still one of the best models when it comes to Slavic word formation and manipulation: it can express novel (non-existent) words quite well and does not seem to be confused by the incorrect separation. This seems to be the result of extensive multilingual training, because, e.g., GPT models (other than the discontinued 4.5-preview) and many Chinese models have issues with basic coherence in languages that rely heavily on word formation, despite using similar tokenizers.
Thanks for the explanation. Very interesting.
I notice that that particular tokenization deviates from the morphemic divisions in several cases, including ‘dec-entral-ization’, ‘食べ-させ-られた-くな-かった’, and ‘面白-くな-さ-そうだ.’ ‘dec’ and ‘entral’ are not morphemes, nor is ‘くな.’
Thanks for the explanation and for the tokenizer playground link!
inf-ucking-credible