Comment by numpad0

1 year ago

I mean, it just felt to me that current LLMs must architecturally favor a fixed-length "ideome", like a phoneme but for meaning, having been conceived under the influence of research in CJK.

And with the architecture being based on such idea-like elements, I just casually thought there could be limits to how far it can be pushed toward perfecting English, and that some radical change - not simply dropping tokenization but something more fundamental - has to take place at some point.

I don't think it's hard for the LLM to treat a sequence of two tokens as a semantically meaningful unit, though. They have to handle much more complicated dependencies to parse higher-level syntactic structures of the language.