Comment by ofou

2 months ago

This is one of the reasons I've been advocating for using UTF-8 directly as the tokenizer for a long time. The actual problem, IMHO, is the tokenizers themselves: they obscure the encoding/decoding process in order to gain some compression during training, fitting more data into the same budget and arguably giving the model somewhat better understanding from the start. Again, it's really just a lack of computing power.
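To be clear about what I mean by "UTF-8 as the tokenizer": every byte of the UTF-8 encoding is its own token ID (0-255), so there's no learned vocabulary hiding the encoding. A minimal Python sketch (nothing framework-specific, just to illustrate):

```python
# Byte-level "tokenization": each UTF-8 byte is a token ID in 0..255.
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(token_ids: list[int]) -> str:
    return bytes(token_ids).decode("utf-8", errors="replace")

ids = encode("hi 👩‍🚀")  # the astronaut emoji is itself a ZWJ sequence
print(ids)               # [104, 105, 32, 240, 159, 145, 169, 226, 128, 141, 240, 159, 154, 128]
print(decode(ids))       # round-trips back to "hi 👩‍🚀"
```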

If you use UTF-8 directly as the tokenizer, this problem becomes evident as soon as you fit the text into the context window. Plus, you can run multiple tests for this type of injection: no emoji should take more than about 40 bytes (10 code points * 4 bytes per code point in the worst case). This is an attack on tokenizers, not on UTF-8.
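A rough version of that test, assuming the third-party `regex` package for grapheme-cluster splitting (`\X`) and the 40-byte budget above:

```python
import regex  # third-party; its \X pattern matches extended grapheme clusters

MAX_EMOJI_BYTES = 40  # ~10 code points * 4 UTF-8 bytes each, worst case

def suspicious_clusters(text: str) -> list[tuple[str, int]]:
    """Return grapheme clusters whose UTF-8 encoding exceeds the byte budget."""
    flagged = []
    for cluster in regex.findall(r"\X", text):
        n_bytes = len(cluster.encode("utf-8"))
        if n_bytes > MAX_EMOJI_BYTES:
            flagged.append((cluster, n_bytes))
    return flagged

# An "emoji" stuffed with invisible ZWJs / variation selectors stands out immediately:
payload = "👍" + "\u200d\ufe0f" * 50
print(suspicious_clusters("normal text " + payload))
```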

Plus, Unicode publishes the full list of valid sequences containing the ZWJ character in emoji-zwj-sequences.txt.
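So checking a cluster against that list is mechanical. A sketch, assuming the published data-file format (first semicolon-separated field is space-separated hex code points, `#` starts a comment) and a local copy of the file:

```python
def load_zwj_sequences(path: str = "emoji-zwj-sequences.txt") -> set[str]:
    """Parse Unicode's emoji-zwj-sequences.txt into a set of valid ZWJ sequences."""
    valid = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()        # drop trailing comments
            if not line:
                continue
            codepoints = line.split(";", 1)[0].split()  # first field: hex code points
            valid.add("".join(chr(int(cp, 16)) for cp in codepoints))
    return valid

valid_zwj = load_zwj_sequences()
# "family: man, woman, boy" is an RGI ZWJ sequence, so this prints True
print("\U0001F468\u200D\U0001F469\u200D\U0001F466" in valid_zwj)
```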