Comment by cynicalsecurity

2 months ago

Ctrl+F "unicode normalisation" 0/0

I'm surprised no one has mentioned it yet. It's usually super easy, but people forget to add it all the time.

I haven’t tried it, but I’ve heard that at least some Unicode normalizers do not strip sequences of variation selectors.

  • By definition, a conforming normalization implementation must not strip variation selectors. The "normal" part of normalization means converting a string into either consistently decomposed or consistently composed Unicode, i.e. U+0055 + U+0308 vs. U+00DC. However, the decomposition mappings are also used (maybe more like abused) to convert certain "legacy" code points to non-legacy ones. No code point decomposes to a variation selector (and so variation selectors never compose into anything), so normalization must not alter or strip them.

    source: I've implemented Unicode normalization from scratch
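
    For anyone who wants to check this themselves, here is a minimal sketch using Python's standard unicodedata module. The helper names (survives_normalization, is_variation_selector, strip_variation_selectors) are made up for illustration, and the filter only covers the two main variation selector blocks (U+FE00..U+FE0F and U+E0100..U+E01EF), so treat it as a starting point rather than a complete sanitizer.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# The example from the comment: composed vs. decomposed U-with-diaeresis.
COMPOSED = "\u00DC"      # LATIN CAPITAL LETTER U WITH DIAERESIS
DECOMPOSED = "U\u0308"   # U + COMBINING DIAERESIS

# A string carrying two variation selectors: U+FE0F and U+E0100.
WITH_SELECTORS = "a\uFE0F\U000E0100b"


def survives_normalization(text: str) -> bool:
    """Return True if every normalization form leaves `text` unchanged."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


def is_variation_selector(ch: str) -> bool:
    """Hypothetical helper: covers VS1..VS16 and VS17..VS256 only."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text: str) -> str:
    """Explicit filter; normalization alone will not do this."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


if __name__ == "__main__":
    # Normalization maps between the composed and decomposed spellings.
    print(unicodedata.normalize("NFC", DECOMPOSED) == COMPOSED)
    print(unicodedata.normalize("NFD", COMPOSED) == DECOMPOSED)
    # The variation selectors pass through all four forms untouched.
    print(survives_normalization(WITH_SELECTORS))
    # Removing them takes a separate, explicit step.
    print(len(WITH_SELECTORS), len(strip_variation_selectors(WITH_SELECTORS)))
```

    So if the goal is to get rid of variation selectors, that has to be a separate filtering pass; normalizing first and hoping they disappear won't work.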