Comment by Dylan16807

12 hours ago

Python strings are normally Unicode, but they are augmented with this mechanism to smuggle other data as invalid surrogates.
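For anyone who hasn't seen it in action, here is a minimal sketch of that mechanism, assuming it refers to the `surrogateescape` error handler (PEP 383). The sample bytes are made up for illustration.

```python
# Bytes that are not valid UTF-8 (0xE9 and 0xFF cannot be decoded here).
raw = b"caf\xe9 \xff"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
smuggled = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the string, because it contains lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The str round-trips arbitrary bytes, but it is no longer something a strict UTF-8 encoder will accept, which is the "superset of Unicode" trade-off.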

Rust strings are normally Unicode, but Windows OsStrings are augmented with a similar mechanism to smuggle other data as invalid surrogates. (Rust smuggles 16-bit values as WTF-8, but it could do 8-bit smuggling instead with barely any change.)

Rust chooses not to make that the main string type, but it could. Any system based on Unicode can implement a hack like this to become a superset of Unicode.

Why do you think it can't? Rust would have to admit that the type is no longer exactly Unicode, just like Python did. That's the opposite of a disqualifier; it's a pattern to follow.

Maybe you're unaware that [generalized] UTF-8 has a way to encode lone surrogates? They encode into 3 bytes just fine, either ED A_ __ or ED B_ __
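A rough sketch of that byte layout, in Python since it is easy to check there. `encode_3byte` is a hypothetical helper written for this example; it just applies the ordinary 3-byte UTF-8 bit pattern without the surrogate check. Python's `surrogatepass` error handler is used as the reference for the same generalized encoding.

```python
def encode_3byte(cp: int) -> bytes:
    """Apply the plain 3-byte UTF-8 layout to U+0800..U+FFFF, with no surrogate check."""
    assert 0x0800 <= cp <= 0xFFFF
    return bytes([
        0xE0 | (cp >> 12),           # 1110xxxx
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx
        0x80 | (cp & 0x3F),          # 10xxxxxx
    ])

# High surrogates (U+D800..U+DBFF) land on ED A_ __ and ED A_..AF __.
high = encode_3byte(0xD800)
# Low surrogates (U+DC00..U+DFFF) land on ED B0..BF __.
low = encode_3byte(0xDC80)

# Python's surrogatepass handler produces and accepts the same bytes.
builtin_high = "\ud800".encode("utf-8", errors="surrogatepass")
builtin_low = "\udc80".encode("utf-8", errors="surrogatepass")
decoded = low.decode("utf-8", errors="surrogatepass")
```

So the only thing stopping a strict UTF-8 encoder from handling these is a deliberate range check, not a gap in the encoding.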