Comment by chrismorgan
2 days ago
surrogateescape is something else altogether. It’s a hack to allow non-Unicode file names/environment variables/command line arguments in an otherwise-Unicode environment, by smuggling them through a part of the surrogate range (0x80 to 0xFF → U+DC80 to U+DCFF) which otherwise can’t occur (since it’s invalid Unicode). It’s a cunning hack that makes a lot of sense: they used a design error in one place (Python string representation) to cancel out a design error in another place (POSIX being late to the game on Unicode)!
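For anyone who hasn't seen the mechanism in action, here is a minimal sketch in Python of the byte-to-surrogate mapping described above. The file name bytes are made up for illustration; the only API used is the standard `surrogateescape` error handler.

```python
# Bytes that are not valid UTF-8 (0xE9 is "é" in Latin-1). Hypothetical file name.
raw = b"caf\xe9.txt"

# Each undecodable byte 0x80..0xFF becomes a lone surrogate U+DC80..U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")
smuggled = [hex(ord(c)) for c in name if 0xDC80 <= ord(c) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses, because the str now holds a lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip only works because a Python str is allowed to hold a lone surrogate in the first place.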
It's not taking advantage of the weird way Python strings work. You can put that hack on top of any string format that converts back and forth with Unicode.
No, you can’t: it only works if your string type is something other than a sequence of Unicode scalar values. In Rust, for example, strings must be valid UTF-8, so this hack is not possible.
Python strings are normally Unicode, but they are augmented with this mechanism to smuggle other data as invalid surrogates.
Rust strings are normally Unicode, but Windows OsStrings are augmented with a similar mechanism to smuggle other data as invalid surrogates. (Rust smuggles 16-bit values as WTF-8, but it could do 8-bit smuggling instead with barely any change.)
Rust chooses not to make that the main string type, but it could. Any system based on Unicode can implement a hack like this to become a superset of Unicode.
Why do you think it can't? Rust would have to admit that the type is no longer exactly Unicode, just like Python did. That's the opposite of a disqualifier; it's a pattern to follow.
Maybe you're unaware that [generalized] UTF-8 has a way to encode lone surrogates? They encode into 3 bytes just fine, either ED A_ __ or ED B_ __
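As a rough illustration of those byte patterns, here is a sketch using Python's standard `surrogatepass` error handler, which applies the ordinary 3-byte UTF-8 layout to surrogate code points. The helper name `generalized_utf8` is made up for this example.

```python
def generalized_utf8(code_point: int) -> bytes:
    """Encode one code point with the UTF-8 bit layout, letting surrogates through."""
    return chr(code_point).encode("utf-8", errors="surrogatepass")

# High surrogates U+D800..U+DBFF land in ED A0..AF xx.
first_high = generalized_utf8(0xD800)
# Low surrogates U+DC00..U+DFFF land in ED B0..BF xx.
first_low = generalized_utf8(0xDC00)
last_low = generalized_utf8(0xDFFF)

# The same handler decodes the bytes back to the lone surrogate.
decoded = first_low.decode("utf-8", errors="surrogatepass")

# A strict UTF-8 decoder rejects these sequences.
try:
    first_low.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

So the bytes exist and round-trip; whether a given string type accepts them is a question of what that type chooses to validate.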