Comment by Dylan16807

19 hours ago

Python strings are normally Unicode, but they are augmented with this mechanism to smuggle other data as invalid surrogates.
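For anyone who hasn't seen the mechanism, here is a minimal sketch using the standard `surrogateescape` error handler. The filename bytes are a made-up example, not taken from any real system.

```python
# Hypothetical input: a Latin-1 encoded filename, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 is smuggled into the str as the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "caf\udce9.txt"

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")
assert round_tripped == raw

# A strict encode refuses, because the string is no longer pure Unicode.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The last step is the "admission" part: the string type holds more than Unicode, and a strict encoder says so.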

Rust strings are normally Unicode, but Windows OsStrings are augmented with a similar mechanism to smuggle other data as invalid surrogates. (Rust smuggles 16-bit values as WTF-8, but it could do 8-bit smuggling instead with barely any change.)

Rust chooses not to make that the main string type, but it could. Any system based on Unicode can implement a hack like this to become a superset of Unicode.

Why do you think it can't? Rust would have to admit that the type is no longer exactly Unicode, just like Python did. That's the opposite of a disqualifier; it's a pattern to follow.

Maybe you're unaware that [generalized] UTF-8 has a way to encode lone surrogates? They encode into 3 bytes just fine, either ED A_ __ or ED B_ __
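A quick way to check those byte patterns is Python's `surrogatepass` error handler, which encodes surrogate code points the way generalized UTF-8 does. The helper name `generalized_utf8` is only for illustration.

```python
def generalized_utf8(code_point: int) -> bytes:
    """Encode one code point, allowing surrogates (illustrative helper)."""
    return chr(code_point).encode("utf-8", errors="surrogatepass")

# High surrogates U+D800..U+DBFF land in ED A0..AF xx.
first_high = generalized_utf8(0xD800)
last_high = generalized_utf8(0xDBFF)

# Low surrogates U+DC00..U+DFFF land in ED B0..BF xx.
first_low = generalized_utf8(0xDC00)
last_low = generalized_utf8(0xDFFF)

for encoded in (first_high, last_high, first_low, last_low):
    print(encoded.hex(" ").upper())
```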

With regards to what rust team is admitting or not... https://wtf-8.codeberg.page/#the-wtf-8-encoding "It is identical to generalized UTF-8, with the additional well-formedness constraint that a surrogate pair byte sequence is ill-formed. It is a strict subset of generalized UTF-8 and a strict superset of UTF-8."
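To make the quoted constraint concrete, here is a rough sketch of a checker. It assumes that Python's `surrogatepass` decoding is a fair stand-in for generalized UTF-8, and `is_wtf8` is a hypothetical name, not something from the spec or from Rust.

```python
def is_wtf8(data: bytes) -> bool:
    """Rough check: generalized UTF-8 with no encoded surrogate pair."""
    try:
        # surrogatepass accepts 3-byte encodings of lone surrogates.
        text = data.decode("utf-8", errors="surrogatepass")
    except UnicodeDecodeError:
        return False  # not even generalized UTF-8
    # A high surrogate directly followed by a low surrogate is a pair,
    # which WTF-8 requires to be written as a single 4-byte sequence.
    for a, b in zip(text, text[1:]):
        if "\ud800" <= a <= "\udbff" and "\udc00" <= b <= "\udfff":
            return False
    return True

lone_high = b"\xed\xa0\x80"              # U+D800 alone
pair = b"\xed\xa0\x80\xed\xb0\x80"       # U+D800 then U+DC00
reversed_pair = b"\xed\xb0\x80\xed\xa0\x80"  # low then high, not a pair
```

Under these assumptions a lone surrogate passes, a high-then-low sequence is rejected, and a low-then-high sequence passes because it does not form a pair.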

https://wtf-8.codeberg.page/#intended-audience "WTF-8 is a hack intended to be used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons.

Any WTF-8 data must be converted to a Unicode encoding at the system’s boundary before being emitted. UTF-8 is recommended. WTF-8 must not be used to represent text in a file format or for transmission over the Internet."

They seem very transparent, and certainly are not proposing it as a general type.

  • > With regards to what rust team is admitting or not...

    That wasn't an accusation. They admit things just fine. It was a hypothetical about using it as the main string type.

    > and certainly are not proposing it as a general type.

    1. Python's hack isn't used in file formats or transmissions either, as far as I know. It's also internal-only.

    2. What they propose it for has zero relevance to my argument. It's merely proof that a hack like this can be added to ordinary Unicode representations. Python's goofy string representation is not enabling its surrogate hack.