Comment by gjvc

5 days ago

yes. it was not a massive shift. it was barely worth the effort.

The Python devs didn’t want to make huge changes because they were worried Python 3 would end up taking forever like Perl 6. Instead they went to the other extreme and broke everyone’s code for trivial reasons and minimal benefit, which meant no-one wanted to upgrade.

Even the main driver for Python 3, the bytes-Unicode split, has unfortunately turned out to be sub-optimal. Python essentially bet on UTF-32 (with space-saving optimisations), while everyone else has chosen UTF-8.

  • > Python essentially bet on UTF-32 (with space-saving optimisations)

    How so? Python 3 strings are Unicode, and all the encoding/decoding functions default to UTF-8. In practice this means all the Python I write is UTF-8-compatible Unicode and I don't ever have to think about it.
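
    A quick sketch of what that looks like in practice (the example string here is just illustrative):

    ```python
    # str.encode() and bytes.decode() both default to UTF-8, so the common
    # round trip needs no explicit encoding argument.
    s = "naïve café 日本語"
    data = s.encode()                  # same as s.encode("utf-8")
    assert data == s.encode("utf-8")
    assert data.decode() == s          # decode() defaults to UTF-8 as well
    ```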

    • UTF-32 allows for constant-time character access, which means that mystr[i] isn't O(n). Most other languages can only provide constant-time access to code units, not code points (sketch below).

      1 reply →
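
      A small illustration of the difference, using a made-up sample string:

      ```python
      # Python indexes strings by code point in O(1); indexing the UTF-8
      # encoding gives you code units (bytes) instead.
      s = "aé日🎉"                # 1-, 2-, 3- and 4-byte characters in UTF-8
      assert len(s) == 4          # length counted in code points
      assert s[3] == "🎉"         # constant-time access to the 4th code point
      b = s.encode("utf-8")
      assert len(b) == 10         # the same text is 10 UTF-8 code units
      ```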

    • Internally Python holds a string as an array of uint32. A UTF-8 representation is created on demand from it (and cached). So pansa2 is basically correct [1].

      IMO, while this may not be optimal, it's far better than the more arcane choices made by other systems. For example, for reasons only Microsoft can understand, Windows is stuck with UTF-16.

      [1] Actually it's more intelligent than that: Python picks the narrowest width that fits the string (PEP 393), so it uses uint8 instead of uint32 for ASCII strings. The sketch below shows the effect.

      5 replies →
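
      A rough way to see the width selection, assuming CPython (exact sizes vary by version and platform):

      ```python
      import sys

      # sys.getsizeof() reflects the per-code-point width picked for each string.
      ascii_s = "a" * 100     # 1 byte per code point
      latin_s = "é" * 100     # still 1 byte per code point (Latin-1 range)
      bmp_s   = "日" * 100     # 2 bytes per code point
      wide_s  = "🎉" * 100    # 4 bytes per code point
      for s in (ascii_s, latin_s, bmp_s, wide_s):
          print(len(s), sys.getsizeof(s))
      ```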

    • > all the encoding/decoding functions default to utf-8

      Languages that use UTF-8 natively don't need those functions at all. And the ones in Python aren't trivial; see, for example, `surrogateescape` (sketched after this comment).

      As the sibling comment says, the only real benefit of all this encoding/decoding at the boundaries is that the fixed-width internal representation lets strings support constant-time indexing of code points, which isn't something that's commonly needed.

      1 reply →
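
      For reference, a minimal sketch of what `surrogateescape` does, with a made-up byte string:

      ```python
      raw = b"caf\xe9"                                      # not valid UTF-8
      text = raw.decode("utf-8", errors="surrogateescape")
      assert text == "caf\udce9"                            # bad byte becomes a lone surrogate
      assert text.encode("utf-8", errors="surrogateescape") == raw   # lossless round trip
      ```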

  • > Python essentially bet on UTF-32 (with space-saving optimisations), while everyone else has chosen UTF-8.

    It did nothing of the sort. UTF-8 is the default source file encoding and has been the target for many APIs. It likely would have been the default for all I/O stuff if we lived in a world where Windows had functioning Unicode in the terminal the whole time and didn't base all its internal APIs on UTF-16.

    I assume you’re referring to the internal representation of strings. Describing it as "UTF-32 with space-saving optimizations" misses the point, and is also a contradiction in terms. Yes, it is a system that uses the same number of bytes per character within a given string (and chooses that width according to the string’s contents). This makes random access possible; doing anything else would have broken historical expectations about string slicing.

    There are good arguments that one shouldn’t write code like that anyway, but it’s hard to identify anything "sub-optimal" about the result, except that strings like "I’m learning 日本語" use more memory than they might be able to get away with. (But there are other strings, like "ℍℯℓ℗", that can use a 2-byte width while the UTF-8 encoding would need 3 bytes per character.)
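
    A rough comparison of those two example strings, assuming CPython (reported sizes include object overhead and vary by version):

    ```python
    import sys

    mixed  = "I’m learning 日本語"   # widest char forces 2 bytes per code point internally
    math_s = "ℍℯℓ℗"                  # also 2 bytes per code point internally
    for s in (mixed, math_s):
        # code points, internal object size, UTF-8 encoded size
        print(len(s), sys.getsizeof(s), len(s.encode("utf-8")))
    # For the second string UTF-8 needs 3 bytes per character, so the 2-byte
    # internal form is the more compact of the two.
    ```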

  • Ironically, Perl 5 managed to do the bytes-Unicode split with a feature gate, not a giant major-version change.

this must be right, i'm getting downvoted

  • It's wrong. Python 3 eliminated mountains of annoying bugs that happened all over the code base because of the mixing of Unicode strings and byte strings. Python 2 was an absolute mess.
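
    A sketch of the kind of bug meant here, with made-up strings:

    ```python
    # Python 3 refuses to mix the two string types up front...
    try:
        "café" + b" au lait"
    except TypeError as e:
        print(e)        # can only concatenate str (not "bytes") to str
    # ...whereas Python 2 implicitly decoded the bytes as ASCII, so the same
    # kind of mix only blew up at runtime once non-ASCII data showed up.
    ```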