Comment by capitainenemo

1 day ago

Article claims Python 3 uses UTF-8 internally. It doesn't:

https://stackoverflow.com/questions/1838170/ "In Python 3.3 and above, the internal representation of the string will depend on the string, and can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393."
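A rough way to see the PEP 393 behaviour from the quote, assuming CPython 3.3 or later (other implementations may store strings differently). The helper name `bytes_per_char` is made up for this sketch; it compares `sys.getsizeof` for two lengths of the same repeated character so the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch: str, n: int = 100) -> int:
    """Estimate per-code-point storage by differencing two string sizes."""
    short = ch * 1
    long = ch * (n + 1)
    # The object header is the same for both, so only the payload differs.
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),      # e with acute accent
    ("BMP", "\u20ac"),          # euro sign, needs 2 bytes per code point
    ("astral", "\U0001F600"),   # emoji, needs 4 bytes per code point
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython this should report 1, 1, 2 and 4 bytes per character, i.e. the widest code point in the string picks the representation for the whole string.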

Article also says PHP has immutable strings. They are mutable, although often copied.

Article also claims the majority of popular languages have immutable strings. But besides the mutable-string languages it lists, there are also PHP and Rust (and C, though they did say C++; and obviously Ruby, since that's the subject of the article).

I'm also a bit surprised by the last sentence. "However, if you do measure a negative performance impact, there is no doubt you are measuring incorrectly." There must surely be programs doing a lot of string building or in-place modification that would benefit from non-frozen.

> There must surely be programs doing a lot of string building or in-place modification that would benefit from non-frozen.

The point is that the magic comment (or the --enable-frozen-string-literal option) only applies to literals. If you have code that iteratively appends to a mutable string, flipping that switch doesn't change that. It just means you'll have to create the mutable string explicitly instead of getting one from a bare literal. So it doesn't change the performance profile.

Python strings aren’t even proper Unicode strings. They’re sequences of code points rather than scalar values, meaning they can contain surrogates. This is incompatible with basically everything: UTF-* as used by sensible things, and unvalidated UTF-16 as used in the likes of JavaScript, Windows wide strings and Qt.
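A small sketch of what "code points rather than scalar values" means in practice. The helper `is_utf8_encodable` and the variable names are just for illustration; the behaviour shown is that of Python 3's `str` with the standard codecs.

```python
def is_utf8_encodable(s: str) -> bool:
    """True if the string contains only Unicode scalar values."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

lone = "\ud800"            # a lone high surrogate is a legal str element
pair = "\ud83d\ude00"      # two surrogate code points, not one emoji
scalar = "\U0001F600"      # the emoji as a single scalar value

# Round-tripping through UTF-16 with surrogatepass joins the pair.
repaired = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

print(len(lone), is_utf8_encodable(lone))
print(len(pair), pair == scalar)
print(repaired == scalar)
```

So a `str` can hold values that strict UTF-8 refuses to encode, and a surrogate pair written as two escapes is not the same string as the character it would represent in UTF-16.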

  • But isn't 'surrogateescape' supposed to address this? (no expert)

    https://vstinner.github.io/pep-383.html

    • surrogateescape is something else altogether. It’s a hack to allow non-Unicode file names/environment variables/command line arguments in an otherwise-Unicode environment, by smuggling them through a part of the surrogate range (0x80 to 0xFF → U+DC80 to U+DCFF) which otherwise can’t occur (since it’s invalid Unicode). It’s a cunning hack that makes a lot of sense: they used a design error in one place (Python string representation) to cancel out a design error in another place (POSIX being late to the game on Unicode)!
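For illustration, the smuggling described above looks roughly like this. The file name bytes are a made-up example (Latin-1 encoded, so not valid UTF-8); the `surrogateescape` error handler itself is the standard one from PEP 383.

```python
# Hypothetical file name encoded as Latin-1: 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 is mapped to the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# A strict encode fails, because the string is not valid Unicode.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(name), round_tripped == raw, strict_ok)
```

This is also what `os.fsdecode` and `os.fsencode` rely on for undecodable file names on POSIX systems.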

> can be any of latin-1, UCS-2 or UCS-4, as described in PEP 393

My bad, I haven't seriously used Python for over 15 years now, so I stand corrected (and will clarify the post).

My main point stands though: Python strings have an internal representation, but it's not exposed to the user the way it is with Ruby strings.

> Article also says PHP has immutable strings. They are mutable, although often copied.

Same. Thank you for the correction, I'll update the post.

  • Cool, although I feel that if on one side you have Java, JavaScript, Python, Go and on the other Perl, PHP, C/C++, Ruby, Rust, it's hard to claim an overwhelming majority in either direction.

    Also someone below claims python byte arrays can be considered mutable strings, although I have no idea of the stringy ergonomics of that and whether it would be convenient to do - I try to avoid python too.

    • ... and honestly, since Java has both StringBuffer and String, I feel it's really in the "has mutable" camp too
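On the Python byte array suggestion above, a quick sketch of the ergonomics, assuming the text is handled as UTF-8 bytes. The variable names are arbitrary.

```python
buf = bytearray(b"hello")

buf += b" world"            # in-place append
buf[0:1] = b"J"             # in-place slice replacement
buf.extend(" caf\u00e9".encode("utf-8"))  # text must be encoded first

first = buf[0]              # indexing yields an int, not a 1-char string
upper = buf.upper()         # str-like methods exist but return a new bytearray

text = buf.decode("utf-8")  # convert to an immutable str when done
print(text, first, bytes(upper[:5]))
```

It works as a mutable buffer, but everything is in bytes: indexing gives integers, non-ASCII characters span several elements, and you encode on the way in and decode on the way out.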

In C, C++, and Rust, the question of "are strings in this language mutable or immutable?" isn't applicable, because those languages have transitive mutability qualifiers. So they only need a single string type, and whether you can mutate it or not depends on context. (C++ and Rust have multiple string types, but the differences among them aren't about mutability.) In languages without this feature, a given value is either always mutable or never mutable, and so it's necessary to pick one or the other for string literals.