Comment by Animats
2 days ago
Right. That's one of the great features of UTF-8. You can move forwards and backwards through a UTF-8 string without having to start from the beginning.
Python has had troubles in this area. Because Python strings are indexable by character, CPython used wide characters. At one point you could pick 2-byte or 4-byte characters when building CPython. Then that switch was made automatic at run time. But it's still wide characters, not UTF-8. One emoji and your string size quadruples.
I would have been tempted to use UTF-8 internally. Indices into a string would be an opaque index type which behaved like an integer to the extent that you could add or subtract small integers, and that would move you through the string. If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated. That's an unusual case. All the standard operations, including regular expressions, can work on a UTF-8 representation with opaque index objects.
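Roughly like this, sketched in Python over a UTF-8 buffer (Utf8Cursor and everything in it is illustrative, not anything CPython actually has):

    # Illustrative only: a cursor wrapping a byte offset into UTF-8 bytes.
    # Adding small integers moves it by code points; no O(n) scan from the
    # start is ever needed because UTF-8 is self-synchronizing.
    class Utf8Cursor:
        def __init__(self, buf: bytes, byte_pos: int = 0):
            self.buf = buf            # UTF-8 encoded text
            self.byte_pos = byte_pos  # the "opaque" part: a byte offset

        def __iadd__(self, n: int):
            for _ in range(n):
                self.byte_pos += 1
                # skip continuation bytes (0b10xxxxxx)
                while (self.byte_pos < len(self.buf)
                       and self.buf[self.byte_pos] & 0xC0 == 0x80):
                    self.byte_pos += 1
            return self

        def char(self) -> str:
            end = self.byte_pos + 1
            while end < len(self.buf) and self.buf[end] & 0xC0 == 0x80:
                end += 1
            return self.buf[self.byte_pos:end].decode("utf-8")

    c = Utf8Cursor("naïve 🎉".encode("utf-8"))
    c += 3
    print(c.char())  # 'v' -- moved 3 code points, not 3 bytes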
PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally. It's used whenever both size and max code point are known, which is most cases where it comes from a literal or bytes.decode() call. It cut memory usage in typical Django applications by 2/3 when it was introduced.
https://peps.python.org/pep-0393/
I would probably use UTF-8 and just give up on O(1) string indexing if I were implementing a new string type. It's very rare to require arbitrary large-number indexing into strings. Most use cases involve chopping off a small prefix (e.g. "hex_digits[2:]") or suffix (e.g. "filename[-3:]"), and you can easily just linear-search these with minimal CPU penalty. Or they're part of library methods where you want to have your own custom traversals: e.g. .find(substr) can just do Boyer-Moore over bytes, and .split(delim) probably wants to do a first pass that identifies delimiter positions and then use that to allocate all the results at once.
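You can already see this behaviour by working on the encoded bytes in Python (a demo of the byte-level operations, not a proposal for the str API):

    s = "héllo wörld.txt"
    b = s.encode("utf-8")

    # Substring search over raw UTF-8 bytes: a valid needle can only match
    # at code point boundaries, so no decoding is needed.
    print(b.find("wörld".encode("utf-8")))  # 7 -- a byte offset ('é' took 2 bytes)

    # Chopping a known suffix is just a byte slice followed by a decode.
    print(b[:-4].decode("utf-8"))           # 'héllo wörld'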
You usually want O(1) indexing when you're implementing views over a large string: for example, a string containing a possibly multi-megabyte text file, where you want to avoid copying out of it and work with slices where possible. Anything from editors to parsing.
I agree though that usually you only need iteration, but string APIs need to change to return some kind of token that encapsulates both logical and physical index. And you probably want to be able to compute with those - subtract to get length and so on.
You don't particularly want indexing for that, but cursors. A byte offset (wrapped in an opaque type) is sufficient for that need.
You really only very rarely want codepoint indexing. A byte index is totally fine for view slices.
Sure, but for something like that whatever constructs the view can use an opaque index type like Animats suggested, which under the hood is probably a byte index. The slice itself is kinda the opaque index, and then it can just have privileged access to some kind of unsafe_byteIndex accessor.
There are a variety of reasons why unsafe byte indexing is needed anyway (zero-copy?), it just shouldn’t be the default tool that application programmers reach for.
If you have multi-MB strings in an editor, that’s the problem right there. People use ropes instead of strings for a reason.
> PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally.
UTF-8 is used for C-level interactions; if that were the only use, there would be no need to know the highest code point.
For Python semantics it uses one of ASCII, ISO-8859-1, UCS-2, or UCS-4.
Interesting. You're right. Code pointer:
https://github.com/python/cpython/blob/main/Objects/unicodeo...
Also implies that Animats is correct that including an emoji in a Python string can bloat the memory consumption by a factor of 4.
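Easy to confirm from the REPL (exact sizes vary by CPython version, but the roughly 4x jump shows up):

    import sys

    ascii_only = "a" * 1000
    with_emoji = ascii_only + "🎉"
    print(sys.getsizeof(ascii_only))  # ~1 byte per character (compact ASCII)
    print(sys.getsizeof(with_emoji))  # ~4 bytes per character (UCS-4 storage)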
This is Python; finding new ways to subscript into things directly is a graduate student’s favorite pastime!
In all seriousness I think that encoding-independent constant-time substring extraction has been meaningful in letting researchers outside the U.S. prototype, especially in NLP, without worrying about their abstractions around “a 5 character subslice” being more complicated than that. Memory is a tradeoff, but a reasonably predictable one.
>without worrying about their abstractions around “a 5 character subslice” being more complicated than that
Combining characters still exist.
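A quick Python illustration of why "a 5 character subslice" is slippery even with O(1) code point indexing:

    import unicodedata

    s = "cafe\u0301"          # 'e' + COMBINING ACUTE ACCENT, displays as 'café'
    print(len(s))             # 5 code points for 4 user-perceived characters
    print(s[:4])              # 'cafe' -- the slice silently drops the accent
    print(len(unicodedata.normalize("NFC", s)))  # 4 after composing to 'é'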
Indexing into a Unicode string is a highly unusual operation that is rarely needed. A string is Unicode because it was provided by the user or is a localized user-facing string. You don't generally need indices.
Programmer strings (aka byte strings) do need indexing operations. But such strings usually do not need Unicode.
They can happen to _be_ Unicode. Composition operations (for fully terminated Unicode strings) should work, but require eventual normalization.
That's the other benefit of being able to resume UTF-8 strings midway: even combining broken strings still results in all the good characters being present.
Substring operations are more dicey; those should be operating with known strings. In pathological cases they might operate against portions of Unicode bits... but that's as silly as using raw pointers and directly mangling the bytes without any protection or design plans.
Your solution is basically what Swift does. Plus they do the same with extended grapheme clusters (what a human would consider distinct characters mostly), and that’s the default character type instead of Unicode code point. Easily the best Unicode string support of any programming language.
Variable width encodings like UTF-8 and UTF-16 cannot be indexed in O(1), only in O(N). But this is not really a problem! Instead of indexing strings we need to slice them, and generally we read them forwards, so if slices (and slices of slices) are cheap, then you can parse textual data without a problem. Basically just keep the indices small and there's no problem.
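Python's memoryview shows what cheap slices-of-slices look like over raw UTF-8 bytes (a sketch of the idea, not of the str type):

    data = ("some big UTF-8 document " * 50_000).encode("utf-8")

    view = memoryview(data)       # no copy
    rest = view[24:]              # slicing a view is O(1): offset + length
    rest2 = rest[24:]             # slices of slices share the same buffer
    assert rest2.obj is data      # still backed by the original bytes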
Or just use immutable strings and look-up tables, say every 32 characters, combined with cursors. This is going to make indexing fast enough for randomly jumping into a string and then using cursors.
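A sketch of that scheme over UTF-8 bytes (the names and the 32-code-point stride are just for illustration):

    STRIDE = 32  # record the byte offset of every 32nd code point

    def build_table(buf: bytes) -> list:
        table, count = [], 0
        for pos, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:      # a code point starts at this byte
                if count % STRIDE == 0:
                    table.append(pos)
                count += 1
        return table

    def char_at(buf: bytes, table: list, i: int) -> str:
        pos = table[i // STRIDE]         # O(1) jump close to the target
        for _ in range(i % STRIDE):      # then at most STRIDE-1 hops forward
            pos += 1
            while pos < len(buf) and buf[pos] & 0xC0 == 0x80:
                pos += 1
        end = pos + 1
        while end < len(buf) and buf[end] & 0xC0 == 0x80:
            end += 1
        return buf[pos:end].decode("utf-8")

    buf = ("x🎉" * 100).encode("utf-8")
    table = build_table(buf)
    print(char_at(buf, table, 33))       # '🎉' (code point 33)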
> If you actually converted the opaque type to a real integer, or tried to subscript the string directly, an index to the string would be generated.
What conversion rule do you want to use, though? You either reject some values outright, bump those up or down, or else start with a character index that requires an O(N) translation to a byte index.
"Unicode" aka "wide characters" is the dumbest engineering debacle of the century.
> ASCII and codepage encodings are legacy, let's standardize on another forwards-incompatible standard that will be obsolete in five years
> oh, and we also need to upgrade all our infrastructure for this obsolete-by-design standard because we're now keeping it forever
What about Unicode isn't forward compatible?
UCS-2 was an encoding mistake, but even it was pretty forward compatible
"Unicode" here means the OG Unicode that was supposed to fit all of past, current and future languages in exactly 16 bits.
Yes, it's a silly idea, but it's exactly the reason why Python, JavaScript and Java use the most braindead way of storing text known to man. (UCS-2)