Comment by nostrademons

1 day ago

PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally. It's used whenever both size and max code point are known, which is most cases where the string comes from a literal or a bytes.decode() call. It cut memory usage in typical Django applications by 2/3 when it was implemented.

https://peps.python.org/pep-0393/

I would probably use UTF-8 and just give up on O(1) string indexing if I were implementing a new string type. It's very rare to need indexing at arbitrary large offsets into a string. Most use cases involve chopping off a small prefix (e.g. "hex_digits[2:]") or suffix (e.g. "filename[-3:]"), and you can find those boundaries with a linear scan at minimal CPU cost. Or they're part of library methods where you want your own custom traversal anyway, e.g. .find(substr) can just do Boyer-Moore over bytes, and .split(delim) probably wants a first pass that identifies delimiter positions and then uses them to allocate all the results at once.
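A minimal sketch of what "just linear scan" could look like for the prefix and suffix cases, assuming the string is held as valid UTF-8 bytes. The helper names (`is_continuation`, `drop_prefix`, `take_suffix`) are made up for illustration and are not from any library:

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def drop_prefix(data: bytes, n: int) -> bytes:
    """Return data without its first n code points, like s[n:]."""
    i = 0
    while n > 0 and i < len(data):
        i += 1  # step over the lead byte
        while i < len(data) and is_continuation(data[i]):
            i += 1  # step over the rest of this code point
        n -= 1
    return data[i:]


def take_suffix(data: bytes, n: int) -> bytes:
    """Return the last n code points of data, like s[-n:] for n >= 1."""
    i = len(data)
    while n > 0 and i > 0:
        i -= 1
        while i > 0 and is_continuation(data[i]):
            i -= 1  # walk back to the lead byte of this code point
        n -= 1
    return data[i:]


hex_digits = "0xçafé".encode("utf-8")
filename = "résumé.txt".encode("utf-8")
print(drop_prefix(hex_digits, 2).decode("utf-8"))
print(take_suffix(filename, 3).decode("utf-8"))
```

The cost is proportional to the size of the prefix or suffix, not the length of the string. Substring search needs no decoding at all: because UTF-8 is self-synchronizing, a plain byte-wise search for a valid UTF-8 needle cannot match in the middle of a character.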

8 comments


barrkel 19 hours ago

You usually want O(1) indexing when you're implementing views over a large string. For example, a string containing a possibly multi-megabyte text file and you want to avoid copying out of it, and work with slices where possible. Anything from editors to parsing.

I agree though that usually you only need iteration, but string APIs need to change to return some kind of token that encapsulates both logical and physical index. And you probably want to be able to compute with those - subtract to get length and so on.

ori_b 18 hours ago
You don't particularly want indexing for that, but cursors. A byte offset (wrapped in an opaque type) is sufficient for that need.
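Roughly what that could look like, as a sketch only. `Cursor` and `Utf8Text` are hypothetical names, and the private `_offset` field stands in for "opaque"; Python cannot actually enforce that:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, order=True)
class Cursor:
    """Opaque position inside a Utf8Text; callers should not read _offset."""
    _offset: int


class Utf8Text:
    def __init__(self, text: str) -> None:
        self._data = text.encode("utf-8")

    def start(self) -> Cursor:
        return Cursor(0)

    def end(self) -> Cursor:
        return Cursor(len(self._data))

    def advance(self, cur: Cursor) -> Cursor:
        """Move one code point forward; stays put at the end."""
        i = cur._offset
        if i >= len(self._data):
            return cur
        i += 1
        # Skip UTF-8 continuation bytes (10xxxxxx).
        while i < len(self._data) and (self._data[i] & 0xC0) == 0x80:
            i += 1
        return Cursor(i)

    def find(self, needle: str, start: Optional[Cursor] = None) -> Optional[Cursor]:
        """Byte-wise search; returns a cursor at the match, or None."""
        begin = start._offset if start is not None else 0
        idx = self._data.find(needle.encode("utf-8"), begin)
        return None if idx < 0 else Cursor(idx)

    def slice(self, a: Cursor, b: Cursor) -> memoryview:
        """Zero-copy view of the bytes between two cursors."""
        return memoryview(self._data)[a._offset:b._offset]


text = Utf8Text("naïve café")
hit = text.find("café")
print(bytes(text.slice(hit, text.end())).decode("utf-8"))
```

Cursors only ever come from the text's own methods, so they always land on a code point boundary, and slicing between two of them is O(1) with no copy.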
bjoli 8 hours ago

You could add a LUT for decently fast indexing as well. I believe Java does that.
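One way such a lookup table could work, sketched under the assumption that it records the byte offset of every `stride`-th code point. `IndexedUtf8` and the stride of 64 are illustrative choices, not a description of what any particular runtime does:

```python
class IndexedUtf8:
    """UTF-8 bytes plus a sparse table of code point index -> byte offset."""

    def __init__(self, text: str, stride: int = 64) -> None:
        self._data = text.encode("utf-8")
        self._stride = stride
        self._checkpoints = []  # byte offset of code points 0, stride, 2*stride, ...
        count = 0
        for offset, byte in enumerate(self._data):
            if (byte & 0xC0) != 0x80:  # lead byte: a new code point starts here
                if count % stride == 0:
                    self._checkpoints.append(offset)
                count += 1
        self._length = count

    def __len__(self) -> int:
        return self._length

    def byte_offset(self, index: int) -> int:
        """Byte offset of code point `index`, scanning at most stride - 1 code points."""
        if not 0 <= index < self._length:
            raise IndexError(index)
        offset = self._checkpoints[index // self._stride]
        for _ in range(index % self._stride):
            offset += 1
            while (self._data[offset] & 0xC0) == 0x80:
                offset += 1
        return offset

    def __getitem__(self, index: int) -> str:
        start = self.byte_offset(index)
        end = start + 1
        while end < len(self._data) and (self._data[end] & 0xC0) == 0x80:
            end += 1
        return self._data[start:end].decode("utf-8")


s = IndexedUtf8("aé€😀" * 50, stride=8)
print(len(s), s[3], s[198])
```

Indexing becomes a table lookup plus a bounded scan, and the table costs one integer per `stride` code points, so the stride trades memory against lookup time.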
naniwaduni 17 hours ago

You really just very rarely want codepoint indexing. A byte index is totally fine for view slices.
nostrademons 18 hours ago

Sure, but for something like that whatever constructs the view can use an opaque index type like Animats suggested, which under the hood is probably a byte index. The slice itself is kinda the opaque index, and then it can just have privileged access to some kind of unsafe_byteIndex accessor.
There are a variety of reasons why unsafe byte indexing is needed anyway (zero-copy?), it just shouldn’t be the default tool that application programmers reach for.
MrBuddyCasino 11 hours ago

If you have multi-MB strings in an editor, that’s the problem right there. People use ropes instead of strings for a reason.

masklinn 12 hours ago

> PyCompactUnicodeObject was introduced with Python 3.3, and uses UTF-8 internally.

UTF-8 is used for C-level interactions. If that were the only representation, there would be no need to know the highest code point.

For Python semantics it uses one of ASCII, ISO-8859-1 (Latin-1), UCS-2, or UCS-4, depending on the highest code point in the string.
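The selection rule, as I understand it from PEP 393, can be modeled in a few lines. This is a sketch of the rule, not CPython's code, and `pep393_width` is a made-up name:

```python
def pep393_width(s: str) -> int:
    """Bytes per code point that the highest code point in s would require."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1  # ASCII or Latin-1: one byte per code point
    if highest < 0x10000:
        return 2  # fits in UCS-2
    return 4      # needs UCS-4


for sample in ("hello", "café", "€uro", "hi 😀"):
    print(repr(sample), pep393_width(sample))
```

The width applies to the whole string, so a single high code point widens every character in it.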

nostrademons 4 hours ago

Interesting. You're right. Code pointer:
https://github.com/python/cpython/blob/main/Objects/unicodeo...
Also implies that Animats is correct that including an emoji in a Python string can bloat the memory consumption by a factor of 4.
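This is easy to check empirically on CPython with sys.getsizeof. The sketch below measures marginal bytes per character by differencing two string sizes, so the fixed object header cancels out; `bytes_per_char` is a hypothetical helper, and the numbers are specific to CPython 3.3+:

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Marginal storage per character for a string made only of ch."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


for ch in ("a", "é", "€", "😀"):
    print(repr(ch), bytes_per_char(ch))

# One emoji appended to an otherwise ASCII string widens every character.
plain = "a" * 1001
with_emoji = "a" * 1000 + "😀"
print(sys.getsizeof(plain), sys.getsizeof(with_emoji))
```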