Comment by psionides
3 days ago
Hmm… Yeah, I guess each language does it kind of differently. At least Ruby handles it similarly to Python.
> I’m confused here. You established indexing is by UTF-8 code unit, then said it’s because of JavaScript which… doesn’t do UTF-8 so well?
It's not that UTF-8 is because of JavaScript, it's that indexing by bytes instead of UTF-8 code units is because of JavaScript. To use UTF-8 in JavaScript, you can use TextEncoder/TextDecoder: TextEncoder.encode() returns the string as a Uint8Array, which is indexed by bytes.
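Roughly like this (a quick sketch of my own, nothing Bluesky-specific):

```js
// TextEncoder always produces UTF-8 and returns a Uint8Array, so every
// index into the result is a byte offset; TextDecoder goes the other way.
const bytes = new TextEncoder().encode("Cześć");
console.log(bytes.length);                    // 7, not 5: "ś" and "ć" are 2 bytes each
console.log(new TextDecoder().decode(bytes)); // "Cześć"
```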
So if you have a string "Cześć, #Bluesky!" and you want to mark the "#Bluesky" part with a hashtag link facet, the index range is 9...17 (bytes), and not 7...15 (scalars).
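Something along these lines (my own rough sketch, not the official Bluesky SDK; it assumes the hashtag occurs exactly once in the post text):

```js
// Compute the facet's byteStart/byteEnd by UTF-8-encoding the text
// that comes before the tag, plus the tag itself.
function hashtagFacetRange(text, hashtag) {
  const encoder = new TextEncoder();
  const charStart = text.indexOf(hashtag);   // UTF-16 index, only used for slicing
  if (charStart === -1) return null;
  const byteStart = encoder.encode(text.slice(0, charStart)).length;
  const byteEnd = byteStart + encoder.encode(hashtag).length;
  return { byteStart, byteEnd };
}

console.log(hashtagFacetRange("Cześć, #Bluesky!", "#Bluesky"));
// => { byteStart: 9, byteEnd: 17 }
```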
> indexing by bytes instead of UTF-8 code units
When the encoding is UTF-8 (which it is here), the code unit is the byte.
They called the fields byteStart and byteEnd, but more technically precise (no more and no less accurate, just more precise) labels would be utf8CodeUnitStart and utf8CodeUnitEnd.
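For concreteness (my own quick example):

```js
// "ś" is one scalar and one UTF-16 code unit, but two UTF-8 code units,
// which is the same thing as two bytes.
const s = "ś";
console.log(s.length);                           // 1 - UTF-16 code units
console.log([...s].length);                      // 1 - scalars (code points)
console.log(new TextEncoder().encode(s).length); // 2 - UTF-8 code units = bytes
```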
Sorry, I keep mixing these up - I meant bytes instead of scalars, and scalars are what I think would be more natural to iterate over in most languages (at least the ones I use).
OK, I checked, and Ruby does seem to use scalars. Well, unless you mess with encodings - then it gets messy. So it's probably both better and worse than Python 3.
You may not have seen this interesting article before: https://hsivonen.fi/string-length/. I agree with its assessment that scalars are really pretty useless as a measure, and Python and Ruby are foolish to have chased it at such expense.
But seriously, I can’t think of any other popular languages that count by scalars or code points; it’s definitely not most languages, it’s a minority, and all of a very specific sort. “Most” encompasses well-formed UTF-8 (e.g. Rust), UTF-8 recommended but not actually enforced (e.g. Go), potentially ill-formed UTF-16 (e.g. JavaScript, Java, .NET), and total mess (e.g. C, C++).
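To make the article’s point concrete (my own quick check in JavaScript, since that’s what started this), here is its facepalm emoji example, spelled out as escapes so the counts are easy to verify:

```js
// One grapheme cluster, three very different "lengths".
const s = "\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F"; // 🤦🏼‍♂️
console.log(s.length);                            // 7  - UTF-16 code units
console.log([...s].length);                       // 5  - scalars (code points)
console.log(new TextEncoder().encode(s).length);  // 17 - UTF-8 code units / bytes
```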