Comment by chrismorgan

3 days ago

> unicode scalars, which most languages index strings in

Very few do. Of moderately popular languages, Python is the only one I can think of. Well, Python strings are actually sequences of code points rather than scalars, which is a huge mistake, but provided your strings came from valid Unicode that doesn’t matter.
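
To make the code point vs. scalar distinction concrete, here is a minimal sketch (the variable names are mine, purely for illustration). A lone surrogate such as U+D800 is a code point but not a scalar value; Python will hold it in a `str`, but it cannot be encoded as UTF-8:

```python
# A lone surrogate is a valid code point but NOT a Unicode scalar value.
lone = "\ud800"

# Python happily stores it as a one-element string.
length = len(lone)

# Encoding to UTF-8 fails, because UTF-8 can only represent scalar values.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(length, encodable)
```

As long as the string was decoded from valid Unicode in the first place, this case never comes up.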

Languages like Rust and Swift make it fairly easy to access your string by UTF-8 or by scalar.

Languages like Java and JavaScript index by UTF-16 code unit and make anything else at least moderately painful.
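
For a sense of how far the three counts can diverge, a rough sketch in Python (the sample string is my own choice; `utf-16-le` is used only so that no byte order mark gets counted):

```python
# One ASCII letter followed by one character outside the Basic Multilingual Plane.
s = "a\U0001F600"

# Python indexes by code point (equal to scalars for valid text).
code_points = len(s)

# UTF-8 code units are bytes: 1 for "a", 4 for U+1F600.
utf8_units = len(s.encode("utf-8"))

# UTF-16 code units are 2 bytes each: 1 for "a", a surrogate pair (2) for U+1F600.
# This is the count that Java's length() and JavaScript's .length report.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf8_units, utf16_units)
```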

> This is somewhat of an unfortunate tech debt thing as I understand, and it was made this way mostly because of JavaScript, which doesn’t work with UTF-8 natively. But this means you need to be extra careful with the indexes in most languages.

I’m confused here. You established indexing is by UTF-8 code unit, then said it’s because of JavaScript which… doesn’t do UTF-8 so well? If it were indexed by UTF-16 code unit, I’d agree, that’s bad tech debt; but that’s not the case here.

Bluesky made the decision to go all in on UTF-8 here <https://docs.bsky.app/docs/advanced-guides/post-richtext#tex...>. After all, the strings are being stored and transferred in UTF-8, and UTF-8 is increasingly the tool of choice, while UTF-16 is increasingly reviled: almost nothing new has chosen it for twenty years, and nothing major has chosen it for ten. It’s all strictly legacy. Hugely popular legacy, sure, but legacy.

Hmm… Yeah, I guess each language does it kinda differently. At least Ruby does it similarly to Python.

> I’m confused here. You established indexing is by UTF-8 code unit, then said it’s because of JavaScript which… doesn’t do UTF-8 so well?

It's not that UTF-8 is because of JavaScript, it's that indexing by bytes instead of UTF-8 code units is because of JavaScript. To use UTF-8 in JavaScript, you can use TextEncoder/TextDecoder, which return the string as a Uint8Array, which is indexed by bytes.

So if you have a string "Cześć, #Bluesky!" and you want to mark the "#Bluesky" part with a hashtag link facet, the index range is 9...17 (bytes), and not 7...15 (scalars).
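
A minimal sketch of that conversion in Python, assuming you already know the substring you want to mark (`utf8_range` is a hypothetical helper of mine, not part of any Bluesky SDK):

```python
def utf8_range(text: str, needle: str) -> tuple[int, int]:
    """Return the (start, end) UTF-8 byte offsets of the first match of needle in text."""
    # str.index works in code points, so convert by encoding the prefix.
    char_start = text.index(needle)
    byte_start = len(text[:char_start].encode("utf-8"))
    byte_end = byte_start + len(needle.encode("utf-8"))
    return byte_start, byte_end


text = "Cześć, #Bluesky!"
byte_start, byte_end = utf8_range(text, "#Bluesky")
print(byte_start, byte_end)

# Slicing the encoded bytes with those offsets gives back the hashtag.
print(text.encode("utf-8")[byte_start:byte_end].decode("utf-8"))
```

The two extra bytes come from "ś" and "ć", which each take two bytes in UTF-8.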

  • > indexing by bytes instead of UTF-8 code units

    When the encoding is UTF-8 (which it is here), the code unit is the byte.

    They called the fields byteStart and byteEnd, but more technically precise (no more or less accurate, but more precise) labels would be utf8CodeUnitStart and utf8CodeUnitEnd.

    • Sorry, I keep mixing these - bytes instead of scalars, which I think would be more natural to iterate over in most languages (at least the ones I use).
