Comment by lelanthran

1 month ago

> When you strlen() a UTF8 string, you don't get the length of the string, but instead the size in bytes.

Yes, and?

> What am I missing?

A use-case? Where, in your C code, is it reasonable to get the number of multibyte characters instead of the number of bytes in the string?

What are you going to use "number of unicode codepoints" for?

Any usage that amounts to "I need the number of unicode codepoints in this string" is coupled to handling the display of glyphs within your program, in which case you'd be using a library for that, because graphics is not part of C (or C++).

If you're simply printing it out, storing it, comparing it, searching it, etc, how would having the number of unicode codepoints help? What would it get used for?
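To make that concrete, here is a minimal sketch of storing and searching a UTF-8 string using only the byte length. The helper names `utf8_copy` and `utf8_find` are hypothetical, made up for this illustration, and the input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy a UTF-8 string. The allocation is sized in
   bytes, which is exactly what strlen() reports; no codepoint count
   is needed. The caller frees the result. */
char *utf8_copy(const char *s)
{
    assert(s != NULL);
    size_t nbytes = strlen(s);
    char *copy = malloc(nbytes + 1);
    if (copy != NULL)
        memcpy(copy, s, nbytes + 1);   /* +1 copies the terminating NUL */
    return copy;
}

/* Hypothetical helper: byte offset of needle in haystack, or -1 if it
   is absent. strstr() compares bytes; because UTF-8 is
   self-synchronizing, a valid needle can only match on a character
   boundary of a valid haystack. */
long utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}
```

Neither function ever asks how many codepoints the string holds. The offsets that come back are byte offsets, which is also what `memcpy`, `fwrite` and friends expect.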

4 comments

tialaramex 1 month ago

Indeed. If you have output considerations, then the number of Unicode codepoints isn't what you wanted anyway. You care about how many output glyphs there will be: a given codepoint might result in zero glyphs, it might modify an adjacent glyph, or it might be best rendered as multiple glyphs.
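A rough sketch of why the codepoint count doesn't answer the display question. The function name `utf8_codepoints` is hypothetical, and it assumes valid UTF-8 input; it counts codepoints by skipping continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count codepoints in valid UTF-8 by counting the
   bytes that are not continuation bytes (continuation bytes have the
   bit pattern 10xxxxxx). */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

With this, `"\xc3\xa9"` (precomposed é, U+00E9) counts as one codepoint, while `"e\xcc\x81"` (e followed by the combining acute accent U+0301) counts as two, even though a renderer would normally draw the same single glyph for both. `"\xe2\x80\x8d"` (zero width joiner, U+200D) is one codepoint that normally draws nothing on its own.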

If you're doing some sort of searching, you want a normalization step and probably a pre-processing step, but again you won't care about trying to count Unicode code points.

lionkor 1 month ago

For example splitting, cutting and inserting strings into each other

lelanthran 1 month ago

> For example splitting, cutting and inserting strings into each other

That's not going to work without a glyph-aware library anyway; even if you are working with actual codepoint arrays, you can't simply insert a codepoint into that array and have a correct unicode string as the result.

Same for splitting.

flohofwoe 1 month ago

That works just fine on UTF-8 encoded strings with C stdlib functions if your delimiters are 7-bit ASCII characters (/,.:; etc...).
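A minimal sketch of that, assuming valid UTF-8 input and delimiters that are all 7-bit ASCII. The function name `split_ascii` is hypothetical. It is safe because every byte of a multibyte UTF-8 sequence has its high bit set, so an ASCII delimiter byte can never occur in the middle of a character.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: split s in place on any ASCII byte in delims,
   storing up to max field pointers in out. Returns the number of
   fields stored. Empty fields are kept; fields beyond max are dropped. */
size_t split_ascii(char *s, const char *delims, char **out, size_t max)
{
    assert(s != NULL && delims != NULL && out != NULL);
    size_t n = 0;
    while (n < max) {
        out[n++] = s;
        s += strcspn(s, delims);   /* advance to the next delimiter or the NUL */
        if (*s == '\0')
            break;
        *s++ = '\0';               /* end this field, step past the delimiter */
    }
    return n;
}
```

Calling it on a writable buffer holding `"caf\xc3\xa9,na\xc3\xafve;\xe6\x97\xa5\xe6\x9c\xac"` with delimiters `",;"` yields three fields, each still a valid UTF-8 string. This only covers splitting at delimiters; it says nothing about cutting at an arbitrary position inside the text.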