Comment by lelanthran

11 hours ago

> So you use strlen() a lot and don't have to deal with multibyte characters anywhere in your code. It's not much of a strategy.

You don't need to support all the multibyte encodings (e.g. DBCS, UCS-2, UCS-4, UTF-16 or UTF-32) if you're able to normalise all input to UTF-8.

I think, when you are building a system, restricting all (human language) input to be UTF-8 is a fair and reasonable design decision, and then you can use strlen to your heart's content.

Am I missing something here? UTF-8 has multibyte characters too; they're just spread across multiple bytes.

When you strlen() a UTF-8 string, you don't get the length of the string, but instead the size in bytes.

Same with indices. If you index at [1] into a string holding a flag emoji, you don't get a valid UTF-8 code point, but instead some part of the flag emoji. This applies to any UTF-8 code point encoded as more than one byte, and there are a lot of those.

UTF-16 and UTF-32 are just different encodings.

What am I missing?
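
Both points are easy to demonstrate. A minimal sketch (the escapes spell out the two UTF-8 bytes of "é", U+00E9; assumes the execution character set is UTF-8):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "é" is U+00E9, encoded in UTF-8 as the two bytes 0xC3 0xA9. */
        const char *s = "h\xC3\xA9llo";

        /* strlen counts bytes, not characters: prints 6, not 5. */
        printf("%zu\n", strlen(s));

        /* s[1] is 0xC3, a lead byte -- not a valid code point on its own. */
        printf("%#x\n", (unsigned char)s[1]);
        return 0;
    }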

That's why UTF-8 libraries exist.

  • > When you strlen() a UTF-8 string, you don't get the length of the string, but instead the size in bytes.

    Exactly, and that's what you want/need anyway most of the time (most importantly when allocating space for the string or checking if it fits into a buffer).

    If you want the number of "characters", that can mean two things: either a single Unicode code point, or a grapheme cluster (i.e. a "visible character" composed from multiple Unicode code points). For that you need a proper Unicode/grapheme-aware string processing library. But this is only rarely needed in most application types, which just pass strings around or occasionally need to split/parse/tokenize by 7-bit ASCII delimiters.
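
    For illustration, a minimal sketch of the byte/code-point distinction (utf8_codepoints here is a hand-rolled helper, not a standard function; an actual grapheme-cluster count would need a real library such as ICU or utf8proc):

        #include <stdio.h>
        #include <string.h>

        /* Count Unicode code points in a UTF-8 string by skipping
           continuation bytes (those of the form 10xxxxxx). */
        static size_t utf8_codepoints(const char *s) {
            size_t n = 0;
            for (; *s; s++)
                if (((unsigned char)*s & 0xC0) != 0x80)
                    n++;
            return n;
        }

        int main(void) {
            /* "é" spelled as 'e' (U+0065) + combining acute (U+0301). */
            const char *s = "e\xCC\x81";
            printf("bytes: %zu\n", strlen(s));                 /* 3 */
            printf("code points: %zu\n", utf8_codepoints(s));  /* 2 */
            /* ...and one grapheme cluster, which needs a real library. */
            return 0;
        }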

  • Turns out that I rarely need to know sizes or indices of a UTF-8 string in anything other than bytes.

    If I write a parser, for instance, usually what I want to know is "what is the sequence of bytes between this sequence of bytes and that sequence of bytes". That there are flag emojis or whatever in there doesn't matter, and the way UTF-8 works ensures that one character's representation never partially overlaps another's.

    What the byte sequences mean only really matters if you are writing an editor, so that you know how many bytes to remove when you press backspace, for instance.

    Truncation to prevent buffer overflow seems like a case where it would matter, but not really. An overflow is an error and should be treated as such. Truncation is a safety mechanism, for when having your string truncated is a lesser evil. At that point, having half a flag emoji doesn't really matter.
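
    As a sketch of the parser point (a minimal example; the escapes are "café" plus the French flag, U+1F1EB U+1F1F7, and strchr does all the work, since UTF-8's self-synchronisation guarantees an ASCII delimiter byte never occurs inside a multibyte sequence):

        #include <stdio.h>
        #include <string.h>

        int main(void) {
            /* Fields may contain arbitrary UTF-8; ',' is 7-bit ASCII,
               so it can never appear as a trailing byte of a multibyte
               sequence. Plain byte search is safe. */
            const char *input =
                "caf\xC3\xA9,\xF0\x9F\x87\xAB\xF0\x9F\x87\xB7,plain";
            const char *p = input;
            for (;;) {
                const char *comma = strchr(p, ',');
                size_t len = comma ? (size_t)(comma - p) : strlen(p);
                printf("field: %.*s\n", (int)len, p);
                if (!comma) break;
                p = comma + 1;
            }
            return 0;
        }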

  • > When you strlen() a UTF-8 string, you don't get the length of the string, but instead the size in bytes.

    Yes, and?

    > What am I missing?

    A use-case? Where, in your C code, is it reasonable to get the number of multibyte characters instead of the number of bytes in the string?

    What are you going to use "number of unicode codepoints" for?

    Any usage that amounts to "I need the number of Unicode code points in this string" is coupled to handling the display of glyphs within your program, in which case you'd be using a library for that anyway, because graphics is not part of C (or C++).

    If you're simply printing it out, storing it, comparing it, searching it, etc, how would having the number of Unicode code points help? What would it get used for?

    • Indeed. If you have output considerations then the number of Unicode code points isn't what you wanted anyway; you care about how many output glyphs there will be. A code point might result in zero glyphs, it might modify an adjacent glyph, or it might be best rendered as multiple glyphs.

      If you're doing some sort of searching you want a normalization step and probably other pre-processing, but again you won't care about counting Unicode code points.
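
      For instance, a minimal sketch of why normalization matters for searching (both literals render as "é", but their bytes differ):

          #include <stdio.h>
          #include <string.h>

          int main(void) {
              const char *nfc = "\xC3\xA9";   /* precomposed U+00E9 */
              const char *nfd = "e\xCC\x81";  /* 'e' + combining U+0301 */

              /* A naive byte comparison misses the match: the strings
                 are canonically equivalent but byte-for-byte different.
                 Normalize both sides first (e.g. with a library such as
                 ICU or utf8proc) before searching. */
              printf("%s\n", strcmp(nfc, nfd) == 0 ? "equal" : "not equal");
              return 0;
          }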

> I think, when you are building a system, restricting all (human language) input to be UTF-8 is a fair and reasonable design decision, and then you can use strlen to your heart's content.

It makes no sense. If you only need the byte count then you can use strlen no matter what the encoding is. If you need any other kind of counting then you don't use strlen no matter what the encoding is (except in an ASCII-only environment).

"Whether I should use strlen or not" is a completely independent question to "whether my input is all UTF-8."

  • > If you only need the byte count then you can use strlen no matter what the encoding is.

    No, strlen won't give you the byte count for UTF-16: it stops at the first zero byte, and UTF-16 text is full of those.
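
    A minimal sketch (the bytes are "AB" in UTF-16LE):

        #include <stdio.h>
        #include <string.h>

        int main(void) {
            /* "AB" in UTF-16LE: 0x41 0x00 0x42 0x00 -- four bytes of text. */
            const char utf16le[] = { 0x41, 0x00, 0x42, 0x00 };

            /* strlen stops at the first 0x00 and reports 1, not 4. */
            printf("%zu\n", strlen(utf16le));
            return 0;
        }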

    > If you need the character count then you don't use strlen no matter what the encoding is (except in an ASCII-only environment).

    What use-case requires the character count without also requiring a Unicode glyph library?

    • > strlen won't give you the byte count for UTF-16.

      You're right. I stand corrected.