Comment by themafia

16 days ago

> and I end up having all these typedefs in my projects

I avoid doing this now. It's more trouble than it's worth and it changes your code from a standard dialect of C into a custom one. Plus my eyes are old and they don't enjoy separating short identifiers.

> typedef struct { ... } String

I avoid doing this. Just use `struct string { ... };`. It makes it clear what you're handling. C23 finally gave us "auto", so you shouldn't fret over typedefing everything anymore. I also prefer a "strbuf" type with an index and capacity, so I can safely read and write to it, plus a derived "strview" holding only a pointer and length that references into the buffer.
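A minimal sketch of what such a strbuf/strview pair might look like. The type names come from the comment, but the field names and the helpers `strbuf_append` and `strbuf_view` are assumptions made up for illustration, not the commenter's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: a buffer that knows its write index and capacity. */
struct strbuf {
    char  *data;  /* caller-provided storage */
    size_t len;   /* index of the next byte to write */
    size_t cap;   /* total bytes available in data */
};

/* Hypothetical non-owning view: pointer and length only, not NUL-terminated. */
struct strview {
    const char *ptr;
    size_t      len;
};

/* Append n bytes; refuse and leave the buffer unchanged if they don't fit. */
bool strbuf_append(struct strbuf *sb, const char *src, size_t n)
{
    assert(sb->len <= sb->cap);
    if (n > sb->cap - sb->len)
        return false;
    memcpy(sb->data + sb->len, src, n);
    sb->len += n;
    return true;
}

/* Derive a view of [start, start + n), clamped to the written portion. */
struct strview strbuf_view(const struct strbuf *sb, size_t start, size_t n)
{
    if (start > sb->len)
        start = sb->len;
    if (n > sb->len - start)
        n = sb->len - start;
    return (struct strview){ sb->data + start, n };
}
```

Because every write goes through the capacity check and every view is clamped to the written length, neither side can run past the buffer. The view stays valid only as long as the underlying storage does.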

> returning results

The general method of returning structures larger than two machine words is fairly inefficient. Plus you're cutting yourself off from another C23 gem, which is [[nodiscard]]. If you want the 'ok' value checked then you can _really_ specify that. Put everything else behind a pointer passed in an argument. The sum type logic works just as well there.
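A rough sketch of that shape: only the status comes back by value and is marked `[[nodiscard]]`, while the payload sits behind an out-pointer as a union selected by the status. The names `parse_digits`, `struct parse_out` and the `NODISCARD` macro are hypothetical. The macro exists only so the sketch still compiles on pre-C23 compilers, where it expands to nothing.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
#  define NODISCARD [[nodiscard]]
#else
#  define NODISCARD /* assumption: pre-C23 compiler, attribute omitted */
#endif

/* The return value acts as the tag of this "sum type". */
struct parse_out {
    union {
        long value;      /* valid when parse_digits() returned true */
        char error[32];  /* valid when it returned false */
    } u;
};

/* Parse leading decimal digits of text into out->u.value. */
NODISCARD bool parse_digits(const char *text, struct parse_out *out)
{
    assert(text != NULL && out != NULL);
    long v = 0;
    size_t i = 0;
    for (; text[i] >= '0' && text[i] <= '9'; i++) {
        if (v > (LONG_MAX - 9) / 10) {  /* next step could overflow */
            strcpy(out->u.error, "number too large");
            return false;
        }
        v = v * 10 + (text[i] - '0');
    }
    if (i == 0) {
        strcpy(out->u.error, "expected a digit");
        return false;
    }
    out->u.value = v;
    return true;
}
```

With a C23 compiler, calling `parse_digits(s, &out);` as a bare statement should draw a diagnostic, so the caller has to look at the status before touching the union.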

> I tend to avoid the string.h functions most of the time, only employing the mem family when I want to, well, mess with memory.

So you use strlen() a lot and don't have to deal with multibyte characters anywhere in your code. It's not much of a strategy.


lelanthran 16 days ago

> So you use strlen() a lot and don't have to deal with multibyte characters anywhere in your code. It's not much of a strategy.

You don't need to support every multibyte encoding (e.g. DBCS, UCS-2, UCS-4, UTF-16 or UTF-32) if you're able to normalise all input to UTF-8.

I think, when you are building a system, restricting all (human language) input to be UTF-8 is a fair and reasonable design decision, and then you can use strlen to your heart's content.

lionkor 16 days ago
Am I missing something here? UTF8 has multibyte characters, they're just spread across multiple bytes.
When you strlen() a UTF8 string, you don't get the length of the string, but instead the size in bytes.
Same with indices. If you index at [1] in a string with a flag emoji, you don't get a valid UTF8 code point, but instead some part of the flag emoji. This applies to any UTF8 code point larger than 1 byte, of which there are a lot.
UTF16 or UTF32 are just different encodings.
What am I missing?
That's why UTF8 libraries exist.
- flohofwoe 16 days ago
  
  > When you strlen() a UTF8 string, you don't get the length of the string, but instead the size in bytes.
  Exactly, and that's what you want/need anyway most of the time (most importantly when allocating space for the string or checking if it fits into a buffer).
  If you want the number of "characters", that can have two meanings: either a single UNICODE code point, or a grapheme cluster (e.g. a "visible character" that's composed from multiple UNICODE code points). For this stuff you need a proper UNICODE/grapheme-aware string processing library. But this is needed only rarely in most application types which just pass strings around or occasionally need to split/parse/tokenize by 7-bit ASCII delimiters.
- GuB-42 16 days ago
  
  Turns out that I rarely need to know sizes or indices of a UTF8 string in anything other than bytes.
  If I write a parser for instance, usually what I want to know is "what is the sequence of bytes between this sequence of bytes and that sequence of bytes". That there are flag emojis or whatever in there doesn't matter, and the way UTF8 works ensures that a character representation doesn't partially overlap with another.
  What the byte sequences mean only really matters if you are writing an editor, so that you know how many bytes to remove when you press backspace for instance.
  Truncation to prevent buffer overflow seems to be a case where it would matter, but not really. An overflow is an error and should be treated as such. Truncation is a safety mechanism, for when having your string truncated is a lesser evil. At that point, having half a flag emoji doesn't really matter.
- lelanthran 16 days ago
  
  > When you strlen() a UTF8 string, you don't get the length of the string, but instead the size in bytes.
  Yes, and?
  > What am I missing?
  A use-case? Where, in your C code, is it reasonable to get the number of multibyte characters instead of the number of bytes in the string?
  What are you going to use "number of unicode codepoints" for?
  Any usage that amounts to "I need the number of unicode codepoints in this string" is coupled to handling the display of glyphs within your program, in which case you'd be using a library for that, because graphics is not part of C (or C++) anyway.
  If you're simply printing it out, storing it, comparing it, searching it, etc, how would having the number of unicode codepoints help? What would it get used for?
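To make the "bytes are usually what you want" point from the replies above concrete, here is a small sketch of splitting UTF-8 text on a 7-bit ASCII delimiter using nothing but byte offsets. The helper name `field_len` is made up for illustration. It relies on the fact that every byte of a multibyte UTF-8 sequence has its high bit set, so an ASCII delimiter byte can never match in the middle of a character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the field that starts at s and ends
 * just before the next delimiter, or at the terminating NUL if none.
 * The delimiter must be 7-bit ASCII for the UTF-8 guarantee to hold. */
size_t field_len(const char *s, char delim)
{
    assert(s != NULL);
    assert((unsigned char)delim < 0x80);
    const char *end = strchr(s, delim);
    return end != NULL ? (size_t)(end - s) : strlen(s);
}
```

For the input `"na\xc3\xafve,caf\xc3\xa9"` (the UTF-8 bytes of "naïve,café"), the first field is 6 bytes and the second is 5, and both are whole, valid UTF-8 strings even though no code point was ever counted.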
  
zzo38computer 15 days ago

I do not agree that restricting it to UTF-8 (or to Unicode in general) is a fair and reasonable design decision (although UTF-8 may be reasonable if Unicode is somehow required anyways (you should avoid requiring Unicode if you can though), especially if the program is also expected to deal with ASCII in addition to requiring Unicode), but regardless of that, the number of code points is not usually relevant (and substring operations indexed by code points are not usually necessary either), and the number of bytes will be more important, and some programs should not need to know about the character encoding at all (or only have a limited consideration of what they do with them).
(One reason you might care about the number of code points is because you are converting UTF-8 to UTF-32 (or Shift-JIS to TRON-32 or whatever else) and you want to allocate the memory ahead of time. The number of characters (which is not the same as the number of code points in the case of Unicode, although for other character sets it might be) is probably not important; if you want to display it, you will care about the display width according to the font, and if you are doing editing then where one character starts and ends is going to be more significant than how many characters there are. If you are using and indexing by the number of code points a lot (even though as I say that should not usually be necessary), then you might use UTF-32 instead of UTF-8.)
(It is also my opinion that Unicode is not a good character set.)
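For the UTF-8 to UTF-32 case mentioned above, counting code points ahead of time can be sketched as below. The function name `utf8_codepoints` is hypothetical, and the input is assumed to be valid, NUL-terminated UTF-8. The sketch does no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a string ASSUMED to be valid UTF-8: every byte that
 * is not a continuation byte (bit pattern 10xxxxxx) starts a code point. */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

A UTF-32 destination would then need `utf8_codepoints(s)` 32-bit units, plus one if it is to be terminated. A flag emoji such as the one encoded as `"\xF0\x9F\x87\xA9\xF0\x9F\x87\xAA"` is 8 bytes to strlen() and 2 code points here, while a reader sees a single "character".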
raincole 16 days ago
> I think, when you are building a system, restricting all (human language) input to be UTF-8 is a fair and reasonable design decision, and then you can use strlen to your hearts content.
It makes no sense. If you only need the byte count then you can use strlen no matter what the encoding is. If you need any other kind of counting then you don't use strlen no matter what the encoding is (except in ASCII only environment).
"Whether I should use strlen or not" is a completely independent question to "whether my input is all UTF-8."
- lelanthran 16 days ago
  
  > If you only need the byte count then you can use strlen no matter what the encoding is.
  No, strlen won't give you the byte count on UTF16 encodings.
  > If you need character count then you don't use strlen no matter what the encoding is (except in ASCII only environment).
  What use-case requires the character count without also requiring a unicode glyph library?
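The UTF-16 objection can be seen in a few lines. This is only an illustration, assuming an ASCII-compatible execution character set. The array `hi_utf16le` and the helper `utf16_byte_len` are names invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE with a 16-bit terminator: 4 payload bytes + 2. */
const unsigned char hi_utf16le[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };

/* Byte length of a UTF-16 string, scanning 16-bit units for 0x0000.
 * Hypothetical helper; p must point at a properly terminated sequence. */
size_t utf16_byte_len(const unsigned char *p)
{
    assert(p != NULL);
    size_t n = 0;
    while (p[n] != 0 || p[n + 1] != 0)
        n += 2;
    return n;
}
```

Here `strlen((const char *)hi_utf16le)` stops at the zero high byte of 'H' and reports 1, while the real payload is 4 bytes. That is why byte counting with strlen() only works for encodings that never produce a zero byte inside a character.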
  

apaprocki 16 days ago

> > typedef struct { ... } String

> I avoid doing this. Just use `struct string { ... };'. It makes it clear what you're handling.

Well then imagine if Gtk made you write `struct GtkLabel`, etc. and you saw hundreds of `struct` on the screen taking up space in heavy UI code. Sometimes abstractions are worthwhile.

wavemode 16 days ago
The main thing I dislike about typedefs is that you can't forward declare them.
If I know for sure I'm never going to need to do that then OK.
- procaryote 16 days ago
  
  How do you mean? You can at least do things like
  typedef struct foo foo;
  and somewhere else
  struct foo { … };
- flohofwoe 16 days ago
  
  The usual solution for this is:
  typedef struct bla_s { ... } bla_t;
  Now you have a struct named 'bla_s' and a type alias 'bla_t'. For the forward declaration you'd use 'bla_s'.
  Using the same name also works just fine, since structs and type aliases live in different namespaces:
  typedef struct bla_t { ... } bla_t;
  ...also before that topic comes up again: the _t postfix is not reserved in the C standard :)
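Putting the two replies above together, a forward-declared typedef could look like the following sketch. The `node` type and its functions are illustrative names, not taken from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration: tag and alias can share a name because struct tags
 * and typedef names live in separate namespaces. */
typedef struct node node;

/* Pointers to the still-incomplete type are usable before the definition. */
size_t node_count(const node *head);
node  *node_push(node *head, node *n);

/* The full definition, which could just as well live in another file. */
struct node {
    int   value;
    node *next;
};

/* Walk the list and count its elements. */
size_t node_count(const node *head)
{
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Link n in front of head and return the new head. */
node *node_push(node *head, node *n)
{
    assert(n != NULL);
    n->next = head;
    return n;
}
```

The one thing that is not possible is repeating the typedef with a different underlying type. Code that only needs the pointer can include just the first declaration.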
  
lelanthran 16 days ago

> Well then imagine if Gtk made you write `struct GtkLabel`, etc. and you saw hundreds of `struct` on the screen taking up space in heavy UI code. Sometimes abstractions are worthwhile.
TBH, in that case the GtkLabel (and, indeed, the entire widget hierarchy) should be opaque pointers anyway.
If you're not using a struct as an abstraction, then don't typedef it. If you are, then hide the damn fields.
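A sketch of the "typedef it and hide the fields" approach, with the header and implementation shown in one listing for brevity. The `label` type and its functions are hypothetical and are not Gtk's API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* --- what a public header might expose (hypothetical label.h) --- */
typedef struct label label;              /* incomplete type: fields hidden */
label      *label_new(const char *text);
const char *label_text(const label *l);
void        label_free(label *l);

/* --- what the implementation file might contain (hypothetical label.c) --- */
struct label {
    char *text;  /* owned copy of the caption */
};

/* Allocate a label holding a private copy of text; NULL on allocation failure. */
label *label_new(const char *text)
{
    assert(text != NULL);
    label *l = malloc(sizeof *l);
    if (l == NULL)
        return NULL;
    size_t n = strlen(text) + 1;
    l->text = malloc(n);
    if (l->text == NULL) {
        free(l);
        return NULL;
    }
    memcpy(l->text, text, n);
    return l;
}

/* Read-only accessor: callers never see the struct layout. */
const char *label_text(const label *l)
{
    assert(l != NULL);
    return l->text;
}

/* Release the label and its text; accepts NULL like free(). */
void label_free(label *l)
{
    if (l != NULL) {
        free(l->text);
        free(l);
    }
}
```

Client code that sees only the header part can hold and pass `label *` but cannot touch `text` directly, so the typedef is doing real abstraction work rather than just saving keystrokes.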

f1shy 16 days ago

Thank you! I wanted to point out exactly that. When I was a very junior programmer, and coded alone, I used to have “that elemental header” where lots of things were inside. Many of them were there to convert C into what I wished it was.

Now I think it is somewhere between not a good idea and absolutely awful.

Yes, sometimes you wish some things were different in a programming language: “if only these types had shorter names”. But when you work in a team, first you should have consensus, and then modifying the language becomes a heavy load that every new person in the project will have to lift.

“Modifying C is porting the Lisp curse to C” is my motto. Use everything as standard, as vanilla as possible.

JKCalhoun 16 days ago

I was going to comment the same thing.

I had a coworker who had a very complicated set of "includes" that their code relied upon—not unlike the typedefs in the post. So his code was difficult to move around without also moving all his headers with it.

I try to minimize dependencies (custom headers, custom macros, etc.).