
Comment by hyperman1

2 years ago

I fear all of this is dancing around the nasty core of the problem: generic writing to C-style strings can't be done without extra information. You can't write stuff to memory without negotiating how much room is needed and how much is available, and optionally moving the string. Silent truncation will cause bugs. Buffer overflows even more so.

Fixing this now is hard: writing to 0-terminated strings requires manually tracking lengths. Expanding a string without allowing malloc is misery.
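To make the "manually tracking lengths" point concrete, here is a minimal sketch of an append into a fixed buffer that refuses to truncate. The name `append_fixed` and its signature are made up for illustration; nothing like it exists in the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a standard function.
 * buf:  0-terminated string living in a fixed buffer of `cap` bytes
 *       (the terminator counts towards cap)
 * len:  current string length, tracked by the caller
 * Returns 0 on success, -1 if src does not fit (buf is left unchanged). */
int append_fixed(char *buf, size_t cap, size_t *len, const char *src)
{
    assert(buf != NULL && len != NULL && src != NULL);
    assert(*len < cap);                 /* room for the terminator exists */

    size_t add = strlen(src);
    if (add >= cap - *len)              /* need *len + add + 1 <= cap */
        return -1;                      /* report it, don't truncate */

    memcpy(buf + *len, src, add + 1);   /* copy including the terminator */
    *len += add;
    return 0;
}
```

The caller has to carry `cap` and `len` next to every buffer and check every return value, which is exactly the bookkeeping that gets forgotten.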

The only way out I see is basically starting from zero: ISO C should define an API with a (pointer, current length, max length) struct at its core, with the pointer pointing to a 0-terminated C string. You can read it, but changing it requires using functions that can error out and/or malloc more memory. There are already multiple libs like this, but standard C has none. If the struct were part of the ABI, non-C programming languages could pass strings between them too.
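A rough sketch of what such a struct and one mutating function might look like. Everything here (`cstr`, `cstr_append`, `cstr_free`, the growth policy) is invented for illustration and is not taken from ISO C or from any existing library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: (pointer, current length, max length). */
typedef struct {
    char  *ptr;  /* 0-terminated buffer, or NULL before first use */
    size_t len;  /* current length, terminator not counted */
    size_t cap;  /* max length that fits, terminator not counted */
} cstr;

/* Append src to s, growing the buffer when needed.
 * Returns 0 on success, -1 on size overflow or allocation failure
 * (s is left unchanged in that case). src must not point into s->ptr. */
int cstr_append(cstr *s, const char *src)
{
    assert(s != NULL && src != NULL);

    size_t add = strlen(src);
    if (add > SIZE_MAX - 1 - s->len)
        return -1;                        /* len + add + 1 would overflow */
    size_t need = s->len + add;

    if (s->ptr == NULL || need > s->cap) {
        size_t new_cap = s->cap ? s->cap : 16;
        while (new_cap < need) {
            if (new_cap > SIZE_MAX / 2) { new_cap = need; break; }
            new_cap *= 2;                 /* assumed policy: doubling */
        }
        char *p = realloc(s->ptr, new_cap + 1);
        if (p == NULL)
            return -1;                    /* old buffer is still valid */
        s->ptr = p;
        s->cap = new_cap;
    }

    memcpy(s->ptr + s->len, src, add + 1); /* copies the terminator too */
    s->len = need;
    return 0;
}

/* Release the buffer and reset the struct to its empty state. */
void cstr_free(cstr *s)
{
    assert(s != NULL);
    free(s->ptr);
    s->ptr = NULL;
    s->len = 0;
    s->cap = 0;
}
```

Starting from `cstr s = {0};`, existing code can still read `s.ptr` as an ordinary C string, while all writes go through functions that can fail visibly.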

C had the opportunity to include this but they did not. It is my understanding that they wanted to design everything in C as inherent to the language, rather than as magic types, especially a struct. There is an elegance in the notion that a string is just an array of characters. If I’m working with a significant number of strings in C, I can keep track of lengths, not a huge deal.

  • Exactly this. There are no literals in C that create composite types. There are no composite types inherent to the language. All these types are defined in (system) includes.

    And zero-terminated strings are not strictly worse than length-prefixed string forms. They save some space -- sure, less relevant today -- as well as provide an in-band termination signal -- which is hacky, again sure, but it is convenient when looking at a hex dump for example.

  • An early version of C didn't have structs; the initial attempt to get the OS off the ground failed, and after adding structs it worked. Structs are just syntactic sugar over memory offsets relative to a base pointer, a construct for which many CPUs include primitives.

  • C is lots of magic and quirkiness.

    This reminds me. From a spec/design perspective Ceylon was the cleanest language I know. Almost everything, including a lot of the keywords, was actually defined in the standard library. The fact that Integer was actually a Java int behind the scenes was just a compiler implementation detail. There was very little "magic". If you wanted to know how something in the language worked you could just look at the standard library.
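On the reply above about structs being sugar over offsets from a base pointer: a small sketch of what that means in practice, using the standard `offsetof` macro. The `struct point` type and the `read_y_by_offset` function are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Example type, invented for illustration. */
struct point {
    int x;
    int y;
};

/* Read the field y "by hand": base address plus the field's offset.
 * This yields the same value as the expression p->y. */
int read_y_by_offset(const struct point *p)
{
    assert(p != NULL);
    const unsigned char *base = (const unsigned char *)p;
    int y;
    memcpy(&y, base + offsetof(struct point, y), sizeof y);
    return y;
}
```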

You can't even really assume that strings are writable; they might well be in ROM on an embedded device.