Comment by flohofwoe

2 years ago

Pretty much all operating system APIs use C-style zero-terminated strings. So while C may be historically responsible for the problem, not using C doesn't help much if you need to talk to OS APIs.

> not using C doesn't help much if you need to talk to OS APIs

This means cdecl, stdcall or whatever modern ABIs OSes use, not C. Many languages and runtimes can call APIs and DLLs, though you may rightfully argue that their FFI or wrappers were likely compiled from C using the same ABI flags. But ABI is no magic, just a well-defined set of conventions.

And then, nothing stops you from using length-aware strings and either keeping a safety null at the end or copying to a null-terminated buffer only before a call. Most OS calls are I/O-bound and incomparably heavier than the copy anyway.
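
A minimal sketch of the "safety null" idea, assuming a hypothetical `lstr` type and helper names that are made up for illustration and do not come from any real library. The length is carried explicitly, and one extra NUL byte is kept past the end so the same buffer can be handed to a NUL-terminated API without a copy:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-aware string: `len` bytes of data plus one extra
   NUL byte stored at data[len] that is not counted in `len`. */
typedef struct {
    size_t len;
    char  *data;
} lstr;

/* Build an lstr from `len` raw bytes. Returns 0 on success, -1 on failure. */
static int lstr_from_bytes(lstr *out, const char *bytes, size_t len)
{
    if (len == SIZE_MAX)        /* len + 1 would overflow */
        return -1;
    char *p = malloc(len + 1);
    if (p == NULL)
        return -1;
    if (len > 0)
        memcpy(p, bytes, len);
    p[len] = '\0';              /* the "safety null" */
    out->len = len;
    out->data = p;
    return 0;
}

/* Pointer usable with NUL-terminated APIs, no copy needed.
   Caveat: an embedded '\0' in the data still truncates it for the callee. */
static const char *lstr_cstr(const lstr *s)
{
    return s->data;
}

/* Release the buffer and reset the fields. */
static void lstr_free(lstr *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

With this layout, something like `fopen(lstr_cstr(&path), "r")` works directly, while the rest of the program uses `len` instead of calling `strlen`.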

  • The problem is, a null-terminated string is a very simple concept for an ABI. A string with a length count seems simple, but there is a big step up in complexity, and you can't just wistfully imagine effortlessly passing String objects around to your ABI.

    For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

    So you won't be passing objects around. At the ABI, you'll have to pass a pointer and a length. Calling an ABI will involve unwrapping and wrapping objects to pretend you are dealing with 'your' strings. Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy). If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format, and manage the memory. Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

    None of these are insurmountable, but they are a complexity that is rarely thought of when people declare 'C style ABIs are terrible!'

    • > For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

      I don't really think anyone expects a C ABI to have multiple implementation-defined string types. They want a pointer + length string interface that removes the use of null-terminated strings altogether.

      > If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format

      A C function with proper error handling (something you want for all your interface functions) normally looks something like this:

      int name(T1 param_1, T2 param_2, ..., TN param_n, R1* return_1, R2* return_2, ..., RN* return_n);

      Here the returned int is the error code, param_1 to param_n are the input parameters, and return_1 to return_n are the results of the function.

      When writing these kinds of functions having an extra parameter for the size of the strings either for input or output is not a huge complexity increase.
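
      A rough sketch of that shape with pointer + length strings on both sides. The function name `str_concat` and the error codes are hypothetical, chosen only for illustration, and the sketch assumes the result is allocated with `malloc` and released by the caller with `free`:

      ```c
      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical error codes for this sketch. */
      enum { STR_OK = 0, STR_ERR_ARG = 1, STR_ERR_NOMEM = 2 };

      /* Concatenate two (pointer, length) strings.
         Inputs:  a/a_len and b/b_len (no NUL terminator required).
         Outputs: *out receives a malloc'd buffer, *out_len its length.
         Returns an error code; outputs are only written on STR_OK. */
      static int str_concat(const char *a, size_t a_len,
                            const char *b, size_t b_len,
                            char **out, size_t *out_len)
      {
          if (out == NULL || out_len == NULL ||
              (a == NULL && a_len > 0) || (b == NULL && b_len > 0))
              return STR_ERR_ARG;
          if (a_len > SIZE_MAX - b_len)       /* total length would overflow */
              return STR_ERR_ARG;

          size_t total = a_len + b_len;
          char *buf = malloc(total > 0 ? total : 1);
          if (buf == NULL)
              return STR_ERR_NOMEM;

          if (a_len > 0)
              memcpy(buf, a, a_len);
          if (b_len > 0)
              memcpy(buf + a_len, b, b_len);

          *out = buf;
          *out_len = total;
          return STR_OK;
      }
      ```

      Each string costs one extra `size_t` parameter, and the int return value stays free for the error code.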

      > Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

      Which memory management system you use does not impact if you use null terminated strings or a pointer + length pair. Both support stack, manual, managed or gc memory. It's just about the string representation.

      For example:

      I use a gc language.

      I call a c library which returns a string that I get ownership of.

      Now I want to leverage the gc to automatically free the string at some point. What I do is tell the gc how to free it, I have to do this no matter how the string is represented.

      Or take the inverse.

      I send in a string to the c library, which takes ownership of it.

      Now the library must know how to free the memory. Typically this is done by allocating it with a library allocator (which can be malloc) before sending it to the function. Importantly the allocator is not the same as the one we use for everything else.

      What I am getting at is that if the caller and the callee do not use the same memory system, you always have to marshal between them, no matter whether you use null-terminated strings or a pointer + length pair.

    • Pretty much all string implementations can give you a pointer and a length, which you can then pass on to the foreign interface. Essentially, the API always takes a non-owning string view. C strings, on the other hand, require you to store that terminating NUL next to the string. This is only bearable because most string implementations are designed to deal with it, since C APIs are so popular.

      For returning strings, ownership is a bigger problem than the exact representation. OS APIs typically make you provide a buffer and then fail if it was not big enough.
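
      A minimal sketch of that caller-provided-buffer pattern with an explicit length. `copy_out` is a hypothetical helper, not an actual OS call. It reports the required size so the caller can retry with a larger buffer:

      ```c
      #include <errno.h>
      #include <stddef.h>
      #include <string.h>

      /* Copy a (pointer, length) string into a caller-owned buffer.
         *needed (if non-NULL) always receives the required size, so the
         caller can retry with a bigger buffer.
         Returns 0 on success, ERANGE if the buffer is missing or too small. */
      static int copy_out(const char *src, size_t src_len,
                          char *buf, size_t buf_size, size_t *needed)
      {
          if (needed != NULL)
              *needed = src_len;
          if (buf == NULL || buf_size < src_len)
              return ERANGE;
          if (src_len > 0)
              memcpy(buf, src, src_len);
          return 0;
      }
      ```

      The caller keeps ownership of the buffer throughout, so neither side needs to know how the other allocates memory.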

    • >Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy).

      The idea is to use C-style memory management: you provide a buffer into which the string is copied. For an example of returning a string this way, see the getenv_r function: https://man.netbsd.org/getenv.3

      In C++ it's more similar to std::span.

    • > you can't just wistfully imagine effortlessly passing String objects around

      To clarify, that's not what I meant. No new-style API/ABI, only unboxing a string into (str, len) in/out-params and boxing it back from returns.

    • You do as Windows does and define safe strings for the ABI, as is done for the COM API, which is nowadays the main kind of Windows API.

I suspect null-terminated strings predate C; C is just one of many languages that can use them.

  • The PDP-10 and PDP-11 assemblers had direct support for nul-terminated strings (ASCIZ directives, and OUTSTR in MACRO10) which Ritchie adopted as-is, not unlike Lisp’s CAR/CDR. It’s not entirely clear that other “high-level” languages at the time also used such a type.

    Later ISAs added support for it for C compatibility, whereas older ISAs tended to support only fixed-length or length-prefixed strings. For instance, the Z80 has LDIR, which is essentially a memcpy; copying a terminated string required a manual loop.

All non-dynamic string representations give rise to situations where programmers need to combine strings that don't fit into the destination.

Whether null-terminated or not, dynamic strings solve the problem of adding two strings together without worrying whether the destination buffer is large enough (trading that problem for DoS concerns when a malicious agent may feed a huge input to the program).

Nothing prevents those operating systems from offering custom string types.

  • In reality, a ton of stuff does. As an example: What do you do if someone calls your new string+length API with an embedded \0 character? Your internal functions are all still written in C and using char* so they will silently truncate the string. So you need to check and reject that. Except you forgot there are also APIs (like the extended attrs APIs) that do accept embedded \0. The exceptions are all over the place, in ioctl calls passed to weird device drivers etc.
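
    A sketch of the "check and reject" step at the boundary between a string+length API and char*-based internals. The function name and error codes are hypothetical. The sketch only covers the common case and does nothing for the APIs that legitimately accept embedded \0:

    ```c
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical error codes for this sketch. */
    enum {
        CSTR_OK = 0,
        CSTR_ERR_ARG = 1,
        CSTR_ERR_EMBEDDED_NUL = 2,
        CSTR_ERR_NOMEM = 3
    };

    /* Turn a (pointer, length) string into a fresh NUL-terminated copy for
       char*-based internals, rejecting input that would be silently
       truncated. On CSTR_OK the caller frees *out with free(). */
    static int to_cstr_checked(const char *s, size_t len, char **out)
    {
        if (out == NULL || (s == NULL && len > 0))
            return CSTR_ERR_ARG;
        if (len > 0 && memchr(s, '\0', len) != NULL)
            return CSTR_ERR_EMBEDDED_NUL;   /* would truncate downstream */
        if (len == SIZE_MAX)                /* len + 1 would overflow */
            return CSTR_ERR_NOMEM;

        char *copy = malloc(len + 1);
        if (copy == NULL)
            return CSTR_ERR_NOMEM;
        if (len > 0)
            memcpy(copy, s, len);
        copy[len] = '\0';
        *out = copy;
        return CSTR_OK;
    }
    ```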

    • Windows internally uses a string+length struct; the null-terminated string API is just a compatibility interface on top of it.

  • *new operating systems

    You can't change the string type without breaking all apps and services.

    • Even on a new OS it's going to be a compatibility problem. Implementing even partial POSIX compatibility makes porting stuff easier, but changing how strings work is going to make it significantly harder.