Comment by wruza

2 years ago

not using C doesn't help much if you need to talk to OS APIs

This means cdecl, stdcall or whatever modern ABIs OSes use, not C. Many languages and runtimes can call APIs and DLLs, though you may rightfully argue that their FFI or wrappers were likely compiled from C using the same ABI flags. But ABI is no magic, just a well-defined set of conventions.

And then, nothing prevents you from using length-aware strings and either keeping a safety null at the end or copying to a null-terminated buffer only before a call. Most OS calls are I/O-bound and incomparably heavier anyway.
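
A minimal sketch of the "safety null" idea, assuming a hypothetical `lstr` type (the type and the `lstr_*` names are made up for illustration, not taken from any library). The length is stored explicitly, and one extra byte is always kept so the same buffer can be handed to a NUL-terminated API without copying:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-aware string: `len` excludes the trailing NUL. */
typedef struct {
    size_t len;
    char  *data;   /* always holds len + 1 bytes; data[len] == '\0' */
} lstr;

/* Copy `len` bytes from `src` and append a safety NUL. Returns 0 on success. */
int lstr_init(lstr *s, const char *src, size_t len)
{
    assert(s != NULL && (src != NULL || len == 0));
    s->data = malloc(len + 1);
    if (s->data == NULL)
        return -1;
    if (len > 0)
        memcpy(s->data, src, len);
    s->data[len] = '\0';
    s->len = len;
    return 0;
}

/* Usable as a C string only if the payload has no embedded NUL;
 * returns NULL otherwise so the caller cannot pass a truncated value. */
const char *lstr_cstr(const lstr *s)
{
    assert(s != NULL && s->data != NULL);
    return memchr(s->data, '\0', s->len) == NULL ? s->data : NULL;
}

void lstr_free(lstr *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}
```

With something like this, a call such as `getenv(lstr_cstr(&name))` needs no copy at the boundary, provided the NULL return for embedded-NUL payloads is checked first.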

The problem is, a null-terminated string is a very simple concept for an ABI. A string with a length count seems simple, but there is a big step up in complexity, and you can't just wistfully imagine effortlessly passing String objects around to your ABI.

For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

So you won't be passing objects around. At the ABI, you'll have to pass a pointer and a length. Calling an ABI will involve unwrapping and wrapping objects to pretend you are dealing with 'your' strings. Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy). If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format, and manage the memory. Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

None of these are insurmountable, but they are a complexity that is rarely thought of when people declare 'C style ABIs are terrible!'

  • > For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

    I don't really think anyone expects a C ABI to have multiple implementation-defined string types. They want there to be a pointer + length string interface, removing the use of null-terminated strings altogether.

    > If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format

    A C function with proper error handling (something you want for all your interface functions) normally looks something like this:

    int name(T1 param_1, T2 param_2, ..., TN param_n, R1* return_1, R2* return_2, ..., RN* return_n);

    Here the int return value is the error code, param_1 through param_n are the input parameters, and return_1 through return_n are the results of the function.

    When writing these kinds of functions, having an extra parameter for the size of each string, whether input or output, is not a huge complexity increase.
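
    A concrete instance of that shape might look like the sketch below. The function name `str_join` and the `STR_*` error codes are hypothetical, chosen only to illustrate the pattern: every string parameter travels as a pointer plus a length, and the result length comes back through an out-parameter.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical error codes for this sketch. */
    enum { STR_OK = 0, STR_EINVAL = 1, STR_ERANGE = 2 };

    /* Joins a and b into out and stores the byte count in *out_len.
     * No NUL terminators are read or written. */
    int str_join(const char *a, size_t a_len,
                 const char *b, size_t b_len,
                 char *out, size_t out_cap, size_t *out_len)
    {
        if (out_len == NULL || (a == NULL && a_len > 0) || (b == NULL && b_len > 0))
            return STR_EINVAL;
        if (out == NULL && out_cap > 0)
            return STR_EINVAL;
        /* Written this way to avoid overflow in a_len + b_len. */
        if (a_len > out_cap || b_len > out_cap - a_len)
            return STR_ERANGE;
        if (a_len > 0)
            memcpy(out, a, a_len);
        if (b_len > 0)
            memcpy(out + a_len, b, b_len);
        *out_len = a_len + b_len;
        return STR_OK;
    }
    ```

    Compared with a NUL-terminated version, the only additions are the three `size_t` parameters and the bounds check, which a careful NUL-terminated API would need anyway.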

    > Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

    Which memory management system you use does not affect whether you use null-terminated strings or a pointer + length pair. Both support stack, manual, managed or GC memory. It's just about the string representation.

    For example:

    I use a gc language.

    I call a c library which returns a string that I get ownership of.

    Now I want to leverage the gc to automatically free the string at some point. What I do is tell the gc how to free it, I have to do this no matter how the string is represented.

    Or take the inverse.

    I send in a string to the c library, which takes ownership of it.

    Now the library must know how to free the memory. Typically this is done by allocating it with a library allocator (which can be malloc) before sending it to the function. Importantly the allocator is not the same as the one we use for everything else.

    What I am getting at is that if you are not using the same memory system in the caller and the callee you have to marshal between them always. No matter if you are using null terminated strings or a pointer + length pair.
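
    As a rough sketch of the first case (the `foreign_str` type and both function names are hypothetical), the wrapper a caller keeps around mostly has to remember which deallocator matches the memory. That part is the same for either string representation; only the `len` field would disappear for a null-terminated string.

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical wrapper for a string whose memory belongs to a foreign allocator. */
    typedef struct {
        char  *ptr;
        size_t len;
        void (*release)(void *);  /* deallocator matching how ptr was allocated */
    } foreign_str;

    /* Stand-in for a library call that hands ownership of a buffer to the caller.
     * Here the "library allocator" is simply malloc/free. */
    foreign_str lib_make_greeting(void)
    {
        static const char text[] = "hello";
        foreign_str s = { NULL, 0, free };
        s.ptr = malloc(sizeof text - 1);
        if (s.ptr != NULL) {
            memcpy(s.ptr, text, sizeof text - 1);
            s.len = sizeof text - 1;
        }
        return s;
    }

    /* What a GC finalizer or destructor on the caller's side would invoke. */
    void foreign_str_release(foreign_str *s)
    {
        assert(s != NULL);
        if (s->ptr != NULL && s->release != NULL)
            s->release(s->ptr);
        s->ptr = NULL;
        s->len = 0;
    }
    ```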

    • > pointer + length string interface

      If it's a 32 bit length, that will be limiting for some 64 bit programs.

      If it's a 64 bit length, it means tiny strings take up more space.

      Hey, do both! Have the length be a "size_t" and then have a "compat_32" shim around every single system call that takes at least one string argument.

      Wee!

      Imagine a parallel world in which mainstream OS kernel developers had seen the light 30 years ago and used len + data for system calls. You'd now have to support ancient binary programs that are passing strings where the length is uint16. Oh right, I forgot! We can just screw programs that are more than five years old. All the cool users are on the latest version of everything.

      > if you are not using the same memory system in the caller and the callee you have to marshal between them always. No matter if you are using null terminated strings or a pointer + length pair.

      Null-terminated byte strings are always marshaled and ready to be sent literally anywhere. They have no byte order issues. No multi-byte length field whose size and endianness we have to know. If they are UTF-8, their character encoding is already marshaled also (that's the point of using UTF-8 everywhere).

    • > don't really think anyone expects a C ABI to have multiple implementation-defined string types. They want there to be a pointer + length string interface, removing the use of null-terminated strings altogether

      Not so simple.

      32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

      Zero-length strings are easy, but what about null strings? Are you going to design the pointer + length struct to be opaque so that callers can only ever use pointers to the struct? If you don't, you cannot represent a null string (i.e. a missing value) differently from an empty string.

      How do callers free this string? You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

      Composite data types are a lot more work and are more error prone in C.

  • Pretty much all string implementations have the ability to give you a pointer and a length which you can then pass on to the foreign interface. Essentially, the API always takes a non-owning string view. C strings, on the other hand, require you to store that terminating NUL next to the string. This is only bearable because most string implementations are designed to deal with it, because C APIs are so popular.

    For returning strings, ownership is a bigger problem than the exact representation. OS APIs typically make you provide a buffer and then fail if it was not big enough.
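
    A minimal sketch of that caller-provides-the-buffer pattern, using a hypothetical `get_host_name` function with hard-coded stand-in data (not a real OS call). The caller owns the buffer, so no ownership is transferred, and the required size is reported so the caller can retry:

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical API: copies the host name into buf (no NUL terminator).
     * Always stores the full length in *needed so the caller can retry
     * with a bigger buffer. Returns 0 on success, ERANGE if buf is too small. */
    int get_host_name(char *buf, size_t cap, size_t *needed)
    {
        static const char name[] = "example-host";   /* stand-in for real data */
        const size_t len = sizeof name - 1;

        assert(needed != NULL);
        *needed = len;
        if (buf == NULL || cap < len)
            return ERANGE;
        memcpy(buf, name, len);
        return 0;
    }
    ```

    Passing `buf == NULL` with `cap == 0` doubles as a size query in this sketch, which is a common convention but still an assumption here.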

  • >Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy).

    The idea is to use C-style memory management: you provide a buffer into which the string is copied. For an example of returning a string this way, see the getenv_r function: https://man.netbsd.org/getenv.3

    In C++ it's more similar to std::span.

  • you can't just wistfully imagine effortlessly passing String objects around

    To clarify, I didn’t mean it. No new style API/ABI. Only unboxing a string into (str, len) in/out-params and boxing it back from returns.

    • Lots of C programs define a more substantial string type for themselves (e.g. dynamic, reference-counted strings or what have you), used only internally. Time-honored tradition.

  • You do as Windows does and define safe strings for the ABI, as is done for the COM API, nowadays the main kind of Windows API.