
Comment by lelanthran

2 years ago

> don't really think anyone expects a C ABI to have multiple implementation-defined string types. They want there to be a pointer + length string interface removing the use of null pointer style strings altogether

Not so simple.

32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

Zero-length strings are easy, but what about null strings? Are you going to design the pointer + length struct to be opaque so that callers can only ever use pointers to the struct? If you don't, you cannot represent a null string (i.e. a missing value) differently from an empty string.

How do callers free this string? You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

Composite data types are a lot more work and are more error prone in C.
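To make the questions above concrete, here is a minimal sketch of one possible answer, not a recommendation: an opaque, heap-allocated pointer + length string with an unsigned byte length and a mandatory free function. All names (`str`, `str_new`, `str_free` and so on) are hypothetical and not taken from any real library. In a real header only `typedef struct str str;` would be visible, so callers could never touch the fields directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: in a real API this definition would live in the .c file. */
typedef struct str {
    size_t len;   /* unsigned length in bytes, not characters */
    char  *data;  /* owned by the struct, not nul-terminated */
} str;

/* Copies len bytes into a new string. Returns NULL on allocation failure.
   A NULL str* means "missing value"; a non-NULL str* with len == 0 means
   "empty string". */
str *str_new(const char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    str *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->len = len;
    s->data = NULL;
    if (len > 0) {
        s->data = malloc(len);
        if (s->data == NULL) {
            free(s);
            return NULL;
        }
        memcpy(s->data, bytes, len);
    }
    return s;
}

size_t str_len(const str *s)        { return s ? s->len : 0; }
int    str_is_missing(const str *s) { return s == NULL; }
int    str_is_empty(const str *s)   { return s != NULL && s->len == 0; }

/* The one sanctioned way to release a string; accepts NULL like free(). */
void str_free(str *s)
{
    if (s == NULL)
        return;
    free(s->data);
    free(s);
}
```

Even this small sketch has to pick answers for length type, ownership, null versus empty, and who frees what, which is the point being made about composite types in C.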

We're very much in agreement.

The whole 'null pointer style strings' makes no sense; I think they mean 'nul-terminated'. But fine.

Your examples are excellent, let me add a few more:

Big endian? Little endian? Do we count characters or bytes? Who owns the bloody thing? Can they be modified in place? Are they in ROM or RAM? Automatic? Static? Can they be transmitted over a network 'as is' or do they need to be sent via some serialization mechanism? What about storing them on disk? And can they then be retrieved on different architectures?
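As an illustration of the endianness and serialization questions, here is a hedged sketch of writing a length-prefixed string with an explicitly big-endian 32-bit byte count, so the encoding does not depend on the host architecture. The function names and the wire layout are assumptions made up for this example, not an existing protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes a 4-byte big-endian length followed by the raw bytes.
   Returns the number of bytes written, or 0 if out is too small. */
size_t wire_put_str(uint8_t *out, size_t cap, const char *bytes, uint32_t len)
{
    assert(out != NULL && (bytes != NULL || len == 0));
    if (cap < 4 || cap - 4 < len)
        return 0;
    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
    if (len > 0)
        memcpy(out + 4, bytes, len);
    return 4 + (size_t)len;
}

/* Reads a string written by wire_put_str. On success returns 0 and points
   *bytes into the input buffer (no copy). Returns -1 if the input is
   truncated, which is exactly the negative case that is easy to forget. */
int wire_get_str(const uint8_t *in, size_t avail,
                 const uint8_t **bytes, uint32_t *len)
{
    assert(in != NULL && bytes != NULL && len != NULL);
    if (avail < 4)
        return -1;
    uint32_t n = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                 ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    if (avail - 4 < n)
        return -1;
    *bytes = in + 4;
    *len = n;
    return 0;
}
```

The length counts bytes, not characters, and the reader never trusts the declared length beyond what was actually received.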

The problem really is that C more or less requires you to know exactly what you're doing with your data, and that's impossible in a networked world: your toy library ends up integrated into something else, that something else gets connected to the internet, and suddenly all those negative test cases you never thought of are potential security issues. So any simplistic view of string handling will end up with a broken implementation, regardless of how well it worked in its initial target environment.

C's solution is simple: take the simplest possible representation and use that, pass responsibility back to the programmer for dealing with all of the edge cases. The problem is that nobody does and even those that try tend to get it subtly wrong several times across a codebase of any magnitude.

It's a nasty little problem and it will result in security issues for decades to come. There are plenty of managed languages. I had some hope (as a seasoned C programmer) that instead of this Cambrian explosion of programming languages we'd have some kind of convergence, so that it becomes easier, not harder, to pick a winner and establish some best practices. But it seems as though cooperation is rare; much more common is the mode where a defect in one language or ecosystem results in a completely new language that solves that one problem in some way (sometimes quite convoluted) at the expense of introducing a whole raft of new problems, on top of the fragmenting of mindshare.

  • It's not a hypothesis. The thing was already implemented many times in C, C++ and other languages and has been used for ages, especially for networked code, because C's "there's no length" approach is a guaranteed vulnerability.

    • It's not a guaranteed vulnerability, it's a potential vulnerability.

      Guaranteed doesn't mean "this will probably happen", it means "this will definitely happen".

      The "no length approach" can probably result in a vulnerability. It won't definitely result in a vulnerability.

      I mean, come on, if it was a guaranteed vulnerability, almost nothing on the internet would work, because everything has, somewhere down the line, a dependency on a nul-terminated string.

      I mean, do you think that nginx (https://github.com/nginx/nginx/blob/master/src/core/ngx_stri...) is getting exploited millions of times per hour because they have a few uses for nul-terminated strings?


    • Which C compilers are those then?

      Also, you keep writing 'null pointer' and 'null', there is a pretty big difference between 'null' and 'nul' and in the context of talking about language implementation details such little things matter a lot. You say a lot of stuff with great authority that simply doesn't match my experience (as a C programmer of many decades) and while I'm all open to being convinced otherwise you will have to show some references and examples.


>32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

32 bit should be enough for everyone: it's easier to type as int, and you have fewer problems with variable-sized integers on different targets. Signed length makes sense because length is a number, and numbers are signed; also, with arrays a -1 sentinel value is often used.
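A minimal sketch of the -1 sentinel convention mentioned here, assuming an `int` length as argued above. The function name `index_of` is hypothetical. Whether `int` is the right length type is exactly what is being debated in this thread; the sketch only shows why a signed type is convenient for "not found".

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first occurrence of c in s[0..len),
   or -1 if c does not occur (or if len is zero or negative). */
int index_of(const char *s, int len, char c)
{
    for (int i = 0; i < len; i++) {
        if (s[i] == c)
            return i;
    }
    return -1;
}
```

The cost of the convention is that every caller has to check for -1 before using the result as an index.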

>If you don't, you cannot represent a null string (IE a missing value) differently to an empty string.

C++ can't do it either with std::string and the sky doesn't fall, because such a distinction is rarely needed, and for business logic an empty string means absence of value. In fact, in languages with nullable strings, null string and empty string are routinely synonymous, and you often use a method like IsNullOrEmpty to check for absence of value. Anyway, you need the concept of absence for other types too, like int, so string isn't special here.

>You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

A pointer+length struct is a value type; see https://en.cppreference.com/w/cpp/container/span
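A rough C analogue of that idea, as a sketch: a non-owning view that is passed and returned by value, so the view itself needs no free function. The names `str_view`, `sv_from_cstr`, `sv_sub` and `sv_eq` are hypothetical. Note the assumption that the underlying buffer outlives every view into it; the view does not answer who owns the bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Non-owning view: copying it copies only the pointer and the length. */
typedef struct str_view {
    const char *ptr;  /* borrowed, never freed through the view */
    size_t      len;  /* length in bytes */
} str_view;

/* Builds a view over a nul-terminated string (NULL gives an empty view). */
str_view sv_from_cstr(const char *s)
{
    str_view v = { s, s ? strlen(s) : 0 };
    return v;
}

/* Returns the sub-view starting at pos with at most n bytes,
   clamped to the bounds of v so it can never point past the end. */
str_view sv_sub(str_view v, size_t pos, size_t n)
{
    if (pos > v.len)
        pos = v.len;
    if (n > v.len - pos)
        n = v.len - pos;
    str_view out = { v.ptr ? v.ptr + pos : NULL, n };
    return out;
}

/* Byte-wise equality; two empty views compare equal. */
int sv_eq(str_view a, str_view b)
{
    return a.len == b.len && (a.len == 0 || memcmp(a.ptr, b.ptr, a.len) == 0);
}
```

For example, `sv_sub(sv_from_cstr("hello world"), 6, 100)` yields a 5-byte view of `world` without allocating or copying.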

  • > C++ can't do it either with std::string and sky doesn't fall, because such distinction is rarely needed and for business logic empty string means absence of value,

    Incorrect. I'm literally, today, working on a project where the business logic is different depending on whether an empty string is stored in the database, or no string.

    "User didn't get to fill in a preference" is very different from "user didn't indicate a preference".

    In more practical terms, a missing value could mean that we use the default while an empty value could mean that we don't use it at all.

    • For the user, an empty text field means absence of value. Indeed, on rare occasions a need for optional values does arise, but that isn't specific to strings; other types like int may need it too.
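The "missing means use the default, empty means don't use it at all" rule from the reply above can be sketched with plain nul-terminated strings, where a NULL pointer stands for "no value stored". The function name and the default value are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEFAULT_GREETING "Hello"  /* hypothetical default */

/* pref == NULL : nothing stored, fall back to the default.
   pref == ""   : explicitly empty, return NULL meaning "don't use one".
   otherwise    : use the stored value as-is. */
const char *effective_greeting(const char *pref)
{
    if (pref == NULL)
        return DEFAULT_GREETING;
    if (pref[0] == '\0')
        return NULL;
    return pref;
}
```

A representation that cannot tell NULL from "" would have to collapse the first two branches into one.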