
Comment by t43562

1 day ago

I've always wondered at the motivations of the various string routines in C - every one of them seems to have some huge caveat which makes them useless.

After years I now think it's essential to have a library which records at least how much memory is allocated to a string along with the pointer.

Something like this: https://github.com/msteinert/bstring

> I've always wondered at the motivations of the various string routines in C

This idiom:

    char hostname[20];
    ...
    strncpy(hostname, input, 20);
    hostname[19] = 0;

exists because strncpy was invented for copying file names that got stored in 14-byte arrays, zero-terminated only if space permitted (https://stackoverflow.com/a/1454071).

  • Technically strncpy was invented to interact with null-padded fixed-size strings in general. We’ve mostly (though not entirely) moved away from them but fixed-size strings used to be very common. You can see them all over old file formats still.

  • It’s also horrible because each project ends up reinventing their own abstractions or solutions for dealing with common things.

    Destroys a lot of opportunity for code reuse / integration. Especially within a company.

    Alternatively their code base remains a steaming pile of crap riddled with vulnerabilities.

    • That's how everything works. You start off with some atomics and build up from there. Things that people like get standardized, and before you know what's going on it's called stdlib.

      It took a decade between Stroustrup's 1985 book "The C++ Programming Language" and the STL proposed and accepted by the ANSI/ISO committee in 1994.

  • I've always assumed that the n in strncpy was meant to signify a max length N. Now I'm wondering if it might have stood for NUL padding.

Yes, not having a length along with the string was a mistake. It dates from an era where every byte was precious and the thought of having two bytes instead of one for length was a significant loss.

I have long wondered how terrible it would have been to have some sort of "varint" at the beginning instead of a hard-coded number of bytes, but I don't have enough experience with that generation to have a good feel for it.
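For what it's worth, such a variable-length prefix is cheap by modern standards: a LEB128-style varint costs one byte for any length under 128. A hedged sketch (function names are mine, not from any standard library; the decoder does no bounds checking and is illustration only):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a length as a LEB128-style varint: 7 data bits per byte,
 * high bit set on every byte except the last. Returns bytes written. */
static size_t varint_encode(uint32_t value, unsigned char *out)
{
    size_t n = 0;
    do {
        unsigned char byte = value & 0x7F;
        value >>= 7;
        if (value)
            byte |= 0x80;           /* more bytes follow */
        out[n++] = byte;
    } while (value);
    return n;
}

/* Decode the varint back. Returns bytes consumed. */
static size_t varint_decode(const unsigned char *in, uint32_t *value)
{
    size_t n = 0;
    uint32_t v = 0;
    unsigned shift = 0;
    do {
        v |= (uint32_t)(in[n] & 0x7F) << shift;
        shift += 7;
    } while (in[n++] & 0x80);
    *value = v;
    return n;
}
```

So a string of length 300 would pay two prefix bytes, while the common short string pays one - roughly the same cost as the terminator byte it replaces.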

>every one of them seems to have some huge caveat which makes them useless

They were added to C before enough of the people designing it knew the consequences they would bring. Another fundamentally broken oversight is array-to-pointer decay in function signatures instead of having fat pointer types.
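A "fat pointer" here just means the length travels with the pointer instead of decaying away at the call boundary. C has no such built-in type, but the idea is easy to sketch with a hypothetical struct (names are mine, purely illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pointer and length travel together, so a callee cannot lose the
 * array bound the way a decayed `char *` parameter does. */
struct slice {
    const char *ptr;
    size_t len;
};

static struct slice slice_from_cstr(const char *s)
{
    struct slice sl = { s, strlen(s) };
    return sl;
}

/* Substring without copying: just adjust pointer and length. */
static struct slice slice_sub(struct slice s, size_t off, size_t len)
{
    assert(off <= s.len && len <= s.len - off);
    struct slice sl = { s.ptr + off, len };
    return sl;
}
```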

It's from a time before computer viruses, no?

But all of this book-keeping also takes extra time and space, a trade-off that is easily made nowadays.

  • Yes, in the old times if you crashed a program or whole computer with invalid input, it was your fault.

    Viruses did exist, and these were considered users' fault too.

strncpy is fairly easy, that's a special-purpose function for copying a C string into a fixed-width string, like typically used in old C applications for on-disk formats. E.g. you might have a char username[20] field which can contain up to 20 characters, with unused characters filled with NULs. That's what strncpy is for. The destination argument should always be a fixed-size char array.
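Reading such a field back is also safe as long as you never assume a terminator: strnlen (POSIX, and standardized in C23) stops at the field size even when the field is completely full. A sketch, with an illustrative helper that is not a standard function:

```c
#include <string.h>

/* Copy a NUL-padded fixed-width field into a proper C string.
 * dst must have room for cap + 1 bytes. */
static void field_to_cstr(char *dst, const char *field, size_t cap)
{
    size_t n = strnlen(field, cap);  /* never reads past the field */
    memcpy(dst, field, n);
    dst[n] = '\0';
}
```

Printing without copying works too: `printf("%.*s", (int)strnlen(field, cap), field)`.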

A couple years ago we got a new manual page courtesy of Alejandro Colomar just about this: https://man.archlinux.org/man/string_copying.7.en

  • strncpy doesn’t handle overlapping buffers (undefined behavior). Better to use strncpy_s (if you can) as it is safer overall. See: https://en.cppreference.com/w/c/string/byte/strncpy.html.

    As an aside, this is part of the reason why there are so many C successor languages: you can end up with undefined behavior if you don’t always carefully read the docs.

    • > strncpy doesn’t handle overlapping buffers (undefined behavior).

      It would make little sense for strncpy to handle this case, since, as I pointed out above, it converts between different kinds of strings.

    • Back when strncpy was written there was no undefined behaviour (as the compiler interprets it today). The result would depend on the implementation and might differ between invocations, but it was never the "this will not happen" footgun of today. The modern interpretation of undefined behaviour in C is a big blemish on the otherwise excellent standards committee, committed (hah) in the name of extremely dubious performance claims. If "undefined" meaning "left to the implementation" was good enough when CPU frequency was measured in MHz and nobody had more than one, surely it is good enough today too.

      Also I'm not sure what you mean by C successor languages not having undefined behaviour, as both Rust and Zig inherit it wholesale from LLVM. At least last I checked that was the case; correct me if I am wrong. Go, Java and C# all have sane behaviour, but those are much higher level.

      9 replies →

  • Yes, these were also common in several wire formats I had to use for market data/entry.

    You would think char symbol[20] would be inefficient for such performance sensitive software, but for the vast majority of exchanges, their technical competencies were not there to properly replace these readable symbol/IDs with a compact/opaque integer ID like a u32. Several exchanges tried and they had numerous issues with IDs not being "properly" unique across symbol types, or time (restarts intra-day or shortly before the open were a common nightmare), etc. A char symbol[20] and strncpy was a dream by comparison.

  • A big footgun with strncpy is that the output string may not be null terminated.

    • Yeah but fixed width strings don’t need null termination. You know exactly how long the string is. No need to find that null byte.

      14 replies →

  • Isn't strlcpy the safer solution these days?

    • I don't think anybody in this thread read the article.

      Strlcpy tries to improve the situation but still has problems. As the article points out, it is almost never desirable to truncate a string passed into strXcpy, yet that is what all of those functions do. Even worse, they run to the end of the source string regardless of the size parameter, so they don't even necessarily save you from the unterminated-string case. They also do loads of unnecessary work, especially if your source string is very long (like an mmapped text file).

      Strncpy got this behavior because it was trying to implement the dubious truncation feature and needed to tell the programmer where their data was truncated. Strlcpy adopted the same behavior because it was trying to be a drop in replacement. But it was a dumb idea from the start and it causes a lot of pain unnecessarily.

      The crazy thing is that strcpy has the best interface, but of course it's only useful in cases where you have externally verified that the copy is safe before you call it, and as the article points out if you know this then you can just use memcpy instead.

      As you ponder the situation you inevitably come to the conclusion that it would have been better if strings brought along their own length parameter instead of relying on a terminator, but then you realize that in order to support editing of the string as well as passing substrings you'll need to have some struct that has the base pointer, length, and possibly a substring offset and length and you've just re-invented slices. It's also clear why a system like this was not invented for the original C that was developed on PDP machines with just a few hundred KB of RAM.

      Is it really too late for the C committee to develop a modern string library that ships with base C26 or C27? I get that they really hate adding features, but C strings have been a problem for over 50 years now, and I'm not advocating for the old strings to be removed or even deprecated at this time. Just that a modern replacement be available, and that people be encouraged to use it for new code.

      3 replies →
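Both complaints - pointless truncation work and over-reading long sources - can be addressed with memccpy (POSIX, standardized in C23), which stops at the first NUL or at the size limit, whichever comes first. A sketch in the spirit of the article's curlx_strcopy (not curl's actual code, just an illustration):

```c
#include <string.h>

/* Bounded copy that always NUL-terminates and, unlike strlcpy, never
 * reads past the copied portion of the source. Returns 0 on success,
 * -1 on truncation. Illustrative helper, not a standard function. */
static int strcopy(char *dst, const char *src, size_t dstsize)
{
    if (dstsize == 0)
        return -1;
    char *end = memccpy(dst, src, '\0', dstsize);
    if (end)
        return 0;                   /* the NUL was copied: src fit */
    dst[dstsize - 1] = '\0';        /* truncated: terminate ourselves */
    return -1;
}
```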

Yet software developed in C, with all of the foibles of its string routines, has been sold and running for years, with trillions of USD in total sales.

A library that records how much memory is allocated to a string along with the pointer isn't a necessity.

Most people who write C professionally are completely used to it, although this footgun (and all of the others) is always there, lurking.

You'd generally just see code like this:-

    char hostname[20];
    ...
    strncpy(hostname, input, 20);
    hostname[19] = 0;

The problem obviously comes if you forget the line to NUL that last byte AND you have an input that is greater than 19 characters long.

(It's also very easy to get this wrong, I almost wrote `hostname[20]=0;` first time round.)
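The off-by-one risk shrinks if the array size is spelled exactly once, via sizeof, instead of being hand-typed as 20 and 19. A sketch of that variant of the idiom (the wrapper name is mine, not a standard function):

```c
#include <string.h>

/* Bounded copy that always terminates; dstsize must be the full
 * array size (> 0), so there is no 19-vs-20 to get wrong. */
static void copy_bounded(char *dst, size_t dstsize, const char *src)
{
    strncpy(dst, src, dstsize);
    dst[dstsize - 1] = '\0';
}
```

At the call site the size only ever appears as `copy_bounded(hostname, sizeof hostname, input)`, so changing the array length later cannot reintroduce the off-by-one.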

I remember debugging a problem 20+ years ago on a customer site with some software that used Sybase Open/Server that was crashing on startup. The underlying TDS communications protocol (https://www.freetds.org/tds.html) had a fixed 30 byte field for the hostname and the customer had a particularly long FQDN that was being copied in without any checks on its length. An easy fix once identified.

Back then though the consequences of a buffer overrun were usually just a mild annoyance like a random crash or something like the Morris worm. Nowadays such a buffer overrun is deadly serious as it can easily lead to data exfiltration, an RCE and/or a complete compromise.

Heartbleed and Mongobleed had nothing to do with C string functions. They were both caused by trusting user supplied payload lengths. (C string functions are still a huge source of problems though.)

  • > Yet software developed in C, with all of the foibles of its string routines, has been sold and running for years with trillions of USD is total sales.

    This doesn't seem very relevant. The same can be said of countless other bad APIs: see years of bad PHP, tons of memory safety bugs in C, and things that have surely led to significant sums of money lost.

    > It's also very easy to get this wrong, I almost wrote `hostname[20]=0;` first time round.

    Why would you do this separately every single time, then?

    The problem with bad APIs is that even the best programmers will occasionally make a mistake, and you should use interfaces (or...languages!) that prevent it from happening in the first place.

    The fact we've gotten as far as we have with C does not mean this is a defensible API.

    • Sure, the post I was replying to made it sound like it's a surprise that anything written in C could ever have been a success.

      Not many people starting a new project (commercial or otherwise) are likely to start with C, for very good reason. I'd have to have a very compelling reason to do so, as you say there are plenty of more suitable alternatives. Years ago many of the third party libraries available only had C style ABIs and calling these from other languages was clumsy and convoluted (and would often require implementing cstring style strings in another language).

      > Why would you do this separately every single time, then?

      It was just an illustration of what people used to do. The "set the trailing NUL byte after a strncpy() call" just became a thing lots of people did and lots of people looked for in code reviews - I've even seen automated checks. It was in a similar bucket to "stuff is allocated, let me make sure it is freed in every code path so there aren't any memory leaks", etc.

      Many others would have written their own function like `curlx_strcopy()` in the original article, it's not a novel concept to write your own function to implement a better version of an API.

  • I learned C in about 1989/1990 and have used it a lot since then. I have worked on a fair amount of rotten commercial C code, sold at a high price, in which every millimeter of extra functionality was bought with sweat and blood. I once spent a month finding a memory corruption issue that happened every 2 weeks with a completely different stack trace which, in the end, required a 1-line fix.

    The effort was usually out of proportion with the achievement.

    I crashed my own computer a lot before I got Linux. Do you remember far pointers? :-( In those days millions of dollars were made by operating systems without memory protection that couldn't address more than 640k of memory. One accepted that programs sometimes crashed the whole computer - about once a week on average.

    Despite considering myself an acceptable programmer I still make mistakes in C quite easily and I use valgrind or the sanitizers quite heavily to save myself from them. I think the proliferation of other languages is the result of all this.

    In spite of this I find C elegant, and I think 90% of my errors are in string handling, so if it had a decent string handling library it would be enormously better. I don't really think pure ASCIIZ strings are so marvelous or so fast that we have to accept their bullshit.

    • > I learned C in about 1989/1990 and have used it a lot since then. I have worked on a fair amount of rotten commercial C code, sold at a high price, in which every millimeter of extra functionality was bought with sweat and blood. I once spent a month finding a memory corruption issue that happened every 2 weeks with a completely different stack trace which, in the end, required a 1-line fix.

      That sums up one of my old roles where this kind of thing accounted for about 10% of my time over a 10 year period.

      Heisenbug, mutating stack traces, weeks between occurrences, 1 line fix, do some other interesting work before the next weird thing comes along.

      I think the longest running one I had (several years) was some weird interaction between pthread_cond_wait() and pthread_cond_broadcast(). Ugh.

  • > The underlying TDS communications protocol (https://www.freetds.org/tds.html) had a fixed 30 byte field for the hostname and the customer had a particularly long FQDN that was being copied in without any checks on its length. An easy fix once identified.

    I had to file a bug with a vendor because their hostname handling had a similar issue: I think it was 64 max.

    There was some pushback about if it was "really" a problem, so I ended up quoting the relevant RFCs to argue that they were not compliant with Internet standards, and eventually they fixed the issue.

  • > Yet software developed in C, with all of the foibles of its string routines, has been sold and running for years with trillions of USD is total sales.

    Even with the premise that software sales are a good metric for judging a language's design (which I think is arguable at best), we don't know whether even more money might have been made with better strings in C. You can justify pretty much anything with that argument. MongoDB (which incidentally is written in C++ and presumably makes plenty of use of std::string) made millions of dollars despite having the bug you mention, so why bother fixing it?

    • That wasn't really the point I was making.

      It was more a response to the OP's comment of:

      > I've always wondered at the motivations of the various string routines in C - every one of them seems to have some huge caveat which makes them useless.

      Which, to me, sounded like it was a surprise that anything written in C could be a success at all given that something as basic as the string handling (which is pretty fundamental) is bordering on useless.

      As I put in my other comment, there were plenty of reasons way back in the 80s/90s why C was chosen for a lot of software, and hardly any (if any at all) of those reasons remain nowadays.

      > MongoDB (which indicentally is on C++ and presumably makes plenty of use of std::string) made millions of dollars despite having the bug you mention, so why bother fixing it?

      Again, that's not the point I was making, no-one said anything about not fixing something because of how much money it has made.

      All it comes down to is that a lot of very successful software is very shoddily written, in a variety of languages, not just notoriously memory-unsafe languages like C. Well written software, or software written in a "better" language, might have a better chance of "succeeding" (whatever that means), but that doesn't mean that awful software can't succeed.

      1 reply →

  • >>(It's also very easy to get this wrong, I almost wrote `hostname[20]=0;` first time round.)

    Impossible to get wrong with a modern compiler that will warn you about it, or an LSP that will scream the moment you type the `;` and hit enter/esc.