Comment by eru

2 years ago

Well, the wider problem then is using C.

Pretty much all operating system APIs use C-style zero-terminated strings. So while C may be historically responsible for the problem, not using C doesn't help much if you need to talk to OS APIs.

  • not using C doesn't help much if you need to talk to OS APIs

    This means cdecl, stdcall or whatever modern ABIs OSes use, not C. Many languages and runtimes can call APIs and DLLs, though you may rightfully argue that their FFI or wrappers were likely compiled from C using the same ABI flags. But ABI is no magic, just a well-defined set of conventions.

    And then, nothing prohibits using length-aware strings that either keep a safety NUL at the end or are copied into a null-terminated buffer just before a call. Most OS calls are I/O-bound and incomparably heavy anyway; a sketch of the first approach follows below.
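
    A minimal sketch of that approach in C, assuming a hypothetical lstr type (not from any library): keep an explicit length, but always store a safety NUL at data[len], so C-style OS APIs can consume the buffer directly.

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      typedef struct {
          char  *data;   /* always NUL-terminated at data[len] */
          size_t len;    /* length, excluding the safety NUL */
      } lstr;

      static lstr lstr_from(const char *s) {
          lstr r;
          r.len  = strlen(s);
          r.data = malloc(r.len + 1);    /* +1 for the safety NUL */
          memcpy(r.data, s, r.len + 1);  /* copies the NUL too */
          return r;
      }

      int main(void) {
          lstr path = lstr_from("/tmp/example.txt");
          FILE *f = fopen(path.data, "w");  /* OS API sees a plain C string */
          if (f) fclose(f);
          free(path.data);
          return 0;
      }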

    • The problem is, a null-terminated string is a very simple concept for an ABI. A string with a length count seems simple, but there is a big step up in complexity, and you can't just wistfully imagine effortlessly passing String objects around to your ABI.

      For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

      So you won't be passing objects around. At the ABI, you'll have to pass a pointer and a length. Calling an ABI will involve unwrapping and wrapping objects to pretend you are dealing with 'your' strings. Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy). If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format, and manage the memory. Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

      None of these are insurmountable, but they are a complexity that is rarely thought of when people declare 'C style ABIs are terrible!'
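
      As a minimal illustration of that bookkeeping (all names hypothetical, and the pointer+length ABI stubbed out so it compiles): the returned buffer has to be rewrapped into 'your' string type and freed with the allocator that produced it.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        /* Hypothetical pointer+length ABI: the callee allocates,
           and the caller must free with the matching allocator. */
        static int os_get_hostname(char **out, size_t *out_len) {
            static const char name[] = "examplehost";
            *out_len = sizeof name - 1;
            *out = malloc(*out_len);
            memcpy(*out, name, *out_len);
            return 0;
        }

        /* "Our" string type; its layout is ours alone. */
        typedef struct { char *data; size_t len; } my_string;

        int main(void) {
            char *p; size_t n;
            os_get_hostname(&p, &n);

            /* Rewrap: copy into our own allocation so one
               allocator owns it -- an extra copy per call. */
            my_string s = { malloc(n + 1), n };
            memcpy(s.data, p, n);
            s.data[n] = '\0';
            free(p);  /* must match the ABI's allocator */

            printf("%.*s\n", (int)s.len, s.data);
            free(s.data);
            return 0;
        }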

      33 replies →

  • I suspect null-terminated strings predate C; C is just one of many languages that can use them.

    • The PDP-10 and PDP-11 assemblers had direct support for nul-terminated strings (ASCIZ directives, and OUTSTR in MACRO10) which Ritchie adopted as-is, not unlike Lisp’s CAR/CDR. It’s not entirely clear that other “high-level” languages at the time also used such a type.

      Although later ISAs added support for it for C compatibility, older ISAs tended to only support fixed-length or length-prefixed strings; for instance, the Z80 has LDIR, which is essentially a memcpy, so copying a null-terminated string required a manual loop.

  • All non-dynamic string representations give rise to situations where programmers need to combine strings that don't fit into the destination.

    Whether null-terminated or not, dynamic strings solve the problem of adding two strings together without worrying whether the destination buffer is large enough (trading that problem for DoS concerns, since a malicious agent may feed a huge input to the program). A sketch follows below.
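
    A minimal sketch of such a growable string in C (illustrative names, no particular library):

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      typedef struct { char *buf; size_t len, cap; } dstr;

      static void dstr_append(dstr *s, const char *p, size_t n) {
          if (s->len + n + 1 > s->cap) {   /* grow geometrically */
              size_t cap = s->cap ? s->cap * 2 : 16;
              while (cap < s->len + n + 1) cap *= 2;
              s->buf = realloc(s->buf, cap);
              s->cap = cap;
          }
          /* Note: growth is unbounded -- the DoS trade-off above. */
          memcpy(s->buf + s->len, p, n);
          s->len += n;
          s->buf[s->len] = '\0';
      }

      int main(void) {
          dstr s = {0};
          dstr_append(&s, "hello, ", 7);
          dstr_append(&s, "world", 5);
          puts(s.buf);   /* caller never sized a destination buffer */
          free(s.buf);
          return 0;
      }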

  • Nothing prevents those operating systems from offering custom string types.

    • In reality, a ton of stuff does. As an example: What do you do if someone calls your new string+length API with an embedded \0 character? Your internal functions are all still written in C and using char* so they will silently truncate the string. So you need to check and reject that. Except you forgot there are also APIs (like the extended attrs APIs) that do accept embedded \0. The exceptions are all over the place, in ioctl calls passed to weird device drivers etc.
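
      A minimal sketch of that failure mode (function names hypothetical):

        #include <stdio.h>
        #include <string.h>

        /* Internal helper, still written against NUL-terminated char*. */
        static size_t internal_handle(const char *s) {
            return strlen(s);   /* stops at the first '\0' */
        }

        /* Hypothetical "new" string+length entry point. */
        static void new_api(const char *s, size_t len) {
            printf("caller says %zu bytes, internals see %zu\n",
                   len, internal_handle(s));
        }

        int main(void) {
            const char data[] = "abc\0def";  /* embedded NUL */
            new_api(data, sizeof data - 1);  /* 7 bytes in, 3 seen */
            return 0;
        }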

      1 reply →

As a user posting from a Linux machine, I disagree. Though it seems the "don't use C" crowd often delegates the important decisions to somewhere else.

I guess the answer is "some people's C is good enough, but not yours"

  • If the problem is "you're using nul-terminated strings", as the GP said, then "don't use C" is a good step towards fixing that problem, no?

C is the needle, now often contaminated with the deadly RCE virus. Historically, it was used to inject life into the first bytes of the twisted, self-perpetuating bootstrapping chain of an ecosystem that today dominates the planet and the space around it.

All processors are C VMs at the end of the day. They are designed for it, and it's a great language to access raw hardware and raw hardware performance.

I still fail to label C as evil.

P.S.: Don't start with all the memory management and related stuff. We have solutions for these everywhere, including but not limited to GCs, Rust, etc. Their existence does not invalidate C, and we don't need to abandon it. Horses for courses.

  • > All processors are C VMs at the end of the day.

    That would have been a poor argument back in the 80s, and it is increasingly wrong for modern processors. Compiler intrinsics can paper over some of the conceptual gap, but dropping down to inline assembly can't be entirely eliminated (even if it's relegated to core libraries). Lots of C code relies on certain patterns compiling down to specific instructions, e.g. for vectorising, since C itself has no concept of such things (a sketch below). C is based around a 1D memory model which has no concept of cache hierarchies. C has no representation of branch prediction, out-of-order execution, or pipelines, let alone hyperthreading or multi-core programming.
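
    For example (a sketch; the intrinsic version is x86-specific SSE, well outside ISO C): the scalar loop can only hope the compiler vectorises it, while the explicit version names the instructions directly.

      #include <immintrin.h>   /* x86 SSE intrinsics, not ISO C */

      void add_scalar(float *a, const float *b, int n) {
          for (int i = 0; i < n; i++)   /* compiler *may* emit SIMD here */
              a[i] += b[i];
      }

      void add_sse(float *a, const float *b, int n) {
          int i = 0;
          for (; i + 4 <= n; i += 4) {  /* explicit 4-wide adds */
              __m128 va = _mm_loadu_ps(a + i);
              __m128 vb = _mm_loadu_ps(b + i);
              _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
          }
          for (; i < n; i++)            /* scalar tail */
              a[i] += b[i];
      }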

    After all, if processors were "C VMs", then GCC/LLVM/etc. wouldn't be such herculean feats of engineering!

    • This is a subject I love to discuss.

      Exactly. C is based around 1D memory, has no understanding of caches. All of your other arguments are true, too.

      This is why most of these things (caches, memory hierarchies, and other modern machinery) are hidden from C (and other languages, and software in general): to trick C into thinking it's still running on a PDP-11.

      All caches (L1, L2, L3, even disk caches and the various caches built in RAM) are handled by the hardware or the OS kernel itself. Unless an API is provided to talk to them, they are invisible, untouchable, and unmanageable, and this is by design (esp. the ones baked into hardware, like the Lx caches and other buffers).

      All the compilers are the interface perpetuating this smoke and mirrors, so as not to upset C's assumptions about the machine underneath it. Even then, a compiler can only command the processor up to a certain point. You can't say "I want these in the cache, and evict those"; these are automagic processes.
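
      The closest C gets is compiler-specific hints, e.g. GCC/Clang's __builtin_prefetch, and even that is only a suggestion the hardware is free to ignore; there is no portable "pin this line in cache" or "evict that one". A sketch:

        /* GCC/Clang builtin; a hint, not a command. */
        long sum_with_hint(const int *a, long n) {
            long s = 0;
            for (long i = 0; i < n; i++) {
                if (i + 64 < n)
                    __builtin_prefetch(&a[i + 64]);  /* may be ignored */
                s += a[i];
            }
            return s;
        }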

      Exactly because of these reasons, CPUs are C VMs. They work completely differently from a PDP-11, but behave like one at the uppermost level, where compilers are the topmost layer in this toolchain.

      Compilers are such herculean feats of engineering because we need to trick the programs we're building into thinking they're running on much simpler hardware. In turn, the hardware tries hard to keep this management overhead, handled by compilers, at a bare minimum while allowing higher and higher performance.

      More ponderings, and the foundation of my assertion, can be found here: https://dl.acm.org/doi/10.1145/3212477.3212479

      Paper is titled: C Is Not a Low-level Language: Your computer is not a fast PDP-11.

      2 replies →

  • This is backwards. C was conceived as a way to do the things programmers were already doing in assembler, but with high(er)-level language conveniences. In turn, the things they were doing in assembler were done to efficiently use the "VM" their code was executed on.

    • I have linked a paper published in ACM Queue in another comment of mine, which discusses this in depth.

      The gist is that hardware and compilers hide all the complexity from C and other programming languages while trying to increase performance; IOW, they emulate a PDP-11 while not being a PDP-11.

      This is why C and its descendants are so long-lived and perform so well on these systems, despite the traditional memory models and simple system models they employ.

      IOW, modern hardware and development tooling create an environment akin to a PDP-11, not unlike how VMs emulate other hardware to make other OSes happy.

      So, at the end of the day, processors are C VMs, anyway.