Undefined Behavior in C and C++

4 days ago (russellw.github.io)

One has to add that of the 218 instances of UB in ISO C23, 87 are in the core language. Of those we have already removed 26 and are in the process of removing many others. You can find my latest update here (there has been further progress since then): https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3529.pdf

  • A lot of that work is basically fixing documentation bugs, labelled "ghosts" in your text. Places where the ISO document is so bad as a description of C that you would think there's Undefined Behaviour but it's actually just poorly written.

    Fixing the document is worthwhile, and certainly a reminder that WG21's equivalent effort needs to make the list before it can even begin that process on its even longer document, but practical C programmers don't read the document and since this UB was a "ghost" they weren't tripped by it. Removing items from the list this way does not translate to the meaningful safety improvement you might imagine.

    There's not a whole lot of movement there towards actually fixing the problem. Maybe it will come later?

A couple of solutions in development (but already usable) that more effectively address UB:

i) "Fil-C is a fanatically compatible memory-safe implementation of C and C++. Lots of software compiles and runs with Fil-C with zero or minimal changes. All memory safety errors are caught as Fil-C panics." "Fil-C only works on Linux/X86_64."

ii) "scpptool is a command line tool to help enforce a memory and data race safe subset of C++. It's designed to work with the SaferCPlusPlus library. It analyzes the specified C++ file(s) and reports places in the code that it cannot verify to be safe. By design, the tool and the library should be able to fully ensure "lifetime", bounds and data race safety." "This tool also has some ability to convert C source files to the memory safe subset of C++ it enforces"

  • Fil-C is interesting because, as you'd expect, it takes a significant performance penalty to deliver this property. If it's broadly adopted, that would suggest that, at least in this regard, C programmers genuinely do prioritise their simpler language over mundane ideas like platform support or performance.

    The resulting language doesn't make sense for commercial purposes but there's no reason it couldn't be popular with hobbyists.

Undefined behavior only means that ISO C doesn't give requirements, not that nobody gives requirements. Many useful extensions are instances where undefined behavior is documented by an implementation.

Including a header that is not in the program, and not in ISO C, is undefined behavior. So is calling a function that is not in ISO C and not in the program. (If the function is not anywhere, the program won't link. But if it is somewhere, then ISO C has nothing to say about its behavior.)

Correct, portable POSIX C programs have undefined behavior in ISO C; only if we interpret them via IEEE 1003 are they defined by that document.
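A minimal sketch of that point, assuming a POSIX system. ISO C says nothing about <unistd.h> or getpid(); it is POSIX that gives both a meaning. The wrapper name current_pid is made up for illustration.

```c
/* <unistd.h> and getpid() are specified by POSIX (IEEE 1003.1), not by ISO C.
   ISO C alone places no requirements on what this translation unit does. */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: returns the calling process ID as a long. */
long current_pid(void)
{
    return (long)getpid();
}
```

Read against POSIX, the function is fully specified: getpid() always succeeds and returns the process ID.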

If you invent a new platform with a C compiler, you can have it such that #include <windows.h> reformats all the attached storage devices. ISO C allows this because it doesn't specify what happens if #include <windows.h> successfully resolves to a file and includes its contents. Those contents could be anything, including some compile-time instruction to do harm.

Even if a compiler's documentation doesn't grant that a certain instance of undefined behavior is a documented extension, the existence of a de facto extension can be inferred empirically through numerous experiments: compiling test code and reverse engineering the object code.

Moreover, the source code for a compiler may be available; the behavior of something can be inferred from studying the code. The code could change in the next version. But so could the documentation; documentation can take away a documented extension the same way as a compiler code change can take away a de facto extension.

Speaking of object code: if you follow a programming paradigm of verifying the object code, then undefined behavior becomes moot, to an extent. You don't trust the compiler anyway. If the machine code has the behavior which implements the requirements that your project expects of the source code, then the necessary thing has been somehow obtained.

  • Unfortunately it also means that when the programmer fails to understand what undefined behaviour their code exposes, the compiler is free to take advantage of that to do the ultimate performance optimizations as a means to beat compiler benchmarks.

    The code change might come in something as innocent as a bug fix to the compiler.

  • > Undefined behavior only means that ISO C doesn't give requirements, not that nobody gives requirements. Many useful extensions are instances where undefined behavior is documented by an implementation.

    True, most compilers have sane defaults in many cases for things that are technically undefined (like taking sizeof(void) or doing pointer arithmetic on a void pointer). But not all of these cases can be saved by sane defaults.

    Undefined behavior means the compiler can replace the code with whatever. So if you e.g. compile optimizing for size, the compiler may rip out the offending code, as replacing it with nothing yields the greatest size optimization.

    See also John Regehr's collection of UB-Canaries: https://github.com/regehr/ub-canaries

    Snippets of software exhibiting undefined behavior, executing e.g. both the true and the false branch of an if-statement or none etc. UB should not be taken lightly IMO...
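A small sketch of how "the compiler can replace the code" plays out, using signed overflow as the example. The function names are made up for illustration. Optimizing builds of GCC and Clang commonly fold the first check to a constant 0, because they are allowed to assume signed overflow never happens; the second version asks the same question without ever overflowing.

```c
#include <limits.h>

/* Broken overflow check: when x == INT_MAX, x + 1 overflows, which is UB
   for signed int. An optimizer may assume that never happens and reduce
   the whole function to "return 0". */
int will_overflow_broken(int x)
{
    return x + 1 < x;
}

/* Defined for every x: compare against the limit instead of computing
   a value that might not be representable. */
int will_overflow_checked(int x)
{
    return x == INT_MAX;
}
```

Calling will_overflow_broken(INT_MAX) is exactly the case the check was written for, and exactly the case where the standard gives no guarantee about the result.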

>Uninitialized data

They at least fixed this in C++26. It is no longer UB, but "erroneous behavior". Still some random garbage value (so an uninitialized pointer will likely lead to disastrous results still), but the compiler isn't allowed to fuck up your code; it has to generate code as if it had some value.

  • It won't be a "random garbage value" but is instead a value the compiler chose.

    In effect if you don't opt out your value will always be initialized but not to a useful value you chose. You can think of this as similar to the (current, defanged and deprecated as well as unsafe) Rust std::mem::uninitialized()

    There were earlier attempts to make this value zero, or rather, as many 0x00 bytes as needed, because on most platforms that's markedly cheaper to do, but unfortunately some C++ would actually have worse bugs if the "forgot to initialize" case was reliably zero instead.

  • C also fixed it in its way.

    Access to an uninitialized object defined in automatic storage, whose address is not taken, is UB.

    Access to any uninitialized object whose bit pattern is a non-value, likewise.

    Otherwise, it's good: the value implied by the bit pattern is obtained and computation goes on its merry way.

We switched to Rust. Generally, are there specific domains or applications where C/C++ remain preferable? Many exist, but are there tasks that Rust fundamentally cannot handle, or for which it is a weak choice?

  • Yes, all the industries where C and C++ are the standard, such as Khronos APIs, POSIX, CUDA, DirectX, Metal, console devkits, and the LLVM and GCC implementations.

    You are faced with creating your own wrappers, if no one else has done it already.

    The tooling, for IDEs and graphical debuggers, assumes either C or C++, so it won't be there for Rust.

    Ideally the day will come where those ecosystems might also embrace Rust, but that is still decades away maybe.

  • Rust encourages a rather different "high-level" programming style that doesn't suit the domains where C excels. Pattern matching, traits, annotations, generics and functional idioms make the language verbose and semantically-complex. When you follow their best practices, the code ends up more complex than it really needs to be.

    C is a different kind of animal that encourages terseness and economy of expression. When you know what you are doing with C pointers, the compiler just doesn't get in the way.

  • Rust forces you to code in the Rust way, while C or C++ let you do whatever you want.

    • > C or C++ let you do whatever you want.

      C and C++ force you to code in the C and C++ ways. It may be that that's what you want, but they certainly don't let me code how I want to code!

  • Yes, based on a few attempts chronicled in articles from different sources, Rust is a weak choice for game development, because it's too time-consuming to refactor.

  • Advantages of C are short compilation time, portability, long-term stability, widely available expertise and training materials, less complexity.

    IMHO you can today deal with UB just fine in C if you want to by following best practices, and the reasons given when those are not followed would also rule out use of most other safer languages.

    • This is a pet peeve, so forgive me: C is not portable in practice. Almost every C program and library that does anything interesting has to be manually ported to every platform.

      C is portable in the least interesting way, namely that compilers exist for all architectures. But that's where it stops.

    • > short compilation time

      > IMHO you can today deal with UB just fine in C if you want to by following best practices

      In other words, short compilation time has been traded off against wetware brainwashing... well, adjustment time, which makes the supposed advantage much less desirable. It is still an advantage, I reckon, though.

  • Rust can do inline ASM, so finding a task Rust "fundamentally cannot handle" is almost impossible.

  • I haven't used Rust extensively so I can't make any criticism besides that I find compilation times to be slower than C

    • I find with C/C++ I have to compile to find warnings and errors, while with Rust I get more information automatically due to the modern type and linking systems. As a result I compile Rust significantly fewer times, which is a massive speed increase.

      Rust's tooling is hands down better than that of C/C++, which contributes to a more streamlined and efficient development experience.

    • The popular C compilers are seriously slow, too. Orders of magnitude slower than the C compilers of yesteryear.

In C, using uninitialized data is undefined behavior only if:

- it is an automatic variable whose address has not been taken; or

- the uninitialized object's bits are such that it takes on a non-value representation.
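Roughly how the two conditions read against small cases. This is a sketch with made-up names. The uninitialized reads are kept inside a comment so the block compiles without warnings; only the initialized variant is live code.

```c
/* Condition 1 applies: automatic object, address never taken, read before
 * any store. Undefined behavior:
 *
 *     int n;
 *     return n;
 *
 * Neither condition applies: the address is taken, and unsigned char has
 * no non-value representations. By the rule above the read yields some
 * unspecified value instead of being UB:
 *
 *     unsigned char c;
 *     unsigned char *p = &c;
 *     return *p;
 */

/* Live code: initializing at the declaration sidesteps both conditions,
   so the read through the pointer is fully defined. */
int read_initialized(void)
{
    int n = 0;
    int *p = &n;
    return *p;
}
```

Even where the read is not UB, compilers and sanitizers will typically still warn about it, since an unspecified value is rarely what the programmer wanted.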

I don’t buy the “it’s because of optimization” argument.

And I especially don’t buy that UB is there for register allocation.

First of all, that argument only explains UB of OOB memory accesses at best.

Second, you could define the meaning of OOB by just saying “pointers are integers” and then further state that nonescaping locals don’t get addresses. Many ways you could specify that, if you cared badly enough. My favorite way to do it involves saying that pointers to locals are lazy thunks that create addresses on demand.
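For comparison, the closest thing C offers today in the "pointers are integers" direction is uintptr_t. A small sketch with made-up helper names follows. Note the assumptions: uintptr_t is optional in ISO C, and the result of the pointer-to-integer conversion is implementation-defined, so the integer ordering need not mean anything about memory layout.

```c
#include <stdint.h>

/* Round trip: void* -> uintptr_t -> void* is guaranteed to compare equal
   to the original pointer wherever uintptr_t exists. */
int round_trips(int *p)
{
    uintptr_t bits = (uintptr_t)(void *)p;
    int *q = (void *)bits;
    return q == p;
}

/* Using `<` directly on pointers into different objects is UB. Comparing
   the converted integers is ordinary unsigned arithmetic and always
   defined, but what the ordering means is up to the implementation. */
int address_before(const void *a, const void *b)
{
    return (uintptr_t)a < (uintptr_t)b;
}
```

This only gives integer semantics to the comparison, not to dereferencing; an out-of-bounds access through a pointer rebuilt from an integer is still undefined in ISO C.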

Rust here, Rust there. We are just talking about C, not Rust. Why do we have to use Rust? If you are talking about memory safety, why does no one recommend Ada instead of Rust?

We have Zig, Hare, Odin and V too.