Comment by proto_lambda

3 years ago

The problem here is that the thing that "can't happen" isn't actually something that can't happen, it's something that isn't allowed to happen according to a many-hundred-page document that approximately nobody reads. It's not something that can be optimised because the compiler can prove it cannot happen, it is allowed to be optimised because the standard says "dear programmer, if you ever make this happen, god help you".

I think this view is slightly unfair. I think of UB as the compiler saying "when you promised this thing wouldn't happen, I took you at your word. If bad things happen because you lied, they're your fault, not mine."

  • Lying requires intent. This was a mistake, something that humans are well-known for making, and if the compiler is designed to assume otherwise, it borders on useless in the real world.

    • A compiler can’t know why you fucked up, it can’t even know that you fucked up, because UBs are just ways for it to infer and propagate constraints.

      If an optimising C compiler can’t rely on UBs not happening, its potential is severely cut down due to the dearth of useful information provided by C’s type system.


    • C has always considered that the programmer knows what they are doing. Programs are assumed correct unless proven invalid.

      This is -- or at least was -- a feature, not a bug. You can implement any valid program, but you can also implement some invalid programs.

      I know the OP mentioned Rust, but it's a valid comparison: if you don't invoke "unsafe" then all your behaviour is well-defined. But the trade-off is that Rust will only let you implement a subset of valid programs unless you invoke "unsafe", which might be better termed "assumed correct".

    • That's C for you. If you want something saner, use Rust or Haskell or Python or even Java or Go or.. almost any other language that's not C or C++.

      These days the whole point of C is this Faustian pact with the devil of speed for sanity.


    • And yet C is still the dominant language. Undefined behavior is actually the reason why: any defined behavior is expensive to implement in the compiler and possibly incurs a cost at runtime. The language design intentionally trades programmer’s sanity for ease of implementation.


  • I don't think this is a fair comparison as this is all based on implicit inferences by the compiler.

    If the programmer had specifically invoked the "__assert_valid_pointer(p)" standard function (which does not exist) to promise the compiler that the pointer was valid, then it would be fine.

    The problem is that there are a lot of places where the compiler makes these assumptions.
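To make the explicit-versus-implicit distinction concrete, here is a minimal sketch. ASSUME_NON_NULL is a hypothetical macro standing in for the nonexistent "__assert_valid_pointer(p)"; it assumes GCC or Clang, since __builtin_unreachable() is a compiler extension and not ISO C.

```c
#include <stddef.h>

/* Hypothetical helper: an explicit, visible promise that p is non-null.
   Assumes GCC or Clang; __builtin_unreachable() is an extension. */
#define ASSUME_NON_NULL(p) \
    do { if ((p) == NULL) __builtin_unreachable(); } while (0)

/* Explicit promise: the programmer opted in on this line. */
int read_first(const int *p)
{
    ASSUME_NON_NULL(p);
    return *p;
}

/* Implicit promise: nobody wrote an assertion, but the dereference
   lets the compiler infer p != NULL for the rest of the function. */
int read_first_implicit(const int *p)
{
    int v = *p;        /* compiler may infer p != NULL from this load */
    if (p == NULL)     /* ...and is then allowed to delete this check */
        return -1;
    return v;
}
```

Both functions give the optimizer the same information; only the first makes the promise visible in the source.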

  • This is a good positive model of UB, fulfilling the compiler assumptions.

  • Does the compiler know? If so, can't they have a flag that doesn't allow UB?

    • Doesn't allow optimization enabled by this specific UB: yes. Doesn't allow UB at all: hard, because you probably need runtime checks.

Even if someone read all those pages, constraining ourselves to ISO C only, there is no way that after a year they would still remember the roughly 200 UB cases that are documented there.

Which is why everyone should adopt static analysis tooling and enable all the warnings related to UB and to pointer and cast misuse.

Many think they know better, it is like those that think builders don't need protection gear at a construction site, it is stuff only for the weak.

  • I think implicit, compiler-added runtime checks are a more robust and reliable solution than static analysis. For example, for pointer dereferences the compiler could emit a 0-offset dummy load if the load is not guaranteed to be within a page of the pointer. Or add abort-on-overflow for math. Or bounds checking where possible.

    It will have a non-trivial cost, but hopefully aggressive optimizations can remove many of these checks (which, ironically, is exactly the kind of optimization people are complaining about) and compilers can provide pragmas to disable them where it is critical.

    In a way sanitizers are getting there, but they are explicitly marked as for non-production use which is a problem.
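For a rough idea of what such checks amount to when written by hand, here is a sketch. checked_add and checked_get are hypothetical helper names; __builtin_add_overflow is a GCC/Clang extension, not ISO C, so this assumes one of those compilers.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: signed addition that aborts instead of overflowing.
   Assumes GCC or Clang for __builtin_add_overflow. */
int checked_add(int a, int b)
{
    int result;
    if (__builtin_add_overflow(a, b, &result)) {
        fprintf(stderr, "signed overflow in %d + %d\n", a, b);
        abort();
    }
    return result;
}

/* Hypothetical helper: array read that aborts on an out-of-range index. */
int checked_get(const int *arr, size_t len, size_t i)
{
    if (i >= len) {
        fprintf(stderr, "index %zu out of bounds (len %zu)\n", i, len);
        abort();
    }
    return arr[i];
}
```

A compiler inserting these implicitly could drop a check wherever it can prove the condition never fires, which is the same kind of reasoning the UB-based optimizations use.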

    • I agree, but unfortunately that will never happen in most C and C++ circles, just see the heat JF Bastien has been facing for a feature that has been shipping in Windows and Android for the last two years, proven in the battlefield to hardly hinder performance in real use cases.

      https://isocpp.org/files/papers/P2723R0.html

      Lots of people telling him it will never fly in production, while their Windows and Android phones are using the code that they say isn't good enough.


Except this "can't happen" happens many, many times in practice, so maybe it's time the language bureaucrats got off their high horse (but they won't).

It's still braindead and idiotic. Every relevant platform nowadays has well defined overflow for signed ints. A sane C compiler should go with that and base its optimizations on it. GCC has been a pile of garbage in this regard for many years now. Its devs get further removed from reality with every year. Treating signed int overflow as undefined should be hidden behind a flag.

  • The C/C++ language doesn't provide for a way for the compiler to see that you really meant this one check to take precedence over the implicit promise in another.

    The reason why C++ is always relevant here (though C macros and inlining cause similar issues) is that generic programming being close to optimal is a language feature - and one of the ways that's possible is by letting you write reusable code that might be "called" from a context in which some of the checks or conditions just aren't necessary. It's by design that the optimizer gets to... well, optimize that kind of code.

    There's a solid case to be made that the details of C's UB weren't well chosen and we should try to update them; but which decades old choices are perfect? Which are easy to change once there's this much legacy software in operation?

    Don't forget that some of those UBs were chosen to deal with hardware realities of the day; i.e. that the "same" operation on different hardware would do different things. For example, eliminating signed integer overflow might allow a C compiler to use a signed register that's wider than necessary, which may help on hardware that doesn't have every possible register width, or where there are complex register usage limitations. I'm no hardware geek; I'm sure somebody here knows of real examples where UB allows portability, because that's the point: UB allows people to write portable, performant code - just don't do certain things, and you're fine... which leads us to today's situation, in which UB can feel like a minefield.

    • > Don't forget that some of those UB's were chosen to deal with hardware realities of the day; i.e. that the "same" operation on different hardware would do different things.

      That's an argument for implementation defined behaviour. Not for undefined behaviour, at least not UB in the modern sense.


    • The problem is not UB per se -- the problem is that the compiler uses UB to make assumptions that are incorrect.

      Removing a comparison because of UB is fucking stupid. The compiler on the one hand assumes that the programmer is diligent enough to consider every invocation of UB, but on the other hand too stupid to see the check they wrote will always be true.

      It's not a good idea.


  • Signed int overflow being UB is one of the most basic UBs of the language, and what allows generating tight code in loops.

    This is not new, -fwrapv was introduced in 2003, but it can quite severely impact code quality, if you don’t care, just set that. Then complain that C is slow, because C is a shit language.
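A small sketch of the kind of loop this is usually about. sum_inclusive is a made-up function; the comments describe what an optimizer is permitted to assume, not what any particular compiler version is guaranteed to do.

```c
/* Hypothetical example: sum a[0] .. a[n] using a signed int index. */
long sum_inclusive(const int *a, int n)
{
    long s = 0;
    /* Because overflowing i would be UB, the compiler may assume that
       i <= n eventually becomes false, treat the trip count as n + 1,
       and on a 64-bit target keep i in a 64-bit register as the index.
       With -fwrapv it has to allow for i wrapping to INT_MIN when
       n == INT_MAX, which can block those transformations. */
    for (int i = 0; i <= n; i++)
        s += a[i];
    return s;
}
```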

    • > and what allows generating tight code in loops.

      How so? How does breaking an if statement the programmer added make the code faster? If they intended the check not to happen/be required, they wouldn't have written it. Let signed int overflow and leave any code that depends on its value alone. So yes maybe make fwrapv the default.

      > because C is a shit language.

      Well, it's as low level as it can get before reaching assembly, but why not try reducing the number of foot guns? Sometimes you still need C, and that's not going to go away for the foreseeable future.


  • It's not about what your CPU does.

    These days undefined overflow for signed integers is mostly used by compilers to be able to assume that eg 'a + 1 > a' is always true, and thus eliminate redundant checks.

    (And you wouldn't typically write code like 'a + 1 > a', but you can get it either from code generation via macros etc. or as an intermediate result from previous optimization passes.)
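As an illustration of how 'a + 1 > a' can appear without anyone typing it, here is a sketch. FITS_ONE_MORE and both function names are invented for the example.

```c
#include <limits.h>

/* Hypothetical macro of the kind a code generator might emit. */
#define FITS_ONE_MORE(a) ((a) + 1 > (a))

int can_increment(int a)
{
    /* a + 1 overflowing is UB, so an optimizer is allowed to
       fold this whole expression to the constant 1. */
    return FITS_ONE_MORE(a);
}

/* Asking the same question without performing the overflowing
   addition; there is nothing here for the optimizer to assume away. */
int can_increment_safe(int a)
{
    return a < INT_MAX;
}
```

For any a below INT_MAX the two functions agree; they can only differ at the one value where the first invokes UB.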

Basically, the compiler implements integer addition using an operation that doesn't match the semantics of integer addition in the standard, then hallucinates that it did. That is:

1) The compiler sees an expression like "a += b;" where a and b are signed integers.

2) It emits "add rA rB" in x86 assembly (rA/B being the register a/b is currently in).

3) Technically the machine code emitted does not match the semantics of the source code, since it uses wraparound addition, whereas the C standard says that for the operation to be valid, the values of a and b must be such that no overflow would occur. This is fine however, because the implementation has the freedom to do anything on integer overflow, including just punting the problem to hardware as it did in this case.

4) The compiler proceeds with the rest of the code as if the line above would never overflow. My brother in the machine spirit, you chose to translate my program to a form where integer overflow is defined.

The compiler should either a) trap on integer overflow; or b) accept integer overflow. It will be fine if it chooses either a) or b) situationally, i.e. if we have a loop where assuming no overflow is faster, then by all means - add a precondition check and crash the program if it's false, but don't just assume overflow doesn't happen when you explicitly emit code with well-defined overflow semantics.

The bigger problem is there is pretty much no way to guard against this. The moment your program is longer than one page you're screwed. You may think all your functions are fine, but then you call something from some library, the compiler does some inlining and suddenly there's an integer overflow where you didn't expect, leading to your bounds check being deleted.
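A sketch of the bounds-check scenario, with made-up function names. The first version relies on wraparound to detect overflow, which is exactly what the compiler is allowed to assume never happens; the second asks the question without overflowing.

```c
#include <limits.h>

/* Fragile: the guard only works if offset + len wraps around. Since
   signed overflow is UB, the compiler may simplify the condition to
   "len < 0", which no longer catches the overflow it was written for. */
int in_bounds_fragile(int offset, int len, int size)
{
    if (offset + len < offset)
        return 0;
    return offset + len <= size;
}

/* More robust: reject the overflowing case before doing any addition. */
int in_bounds(int offset, int len, int size)
{
    if (offset < 0 || len < 0 || size < 0)
        return 0;
    if (len > INT_MAX - offset)   /* offset + len would overflow */
        return 0;
    return offset + len <= size;
}
```

This only helps for checks you wrote yourself; it does nothing about overflow that shows up inside inlined library code, which is the harder problem described above.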