Comment by ncruces

6 hours ago

And yet, I see P1434R0 seemingly trying to introduce new undefined behavior around integer-to-pointer conversions, where previously you had reasonably sensible implementation-defined behavior (the conversions “are intended to be consistent with the addressing structure of the execution environment”).

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p14...

Pointer provenance already existed before, but the standards were contradictory and incomplete. This is an effort to more rigorously nail down the semantics.

i.e., the UB already existed, but it was not explicit: it had to be inferred from the whole text, and the boundaries were fuzzy. Remember that anything not explicitly defined by the standard is implicitly undefined.

Also remember: just because you can legally construct a pointer doesn't mean it is safe to dereference it.

  • The current standard still says integer-to-pointer conversions are implementation-defined (not undefined) and furthermore "intended to be consistent with the addressing structure of the execution environment" (that's a direct quote).

    I have an execution environment, Wasm, where doing this is pretty well defined, in fact. So if I want to read the memory at address 12345, which is within bounds of the linear memory (and there's a builtin to make sure), why should it be undefined behavior?
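
    For concreteness, something like this (a minimal sketch, assuming clang targeting wasm32; __builtin_wasm_memory_size is a real clang builtin that returns the size of the linear memory in 64 KiB pages):

      #include <stdint.h>
      #include <stddef.h>

      /* Sketch: read an int at a raw linear-memory address. The cast is
         implementation-defined per the standard; on wasm32 the address
         space is the linear memory, so the read itself is well behaved
         (wasm also tolerates unaligned loads). */
      int read_at(uintptr_t addr) {
          size_t bytes = __builtin_wasm_memory_size(0) * 65536u; /* pages -> bytes */
          if (addr <= bytes - sizeof(int))   /* bounds check (sketch: assumes
                                                at least one page of memory) */
              return *(const int *)addr;     /* integer-to-pointer conversion */
          return -1;
      }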

    And regarding pointer provenance, why should going through pointer-to-integer and integer-to-pointer conversions try to preserve provenance at all, and be undefined behavior in situations where that provenance is ambiguous?

    The reason I'm using integer (rather than pointer) arithmetic is precisely so I don't have to be bound by pointer arithmetic rules. What good purpose does it serve for this to be undefined (rather than implementation-defined), beyond preventing certain programs from being meaningfully written at all?

    I'm genuinely curious.

    • It is important to understand why undefined behaviour has proliferated over the past ~25 years. Compiler developers are (like the rest of us) under pressure to improve metrics like the performance of compiled code. Often enough that's because a CPU vendor is the one paying for the work and has a particular target they need to reach at time of product launch, or there's a new optimization being implemented that has to be justified as showing a benefit on existing code.

      The performance of compilers is frequently measured using the SPEC series of CPU benchmarks, and one of the main constraints of the SPEC series of tests is that the source code of the benchmark cannot be changed. It is static.

      As a result, compiler authors have to find increasingly convoluted ways to make it possible for various new compiler optimizations to be applied to the legacy code used in SPEC. Take 403.gcc: it's based on gcc version 3.2, which was released on August 14th, 2002 -- nearly 23 years ago.

      By making certain code patterns undefined behaviour, compiler developers are able to relax the constraints and allow various optimizations to be applied to legacy code in places which would not otherwise be possible. I believe the gcc optimization to eliminate NULL pointer checks when the pointer is dereferenced was motivated by such a scenario.
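
      The pattern was roughly this (a minimal sketch; the function is made up):

        int f(int *p) {
            int x = *p;      /* dereference: the compiler infers p != NULL,   */
            if (p == NULL)   /* ...since a null dereference would be UB,      */
                return -1;   /* so this check is "dead" and can be deleted    */
            return x;
        }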

      In the real world, code tends to get updated when compilers are updated or when performance optimizations are made, so there is no need for excessive compiler "heroics" that weasel optimizations into legacy code via undefined behaviour. So long as SPEC is used to measure compiler performance with static, unchanging legacy code, we will continue to see compiler developers committing undefined-behaviour madness.

      The only way around this is for non-compiler-developer folks to force language standards to prevent compilers from using undefined behaviour to perform what normal software developers consider to be utterly insane code transformations.

    • I fully agree with your analysis, but compiler writers did think they could bend the rules, hence it was necessary to clarify that pointer-to-integer casts do work as intended. This is still not in ISO C23, btw, because some compiler vendors argued against it. But it is a TS now. If you are affected, please file bugs against your compilers.

  • Pointer provenance was certainly not around in the 80s. That's a more modern creation seeking to extract better performance from some applications at the cost of making others broken/unimplementable.

    It's not something that exists in the hardware. It's also not a good idea, though trying to steer people away from it proved beyond my politics.

    • Pointer provenance probably dates back to the 70s, although not under that name.

      The essential idea of pointer provenance is that it is somehow possible to enumerate all of the uses of a memory location (in a potentially very limited scope). By the time you need to introduce something like "volatile" to indicate to the compiler that there are unknown uses of a variable, you have to concede the point that the compiler needs to be able to track all the known uses--and that process, of figuring out the known uses, is pointer provenance.

      As for optimizations, the primary optimization impacted by pointer provenance is... moving variables from stack memory to registers. It's basically a prerequisite for doing any optimization.
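
      A minimal sketch of why (hypothetical function):

        int f(int *p) {
            int x = 1;   /* the address of x is never taken...                */
            *p = 2;      /* ...so p cannot (provenance-wise) point at x...    */
            return x;    /* ...and this can fold to "return 1", with x kept
                            in a register instead of stack memory */
        }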

      The thing is that, traditionally, the pointer provenance model of compilers has been a hand-wavy "trace dataflow back to the object address's source", which breaks down because optimizers haven't maintained source-level data dependencies for a few decades now. This hasn't been much of a problem in practice, because breaking data dependencies largely requires pointers that share the same address, and, outside of contrived examples, you don't really run into a situation where you have two objects at the same address while playing around with pointers to them in a way that might cause the compiler to break the dependency.
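
      A sketch of such a contrived case (assuming y happens to be placed immediately after x in memory, which nothing guarantees):

        #include <stdint.h>
        #include <stdio.h>

        int main(void) {
            int x = 1, y = 2;
            int *p = &x + 1;   /* one-past-the-end of x: legal to form; provenance is x */
            int *q = &y;
            if ((uintptr_t)p == (uintptr_t)q) {  /* may hold if y follows x */
                *p = 11;       /* same address as q, but a write through p is UB under */
            }                  /* provenance rules, because p's provenance is x, not y */
            printf("%d\n", y); /* a compiler may still print 2 here */
            return 0;
        }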

    • I'm not a compiler writer, but I don't know how you would be able to implement any optimization while allowing arbitrary pointer forging, without whole-program analysis.

      8 replies →

    • > It's not something that exists in the hardware

      This is, on the one hand, sort of not a meaningful claim, and on the other hand not even really true if you squint anyway?

      Firstly, the hardware does not have pointers. It has addresses, and those really are integers. Rust's addr() method on pointers gets you just the address, for whatever that's worth to you; you could write it to a log, maybe, if you like.
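
      In C terms, something like (a trivial sketch):

        #include <inttypes.h>
        #include <stdio.h>

        int main(void) {
            int x = 0;
            /* the bare address as an integer, analogous to Rust's addr() */
            printf("address of x: 0x%" PRIxPTR "\n", (uintptr_t)&x);
            return 0;
        }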

      But the Morello hardware demonstrates CHERI, an ARM feature in which a pointer has some associated information that's not the address, a sort of hardware provenance.

    • It very much is something that exists in hardware. One of the major reasons people finally discovered the provenance UB lurking in the standard is the CHERI architecture.

      1 reply →