Comment by pizlonator

3 days ago

> like the union of different languages

No. Here are two good ways to think about it:

1. It's the C programming language represented as SSA form and with some of the UB in the C spec given a strict definition.

2. It's a low level representation. It's suitable for lowering other languages to. Theoretically, you could lower anything to it since it's Turing-complete. Practically, it's only suitable for lowering sufficiently statically-typed languages to it.

> Every tool and component in the LLVM universe had its own set of rules and requirements for the LLVM IR that it understands.

Definitely not. All of those tools have a shared understanding of what happens when LLVM executes on a particular target and data layout.

The only flexibility is that you're allowed to alter some of the semantics on a per-target and per-datalayout basis. Targets have limited power to change semantics (for example, they cannot change what "add" means). Data layout is its own IR, and that IR has its own semantics - and everything that deals with LLVM IR has to deal with the data layout "IR" and has to understand it the same way.

> My bewilderment about LLVM IR not being stable between versions had given way to understanding that this freedom was necessary.

Not parsing this statement very well, but bottom line: LLVM IR is remarkably stable because of Hyrum's law within the LLVM project's repository. There's a TON of code in LLVM that deals with LLVM IR. So, it's super hard to change even the smallest things about how LLVM IR works or what it means, because any such change would surely break at least one of the many things in the LLVM project's repo.

> 1. It's the C programming language represented as SSA form and with some of the UB in the C spec given a strict definition.

This is becoming steadily less true over time, as LLVM IR is growing somewhat more divorced from C/C++, but that's probably a good way to start thinking about it if you're comfortable with C's corner case semantics.

(In terms of frontends, I've seen "Rust needs/wants this" as much as Clang these days, and Flang and Julia are also pretty relevant for some things.)

There's currently a working group in LLVM on building better, LLVM-based semantics, and the current topic du jour of that WG is a byte type proposal.

  • > This is becoming steadily less true over time, as LLVM IR is growing somewhat more divorced from C/C++, but that's probably a good way to start thinking about it if you're comfortable with C's corner case semantics.

    First of all, you're right. I'm going to reply with amusing pedantry but I'm not really disagreeing

    I feel like in some ways LLVM is becoming more like C-in-SSA...

    > and the current topic du jour of that WG is a byte type proposal.

    That's a case of becoming more like C! C has pointer provenance and the idea that byte copies can copy "more" than just the 8 bits, somehow.

    (The C provenance proposal may be in a state where it's not officially part of the spec - I'm not sure exactly - but it's effectively part of the language in the sense that a lot of us already consider it to be part of the language.)

    • The C pointer provenance is still in TS form and is largely constructed by trying to retroactively justify the semantics of existing compilers (which all follow some form of pointer provenance, just not necessarily coherently). This is still an area where we have a decent idea of what we want the semantics to be but it's challenging to come up with a working formalization.

      I'd have to double-check, but my recollection is that the current TS doesn't actually require that you be able to implement user-written memcpy, rather it's just something that the authors threw their hands up and said "we hope compilers support this, but we can't specify how." In that sense, byte type is going beyond what C does.

      3 replies →

Thanks for your detailed answer. You encouraged me to give it another try and have closer look this time.