Comment by IshKebab

2 months ago

I think having separate unaligned load/store instructions would be a much worse design, not least because they use a lot of the opcode space. I don't understand why you don't just have an option to not generate misaligned loads for people that happen to be running on CPUs where it's really slow. You don't need to wait for a profile for that.

As for `seed`, if you're running on a microcontroller you can just look up the data sheet to see if its seed entropy is sufficient. By the time you get to CPUs where portable code is important, a CSPRNG is probably fine.

I agree about page size though. Svnapot seems overly complicated and gives only a fraction of the advantages of actually bigger pages.

> As for `seed`, if you're running on a microcontroller you can just look up the data sheet to see if its seed entropy is sufficient.

It's a terrible attitude to have towards programmers, but looking at misaligned ops, I guess we can see a pattern from RISC-V authors here.

Most programmers do not target a concrete microcontroller and develop every line of code from scratch. They either develop portable libraries (e.g. https://docs.rs/getrandom) or build their projects using those libraries.

The whole raison d'être of an ISA is to provide a portable contract between hardware vendors and programmers. RISC-V authors shirk this responsibility with a "just look at your micro specs, lol" attitude.

The option to generate or not generate misaligned loads/stores does exist (`-mno-strict-align` / `-mstrict-align`). But of course that's a compile-time option, and of course the preferred state would be to have them on by default. RVA23 doesn't sufficiently guarantee/encourage them not being unreasonably slow, which leaves native misaligned loads/stores still effectively unusable (and off by default in clang/gcc with `-march=rva23u64`).

In other words, Zicclsm / RVA23 are entirely useless as far as actually getting to use native misaligned loads/stores goes.
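To make the compile-time trade-off above concrete, here is a minimal sketch in C. The helper names `load_u32` and `load_u32_le_bytes` are hypothetical, made up for illustration. The `memcpy` form leaves the choice to the compiler: when it is allowed to assume misaligned accesses work (the non-strict setting), it can lower the copy to a single load, and under the strict setting it typically falls back to something like the byte-wise version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable load from a possibly misaligned address. The compiler decides
 * whether this becomes one native load or several narrower ones. */
static inline uint32_t load_u32(const void *p) {
    assert(p != NULL);
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Explicit byte-wise little-endian load: roughly the software sequence
 * a build has to use when it cannot rely on native misaligned loads. */
static inline uint32_t load_u32_le_bytes(const unsigned char *p) {
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Both functions return the same value on a little-endian target such as RISC-V. The only difference is how many instructions the load costs, which is exactly what a portable binary cannot know ahead of time.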

  • The cursed thing is that RVA23 does basically guarantee that `vle8.v` + `vmv.x.s` on misaligned addresses is fast.

    • Yeah, that is quite funky; and indeed gcc does that. Relatedly, super-annoying is that `vle64.v` & co could then also make use of that same hardware, but that's not guaranteed. (I suppose there could be awful hardware that does `vle8.v` via single-byte loads, which wouldn't translate to `vle64.v`?)

  • > RVA23 doesn't guarantee them not being unreasonably-slow

    Right, but it doesn't guarantee that anything isn't unreasonably slow, does it? I am free to make an RVA23-compliant CPU with a div instruction that takes 10k cycles. Does that mean LLVM won't output div? At some point you're left with either `-mcpu=<specific cpu>` or falling back to reasonable assumptions about the actual hardware landscape.

    Do ARM or x86 make any guarantees about the performance of misaligned loads/stores? I couldn't find anything.

    • Exactly, I 100% agree, and IMO toolchains should default to assuming fast misaligned load/store for RISC-V.

      However, the spec has the explicit note:

      > Even though mandated, misaligned loads and stores might execute extremely slowly. Standard software distributions should assume their existence only for correctness, not for performance.

      Which was a mistake. As you said, any instruction could be arbitrarily slow, and in other aspects where performance recommendations could actually be useful, RVI usually says "we can't mandate implementation".

    • I don't think x86/ARM particularly guarantee fastness, but at least they effectively encourage making use of them via their contributions to compilers that do. They also don't really need to given that they mostly control who can make hardware anyway. (at the very least, if general-purpose HW with horribly-slow misaligned loads/stores came out from them, people would laugh at it, and assume/hope that that's because of some silicon defect requiring chicken-bit-ing it off, instead of just not bothering to implement it)

      Indeed one can make any instruction take basically-forever, but I think it's a fairly reasonable expectation that all supported hardware instructions/behaviors (at least non-deprecated ones) are not slower than a software implementation (on at least some inputs), else having said instruction is strictly-redundant.

      And if any significant general-purpose hardware actually did a 10k-cycle div around the time the respective compiler defaults were decided, I think there's a good chance that software would have defaulted to calling division through a function such that an implementation can be picked depending on the running hardware. (let's ignore whether 10k-cycle-division and general-purpose-hardware would ever go together... but misaligned-mem-ops+general-purpose-hardware definitely do)
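The "call division through a function so an implementation can be picked depending on the running hardware" idea from the comment above could look roughly like this in C. This is only a sketch under stated assumptions: all names are hypothetical, and the `hw_div_is_fast` flag stands in for whatever platform-specific probe a real program would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t (*udiv32_fn)(uint32_t n, uint32_t d);

/* Path that lets the compiler emit the hardware divide instruction. */
static uint32_t udiv32_hw(uint32_t n, uint32_t d) {
    assert(d != 0);
    return n / d;
}

/* Software fallback: restoring shift-subtract division, one bit per step. */
static uint32_t udiv32_sw(uint32_t n, uint32_t d) {
    assert(d != 0);
    uint32_t q = 0;
    uint64_t r = 0; /* 64-bit so the shift below cannot overflow */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }
    return q;
}

/* Pick an implementation once at startup. The flag is an assumption of
 * this sketch; a real program would obtain it from the platform. */
static udiv32_fn select_udiv32(bool hw_div_is_fast) {
    return hw_div_is_fast ? udiv32_hw : udiv32_sw;
}
```

The cost of this design is an indirect call on every division, which is why it only makes sense when the hardware instruction can be far slower than the software routine on hardware people actually use.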


RISC-V is not particularly good at using opcode space, unfortunately.

  • I don't think it's too bad. The compressed extension was arguably a mistake (and shouldn't be in RVA23 IMO), but apart from that there aren't any major blunders. You're probably thinking about how JAL(R) basically always uses x1/x5 (or whatever it is), but I don't think that's a huge deal.

    About 1/3 of the opcode space is used currently so there's a decent amount of space left.