Comment by rbanffy
2 months ago
Don't blame the ISA - blame the silicon implementations AND the software with no architecture-specific optimisations.
RISC-V will get there, eventually.
I remember that ARM started as a speed demon with conscious power consumption, then was surpassed by x86s and PPCs on desktops and moved to embedded, where it shone by being very frugal with power, only to now be leaving the embedded space with implementations optimised for speed more than power.
In some cases the RISC-V ISA spec is definitely the one to blame:
1) https://github.com/llvm/llvm-project/issues/150263
2) https://github.com/llvm/llvm-project/issues/141488
Another example is the hard-coded 4 KiB page size, which effectively kneecaps the ISA when compared against ARM.
All of those things are solved with modern extensions. It's like comparing pre-MMX x86 code with modern x86. Misaligned loads and stores are Zicclsm, bit manipulation is Zb[abcs], atomic memory operations are made mandatory in Ziccamoa.
All of these extensions are mandatory in the RVA22 and RVA23 profiles and so will be implemented on any up to date RISC-V core. It's definitely worth setting your compiler target appropriately before making comparisons.
Ubuntu being RVA23 is looking smarter and smarter.
The RISC-V ecosystem being handicapped by backwards compatibility does not make sense at this point.
Every new RISC-V board is going to be RVA23 capable. Now is the time to draw a line in the sand.
But RISC-V is a _new_ ISA. Why did we start out with the wrong design that now needs a bunch of extensions? RISC-V should have taken the learnings from x86 and ARM but instead they seem to be committing the same mistakes.
>Misaligned loads and stores are Zicclsm
Nope. See https://github.com/llvm/llvm-project/issues/110454 which was linked in the first issue. The spec authors have managed to make a mess even here.
Now they want to introduce yet another (sic!) extension Oilsm... It maaaaaay become part of RVA30, so in the best case scenario it will be decades before we will be able to rely on it widely (especially considering that RVA23 is likely to become heavily entrenched as "the default").
IMO the spec authors should've mandated that the base load/store instructions work only with aligned pointers and introduced misaligned instructions in a separate early extension. (After all, passing a misaligned pointer where your code does not expect it is a correctness issue.) But I would've been fine as well if they mandated that misaligned pointers should always be accepted. Instead we have to deal with the terrible middle ground.
>atomic memory operations are made mandatory in Ziccamoa
In other words, forget about potential performance advantages of load-link/store-conditional instructions. `compare_exchange` and `compare_exchange_weak` will always compile into the same instructions.
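For readers unfamiliar with the distinction: in Rust (as in C++), `compare_exchange_weak` is allowed to fail spuriously, which is what lets it map onto a single load-reserved/store-conditional attempt, while `compare_exchange` may only fail when the value really differs. A minimal sketch of where each is normally used; the function names `fetch_double_weak` and `try_claim` are made up for illustration, and how either one is lowered on a given RISC-V target depends on the compiler and the enabled extensions.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Atomically doubles the stored value and returns the previous one.
/// The weak form is fine here because a spurious failure just means
/// one more trip around the retry loop.
pub fn fetch_double_weak(cell: &AtomicU32) -> u32 {
    let mut current = cell.load(Ordering::Relaxed);
    loop {
        let new = current.wrapping_mul(2);
        match cell.compare_exchange_weak(current, new, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(prev) => return prev,
            // Either another thread changed the value or the attempt
            // failed spuriously; retry with the value actually observed.
            Err(actual) => current = actual,
        }
    }
}

/// One-shot claim of a slot that is free when it holds 0.
/// There is no retry loop, so the strong form is used: a failure
/// must mean the slot was really taken.
pub fn try_claim(cell: &AtomicU32, owner: u32) -> bool {
    cell.compare_exchange(0, owner, Ordering::AcqRel, Ordering::Relaxed)
        .is_ok()
}
```

On an LR/SC-style lowering the weak form can be a single reservation attempt, whereas the strong form needs its own inner retry loop.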
And I guess you are fine with the page size part. I know there are huge-page-like proposals, but they do not resolve the fundamental issue.
I have other minor performance-related nits, such as the `seed` CSR being allowed to produce poor quality entropy, which means that we have to bring a whole CSPRNG if we want to generate a cryptographic key or nonce on a low-powered micro-controller.
By no means do I consider myself a RISC-V expert; if anything, my familiarity with the ISA as a systems language programmer is quite shallow, but the number of accumulated disappointments even from such shallow familiarity has cooled my enthusiasm for RISC-V quite significantly.
What about page size?
You're correct but I guess my thoughts are if we're going to wind up with a mess of extensions, why not just use x86-64?
Regarding misaligned reads, IIRC only x86 hides non-aligned memory access. It's still slower than aligned reads. Other processors just fault, so it would make sense to do the same on riscv.
The problem is decades of software being written on a chip that from the outside appears not to care.
ARM Cortex-A cores also allow unaligned access (MCU cores don't though, and older ARM is weird). There's perhaps a hint if the two most popular CPU architectures have ended up in the forgiving approach to unaligned access, rather than the penalising approach of raising an interrupt.
Yes, unaligned loads/stores are a niche feature that has huge implications in processor design - loads across cache-lines with different residency, pages that fault etc.
This is the classic conundrum of legacy system redesign - if customers keep demanding every feature of the old system be present, and work the exact same then the new system will take on the baggage it was designed to get rid of.
The new implementation will be slow and buggy by this standard and nobody will use it.
Outside RISC, this was not something you had to care about in the past, across the 8-, 16- and 32-bit CPU generations.
Also the bit manipulation extension wasn't part of the core. So things like bit rotation are slow for no good reason, if you want portable code. Why? Who knows.
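To make the rotation complaint concrete, here is a sketch of what a rotate has to be built from when no rotate instruction can be assumed: two shifts and an OR, plus masking of the shift amount. The function name `rotl32_portable` is hypothetical, and whether a particular compiler emits a single rotate instruction for it depends on the target features it was told about.

```rust
/// Rotate left using only shifts and OR, i.e. the operations a base ISA
/// without a rotate instruction has to fall back on.
pub fn rotl32_portable(x: u32, n: u32) -> u32 {
    let n = n & 31; // keep the shift amount in 0..=31
    // `(32 - n) & 31` maps n == 0 to a shift by 0 instead of an
    // out-of-range shift by 32.
    (x << n) | (x >> ((32 - n) & 31))
}
```

The standard library's `u32::rotate_left` computes the same result for amounts in 0..32 and leaves the instruction selection to the compiler.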
> Also the bit manipulation extension wasn't part of the core.
This is mainly because the core is primarily a teaching ISA. One of the best parts about RISC-V is that you can teach a freshman level architecture class or a senior level chip building project with an ISA that is actually used. Anything powerful enough to run Linux (other than a manually built-from-source one) will support a profile that bundles all the instructions commonly needed to be fast.
The fact the Hazard3 designer ended up creating an extension to resolve related oddities was kind of astonishing.
Why did it fall to them to do it? Impressive that he did, but it shouldn't have been necessary.
Do you typically care about portability to the degree that you want the same machine code to execute on both a Linux box and a microcontroller? Why?
Unaligned load/store is a horrible feature to implement.
Page size can be easily extended down the line without breaking changes.
The first one is common across many architectures, including ARM, and the second is just LLVM developers not understanding how cmpxchg works
> 1) https://github.com/llvm/llvm-project/issues/150263
Huh? They have no idea what they are doing. If data is unaligned, the solution is memcpy, not compiler optimizations. Also, their hack of 17 loads is a buffer overflow. Also not an ISA spec problem.
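The memcpy approach mentioned above can be sketched like this in Rust: copy the bytes into a properly aligned local and decode from there, with a bounds check so nothing past the end of the buffer is ever read. The name `load_u32_le` is made up for this example, and little-endian is an assumption chosen only for illustration.

```rust
/// Reads a little-endian u32 starting at any byte offset, aligned or not.
/// Returns None instead of reading out of bounds.
pub fn load_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?; // guard against offset overflow
    let src = buf.get(offset..end)?; // bounds check: None if too short
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(src); // the memcpy step
    Some(u32::from_le_bytes(bytes))
}
```

Whether the copy ends up as one misaligned load or as several byte loads is then the compiler's decision, based on what the target guarantees.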
> RISC-V will get there, eventually.
Not trolling: I legitimately don't see why this is assumed to be true. It is one of those things that is true only once it has been achieved. Otherwise we would be able to create super high performance Sparc or SuperH processors, and we don't.
As you note, Arm once was fast, then slow, then fast. RISC-V has never actually been fast. It has enabled surprisingly good implementations by small numbers of people, but competing at the high end (mobile, desktop or server) it is not.
I think the bigger question is does RISC-V need to be fast? Who wants to make it fast?
I'm a chip designer and I see people using RISC-V as small processor cores for things like PCIE link training or various bookkeeping tasks. These don't need to be fast, they need to be small and low power which means they will be relatively slow.
Most people on tech review sites only care about desktop / laptop / server performance. They may know about some of the ARM Cortex A series CPUs that have MMUs and can run desktop or smartphone Linux versions.
They generally don't care about the ARM Cortex M or R versions for embedded and real time use. Those are the areas where you don't need high performance and where RISC-V is already replacing ARM.
EDIT:
I'll add that there are companies that COULD make a fast RISC-V implementation.
Intel, AMD, Apple, Qualcomm, or Nvidia could redirect their existing teams to design a high performance RISC-V CPU. But why should they? They are heavily invested in their existing x86 and ARM CPU lines. Amazon and Google are using licensed ARM cores in their server CPUs.
What is the incentive for any of them to make a high performance RISC-V CPU? The only reason I can think of is that Softbank keeps raising ARM licensing costs and it gets high enough that it is more profitable to hire a team and design your own RISC-V CPU.
Of your list, Qualcomm and Nvidia are fairly likely to make high perf Riscv cpus. Qualcomm because Arm sued them to try and stop them from designing their own arm chips without paying a lot more money, and Nvidia because they already have a lot of teams making riscv chips, so it seems likely that they will try to unify on the one that doesn't require licensing.
China is likely where it would come from - ARM and x86 are owned by Western companies.
> I think the bigger question is does RISC-V need to be fast? Who wants to make it fast?
Honestly, the initial reaction is it sounds like cope, and I know this because I've been saying it for ages to angry reactions. RISC-V looks for all the world like it is designed for competing with the 32 bit Arm ecosystem but that the designers didn't, and still don't, understand what 64 bit Arm is about.
Secondly, it's been necessary to claim such things are forever on the way in order to maintain hype and get software support. Without it you wouldn't see nearly so much Linux buildchain work. (See the open source SuperH implementations for what happens if you admit you don't go for high performance).
Finally though, as process nodes get smaller you can afford to put much more complex blocks in the same area, which can then burst through a series of operations and power off again, many times a second. (Edit to add: of course you know that, but it's still counter intuitive the extent to which it changes things over time. People have things like floating point support in places that not too long ago would have been completely minimalist, and there are some really extreme examples around).
> I'll add that there are companies that COULD make a fast RISC-V implementation.
Again, there is no proof of this until it actually happens. When Qualcomm were trying they wanted to change the spec of RISC-V, and I strongly suspect that is actually necessary.
RISC-V doesn't have the pitfalls of Sparc (register windows, branch delay slots), largely because we learned from that. It's in fact a very "boring" architecture. There's no one that expects it'll be hard to optimize for. There are at least 2 designs that have taped out in small runs and have high end performance.
RISC-V does not have the pitfalls of experimental ISAs from 45 years ago, but it has other pitfalls that have not existed in almost any ISA since the first vacuum-tube computers, like the lack of means for integer overflow detection and the lack of indexed addressing.
Especially the lack of integer overflow detection is a choice of great stupidity, for which there exists no excuse.
Detecting integer overflow in hardware is extremely cheap, its cost is absolutely negligible. On the other hand, detecting integer overflow in software is extremely expensive, increasing both the program size and the execution time considerably, because each arithmetic operation must be replaced by multiple operations.
Because of the unacceptable cost, normal RISC-V programs choose to ignore the risk of overflows, which makes them unreliable.
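As an illustration of what "detecting overflow in software" means here, a sketch of a signed 64-bit add that checks for overflow without any flags register: do the wrapping add, then compare sign bits. The name `add_checked_manual` is hypothetical, and the exact instruction count on any given RISC-V core is not something this sketch claims.

```rust
/// Signed add with a software overflow check.
/// Overflow happened exactly when both operands have the same sign
/// and the wrapped sum has the opposite sign.
pub fn add_checked_manual(a: i64, b: i64) -> Option<i64> {
    let sum = a.wrapping_add(b);
    // (a ^ sum) has its top bit set if a and sum differ in sign;
    // likewise for b. Both set means the result wrapped.
    if ((a ^ sum) & (b ^ sum)) < 0 {
        None
    } else {
        Some(sum)
    }
}
```

This is the same answer `i64::checked_add` gives; on an ISA with an overflow flag the check can be a single conditional branch after the add, while here it costs extra XOR/AND/compare work per operation.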
The highest performance implementations of RISC-V from previous years were forced to introduce custom extensions for indexed addressing, but those used inefficient encodings, because something like indexed addressing must be in the base ISA, not in an extension.
As a counterexample, I point to another relatively boring RISC, PA-RISC. It took off not (just) because the architecture was straightforward, but because HP poured cash into making it quick, and PA-RISC continued to be a very competitive architecture until the mass insanity of Itanic arrived. I don't see RISC-V vendors making that level of investment, either because they won't (selling to cheap markets) or can't (no capacity or funding), and a cynical take would say they hide them behind NDAs so no one can look behind the curtain.
I know this is a very negative take. I don't try to hide my pro-Power ISA bias, but that doesn't mean I wouldn't like another choice. So far, however, I've been repeatedly disappointed by RISC-V. It's always "five or six years" from getting there.
> RISC-V doesn't have the pitfalls of Sparc (register windows, branch delay slots),
You're saying ISA design does have implementation performance implications then? ;)
> There's no one that expects it'll be hard to optimize for
[Raises hand]
> There are at least 2 designs that have taped out in small runs and have high end performance.
Are these public?
Edit: I should add, I'm well aware of the cultural mismatch between HN and the semi industry, and have been caught in it more than a few times, but I also know the semi industry well enough to not trust anything they say. (Everything from well meaning but optimistic through to outright malicious depending on the company).
I don't think anybody suggests Oracle couldn't make faster SPARC processors, it's just that development of SPARC ended almost 10 years ago. At the time SPARC was abandoned, it was very competitive.
In single-threaded performance? That’s not how I remember it: Sun was pushing parallel throughput over everything else, with designs like the T-Series & Rock.
Sparc stopped being competitive in the early 2000’s.
Because today, getting a fast CPU out isn't as much an engineering issue as it is about getting the investment for hiring a world-class fab.
The most promising RISC-V companies today have not set out to compete directly with Intel, AMD, Apple or Samsung, but are targeting a niche such as AI, HPC and/or high-end embedded such as automotive.
And you can bet that Qualcomm has RISC-V designs in-house, but is only making ARM chips right now because ARM is where the market for smartphone and desktop SoCs is. Once Google starts allowing RVA23 on Android / ChromeOS, the flood gates will open.
It's very much both. You need millions of dollars for the fab, but you also need ~5 years to get 3 generations of cpus out (to fix all the performance bugs you find in the first two)
Fast, RVA23-compatible microarchitectures already exist. Everything high performance seems to be based on RVA23, which is the current application profile and comparable to ARMv9 and x86-64v4.
However, it takes time from microarchitecture to chips, and from chips to products on shelves.
The very first RVA23-compatible chips to show up will likely be the spacemiT K3 SoC, due in development boards in April (i.e. next month).
More of them, more performant, such as a development board with the Tenstorrent Ascalon CPU in the form of the Atlantis SoC, which was taped out recently, are coming this summer.
It is even possible such designs will show up in products aimed at the general public within the present year.
> Don't blame the ISA - blame the silicon implementations
That's true, but tautological.
The issue is that the RISC-V core is the easy part of the problem, and nobody seems to even be able to generate a chip that gets that right without weirdness and quirks.
The more fundamental technical problem is that things like the cache organization and DDR interface and PCI interface and ... cannot just be synthesized. They require analog/RF VLSI designers doing things like clock forwarding and signal integrity analysis. If you get them wrong, your performance tanks, and, so far, everybody has gotten them wrong in various ways.
The business problem is the fact that everybody wants to be the "performance" RISC-V vendor, but nobody wants to be the "embedded" RISC-V vendor. This is a problem because practically anybody who is willing to cough up for a "performance" processor is almost completely insensitive to any cost premium that ARM demands. The embedded space is hugely sensitive to cost, but nobody is willing to step into it because that requires that you do icky ecosystem things like marketing, software, debugging tools, inventory distribution, etc.
This leads to the US business problem which is the fact that everybody wants to be an IP vendor and nobody wants to ship a damn chip. Consequently, if I want actual RISC-V hardware, I'm stuck dealing with Chinese vendors of various levels of dodginess.
A pattern I've noticed for a very long time:
A lot of times the path to the highest performing CPU seems to be to optimize for power first, then speed, then repeat. That's because power and heat are a major design constraint that limits speed.
I first noticed this way back with the Pentium 4 "Netburst" architecture vs. the smaller x86 cores that became the ancestor of the Core architecture. Intel eventually ran into a wall with P4 and then branched high performance cores off those lower-power ones and that's what gave us the venerable Core architecture that made Intel the dominant CPU maker for over a decade.
ARM's history is another example.
I think the story is a bit more complicated. Core succeeded precisely because Intel had both the low-power experience with Pentium-M and the high-power experience with Netburst. The P4 architecture told them a lot about what was and wasn't viable and at what complexity. When you look at the successor generations from Core, what you see are a lot of more complex P4-like features being re-added, but with the benefits of improved microarch and fab processes. Obviously we will never know, but I don't think you would get to Haswell or Skylake in the form they were without the learning experience of the P4.
In comparison, I think Arm is actually a very strong cautionary tale that focusing on power will not get you to performance. Arm processors remained pretty poor performance until designers from other CPU families entirely (PowerPC and Intel) took it on at Apple and basically dragged Arm to the performance level they are today.
> In comparison, I think Arm is actually a very strong cautionary tale that focusing on power will not get you to performance.
Hugely underappreciated. Someone involved fully understood that "you don't get to the moon by climbing progressively taller trees".
The other two times Arm had great performance were the StrongArm, when it was implemented by DEC people off the Alpha project, and the initial ones, which were quite esoteric and unusually suited to the situation of the late 80s.
And not just any PowerPC architects either, but the people from PA Semi. Motorola couldn't get the speed up and IBM couldn't get the power down.
NetBurst was supposed to be the application of RISC principles to x86 taken to its extreme (ultra-long pipelines to reduce clock-to-clock delay, highest clock speed possible --- basically reducing work-per-clock and hoping that reduces complexity enough to increase clock speed to compensate.) The ALU was 16 bits, "double pumped" with the carry split between the two, which led to 32-bit ALU operations that don't carry between the lower and upper halves actually finishing a clock cycle faster than those with a carry.
https://stackoverflow.com/questions/45066299/was-there-a-p4-...
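The "double pumped" idea can be pictured with a toy model: a 32-bit add done as two 16-bit adds, where the upper half needs the carry out of the lower half. This is only an arithmetic illustration of the description above, not a model of the real NetBurst datapath or its timing; `add32_double_pumped` is a made-up name.

```rust
/// Adds two 32-bit values as two 16-bit halves.
/// Returns the wrapped 32-bit sum and whether a carry crossed from the
/// low half into the high half.
pub fn add32_double_pumped(a: u32, b: u32) -> (u32, bool) {
    // First pass: low 16 bits.
    let lo = (a & 0xFFFF) + (b & 0xFFFF);
    let carry = lo >> 16; // 0 or 1
    // Second pass: high 16 bits plus the carry from the first pass.
    let hi = (a >> 16) + (b >> 16) + carry;
    (((hi & 0xFFFF) << 16) | (lo & 0xFFFF), carry != 0)
}
```

When the returned flag is false, the high half does not depend on the low half's result at all.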
Core evolved from the Banias (Centrino) CPU core which was based on P3, not P4. Banias used the front-side bus from P4 but not the cores.
Banias was hyper optimized for power, the mantra was to get done quickly and go to sleep to save power. Somewhere along the line someone said "hey what happens if we don't go to sleep?" and Core was born.
I don’t have a micro architecture background so I apologize if this is obvious — What do power and speed mean in this context?
Power - how many Watts does it need? Speed - how quickly can it perform operations?
One could say "Optimize for efficiency first, then performance".
Parallels to code design, where optimizing data or code size can end up having fantastic performance benefits (sometimes).
There's the ARM video from LowSpecGamer, where they talk about how they forgot to connect power to the chip, and it was still executing code anyway. According to Steve Furber, the chip was accidentally being powered from the protection diodes alone. So ARM was incredibly power efficient from the very beginning.
Marcin is working with us on RISC-V enablement for Fedora and RHEL, he's well aware of the problem with current implementations. We're hopeful that this'll be pretty much resolved by the end of the year.
If he expects it to be resolved by the end of the year (and I agree it likely will be), why is he writing a post like this?
Is this because Fedora 44 is going to beta?
Because I can.
Is that a good enough answer?
> AND the software with no architecture-specific optimisations
The optimizations that'd be applied to ARM and MIPS would be equally applicable to RISC-V. I do not believe this is a lack of software optimization issue.
We are well past the days where hand written assembly gives much benefit, and modern compilers like gcc and llvm do nearly identical work right up until it comes to instruction emissions (including determining where SIMD instructions could be placed).
Unless these chips have very very weird performance characteristics (like the weirdness around x86's lea instruction being used for arithmetic) there's just not going to be a lot of missed heuristics.
> The optimizations that'd be applied to ARM and MIPS would be equally applicable to RISC-V.
There's no carry bit, and no widening multiply (or MAC).
RISC-V splits widening multiply out into two instructions: one for the high bits and one for the low. Just like 64-bit ARM does.
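In source code the high/low split looks roughly like this: compute the full 128-bit product and take each 64-bit half separately, which is the pattern a compiler can turn into one "low" and one "high" multiply. A sketch only; `mul_wide_u64` is a hypothetical helper name.

```rust
/// Full 64x64 -> 128-bit unsigned multiply, returned as (low, high) halves.
pub fn mul_wide_u64(a: u64, b: u64) -> (u64, u64) {
    let full = (a as u128) * (b as u128); // cannot overflow in u128
    (full as u64, (full >> 64) as u64)
}
```

Code that only needs the low half, which is the common case, pays for just one of the two multiplies.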
Integer MAC doesn't exist, and is also hindered by a design decision not to require more than two source operands, so as to allow simple implementations to stay simple. The same reason also prevents RISC-V from having a true conditional move instruction: there is one but the second operand is hard-coded zero.
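Assuming the instruction meant here is the Zicond pair (`czero.eqz` / `czero.nez`), a general select can still be composed from two of them plus an OR. The functions below are a software model of those semantics as I understand them, written only to show the composition; they are not intrinsics and the names are made up.

```rust
/// Model of `czero.eqz`: result is 0 if `cond` is zero, else `value`.
pub fn czero_eqz(value: u64, cond: u64) -> u64 {
    if cond == 0 { 0 } else { value }
}

/// Model of `czero.nez`: result is 0 if `cond` is non-zero, else `value`.
pub fn czero_nez(value: u64, cond: u64) -> u64 {
    if cond != 0 { 0 } else { value }
}

/// `cond != 0 ? if_nonzero : if_zero`, built from the two models above.
/// Exactly one of the two terms is zeroed, so OR picks the other one.
pub fn select(cond: u64, if_nonzero: u64, if_zero: u64) -> u64 {
    czero_eqz(if_nonzero, cond) | czero_nez(if_zero, cond)
}
```

So a full conditional move costs three two-source instructions instead of one three-source instruction, which is the trade-off being described.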
FMAC exists, but only because it is in the IEEE 754 spec ... and it requires significant op-code space.
[flagged]
The things you are talking about are taken care of by out of order execution and the CPU itself being smart about how it executes. Putting in prefetch instructions rarely beats the actual prefetcher itself. Compilers didn't end up generating perfect pentium asm either. OOO execution is what changed the game in not needing perfect compiler output any more.
While true, it's typically not going to be impactful on system performance.
There's a reason, for example, why the linux distros all target a generic x86 architecture rather than a specific architecture.
If you care to read the article, they indeed do not blame the architecture but the available silicon implementations.
I did read it. A Banana Pi is not the fastest developer platform. The title is misleading.
BTW, it's quite impressive how the s390x is so fast per core compared to the others. I mean, of course it's fast - we all knew that.
And don't let IBM legal see that this can be considered a published benchmark, because they are very shy about s390x performance numbers.
> A Banana Pi is not the fastest developer platform.
What is the current fastest platform that isn’t exorbitantly expensive? Not upcoming releases, but something I can actually buy.
I check in every 3-6 months but the situation hasn’t changed significantly yet.
I was really surprised by the s390x performance, but I also don't really understand why build times are listed by architecture, not by the actual processors.
>I did read it. A Banana Pi is not the fastest developer platform. The title is misleading.
Ironically, its SoC (spacemiT K1) is slower than the JH7110 used in the first mass-produced RISC-V SBC, VisionFive 2.
But unlike JH7110, it has vector 1.0, making it a very popular target.
Of course, none of these pre-RVA23 boards will be relevant anymore, once the first development boards with RVA23-compatible K3 ship next month.
These are also much faster than anything RISC-V currently purchasable. Developers have been playing with them for months through ssh access.
Which risc-v implementation is considered fast?
I keep checking in on Tenstorrent every few months thinking Keller is going to rock our world... losing hope.
At this point the most likely place for truly competitive RISC-V to appear is China.
Tenstorrent is supposedly taping out 8-wide Ascalon processors as we speak, with devboards projected to be available in Q2/Q3 this year.
BTW. Keller is also on the board of AheadComputing — founded by former Intel engineers behind the fabled "Royal Core".
> At this point the most likely place for fast RISC-V to appear is China.
Or we just adopt Loongson.
But they didn't reflect that in a title like "current RISC-V silicon Is Sloooow" ...
Then how do you justify the title?
If you make a spec that the wider industry cannot effectively implement into quality products, it's the spec that's wrong. And that's true for anything - whether it's RISC-V, ipv6, Matter, USB-C and so on.
That's what makes writing specs hard - you need people who understand implementation challenges at the table, not dreaming architects and academics.
RISC-V lacks a bunch of really useful relatively easy to implement instructions and most extensions are truly optional so you can't rely on them. That's the problem if you let a bunch of academics turn your ISA into a paper mill.
In theory you can spend a lot of effort to make a flawed ISA perform, but it will be neither easy nor pretty e.g. real world Linux distros can't distribute optimised packages for every uarch from dual-issue in-order RV64GC to 8-wide OoO RV64 with all the bells and whistles. Only in (deeply) embedded systems can you retarget the toolchain and optimise for each damn architecture subset you encounter.
ARM was never a "speed demon"; it started out as a low power small-area core and clearly had more complexity and thought put into it than MIPS or RISC-V.
Over a decade ago: https://news.ycombinator.com/item?id=8235120
> RISC-V will get there, eventually.
Strong doubt. Those of us who were around in the 90s might remember how much hype there was with MIPS.
I don’t think you remember, but the first Archimedes smoked the just-launched Compaq 386s with a dedicated 387 coprocessor.
It was not designed to be one, but it ended up being surprisingly fast.