Comment by vincent-manis
2 days ago
This is why the old-fashioned university course on assembly language is still useful. Writing assembly language (preferably for a less-complex architecture, so the student doesn't get bogged down in minutiae) gives one a gut feeling for how machines work. Running the program on a simulator that optionally pays attention to pipelining and cache misses can help a person understand these issues.
It doesn't matter what architecture one studies, or even a hypothetical one. The last significant application I wrote in assembler was for System/370, some 40 years ago. Yet CPU ISAs of today are not really that different, conceptually.
> Yet CPU ISAs of today are not really that different, conceptually.
For CPUs, true.
For GPUs, no. It's not even the instructions themselves that differ most; I'd suggest studying up on how GPU loads/stores work.
GPUs have fundamentally changed how loads/stores work. Yes, it's a SIMD load (a.k.a. a gather operation), which has been around since the 80s. But the routing of that data involves highly optimized broadcast patterns and/or butterfly routing or crossbars (which allow an arbitrary shuffle within log2(n) stages).
Load(same memory location) across GPU threads (or SIMD lanes) compiles to a single broadcast.
Load(consecutive memory location) across consecutive SIMD lanes is also efficient.
Load(arbitrary) is doable but slower. The crossbar will be taxed.
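The three cases above can be sketched in a few lines of plain Python (not GPU code) — a toy classifier that looks at one warp's per-lane addresses, assuming a hypothetical 4-byte element size and one address per lane:

```python
ELEM_SIZE = 4  # assumed element size in bytes

def classify_load(addrs):
    """Classify a warp's load addresses: broadcast, coalesced, or gather."""
    if len(set(addrs)) == 1:
        return "broadcast"       # every lane reads the same word
    strides = [b - a for a, b in zip(addrs, addrs[1:])]
    if all(s == ELEM_SIZE for s in strides):
        return "coalesced"       # consecutive lanes hit consecutive words
    return "gather"              # arbitrary pattern; this taxes the crossbar

lanes = range(32)
print(classify_load([0x1000] * 32))                              # broadcast
print(classify_load([0x1000 + 4 * i for i in lanes]))            # coalesced
print(classify_load([0x1000 + 4 * ((i * 7) % 32) for i in lanes]))  # gather
```

Real hardware recognizes more patterns than this (strided, swizzled, partially coalesced), but the broadcast/coalesced/gather split is the part that shows up first in profiles.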
Do you have any good resources that go into detail on GPU ISAs or GPU architecture? There's certainly a lot available for CPUs, but the resources I’ve found for GPUs mostly focus on how they differ from CPUs and how their ISAs are tailored to the GPU's specific goals.
Unfortunately this is a topic that isn't open enough, and architectures change rather quickly so you're always chasing the rabbit. That being said:
The RDNA architecture slides (a few gens old) have some breadcrumbs: https://gpuopen.com/download/RDNA_Architecture_public.pdf
AMD also publishes its ISAs, but I don't think you'll be able to extract much from a reference-style document: https://gpuopen.com/amd-gpu-architecture-programming-documen...
Books on CUDA/HIP also go into some detail about the underlying architecture. Some slides from NV:
https://gfxcourses.stanford.edu/cs149/fall21content/media/gp...
Edit: I should say that Apple also publishes decent stuff. See the link here and the stuff linked at the bottom of the page. But note that now you're in UMA/TBDR territory; discrete GPUs work considerably differently: https://developer.apple.com/videos/play/wwdc2020/10602/
If anyone has more suggestions, please share.
I assume most people learn microarchitecture for performance reasons.
At which point, the question you are really asking is what aspects of assembly are important for performance.
Answer: there are multiple GPU matrix-multiplication examples covering channels (especially channel conflicts), load/store alignment, memory movement, and more. Those should cover the issue I talked about earlier.
Optimization guides help. I know it's 10+ years old, but I think AMD's OpenCL optimization guide was easy to read and follow, and it's still modern enough to cover most of today's architectures.
Beyond that, you'll have to watch conference talks about DirectX 12's newer instructions (wave instructions, ballot/voting, etc.) and their performance implications.
It's a mixed bag; everyone knows one or two optimization tricks, but learning all of them requires a lot of study.
Branch Education apparently decapped and scanned a GA102 (Nvidia 30 series) for the following video: https://www.youtube.com/watch?v=h9Z4oGN89MU. The beginning is very basic, but the content ramps up quickly.
ISAs have not changed, sure. Microarchitectures are completely different and basically no school is going to teach you anything useful for that.
I don't think we had out-of-order designs with speculative execution 40 years ago? That seems like a pretty huge change.
These are mostly internal implementation details; instructions still appear to resolve in order from the outside (with some subtle exceptions for memory reads/writes, depending on the CPU architecture). It may become important to know such details for performance profiling, though.
What has drastically changed is that you cannot do trivial 'cycle counting' anymore.
Not to step on your toes, but it should be said that instructions in a CPU "retire" in order.
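That retire-in-order rule is easy to see in a toy reorder-buffer model (plain Python, not any real microarchitecture): instructions may *complete* out of order, but they only *retire* — become architecturally visible — in program order.

```python
def retire_in_order(program_order, completion_order):
    """Toy ROB: completion may be out of order; retirement is in order."""
    done = set()
    retired = []
    head = 0  # oldest not-yet-retired instruction
    for completed in completion_order:
        done.add(completed)
        # Retire from the ROB head as long as the oldest instruction is done.
        while head < len(program_order) and program_order[head] in done:
            retired.append(program_order[head])
            head += 1
    return retired

# i2 finishes first (say, a cache hit) while i1 waits on a miss,
# yet retirement still comes out i1, i2, i3.
print(retire_in_order(["i1", "i2", "i3"], ["i2", "i1", "i3"]))
```

This is also why you can't cycle-count anymore: the latency of any one instruction depends on what else is in flight, not just on the instruction itself.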
Did you teach the UBC CS systems programming course in 1985?
Intro CompE class does a good bit for mechanical sympathy as well.