
Comment by lionkor

2 days ago

You know this from? Any sources? I'd love to learn more because it would be one of the very few industries that still write assembly by hand extensively enough to warrant hiring experts on just that.

My source is that I work on this at a non-frontier lab, and I also interviewed with that team.

  • Okay that's fascinating. Can you share what kind of things require this? Where are compilers and extensive profiling not enough? Is it just very hot tight loops, or larger routines? Is it for CPU or GPU?

    • Taking a step back: I think a lot of people have a misunderstanding of this space. Despite what the "coolest, baddest hackers" on social media might have you believe, performance engineers do not think about assembly in the sense of writing it by hand all day. They most certainly know how to do so, and sometimes they end up having to do it themselves, but the goal is always specific workloads and how to make them run as fast as possible, with as little work as possible. If I could have Claude take my model and spit out a perfectly fused kernel for it that I knew was correct and hit 99% MFU I would just use that (well, actually I would probably retire at that point).

      Until that happens this remains an unsolved problem, so my job is to take the description of what needs to be done and find which code is on the cold setup path and can just be some PyTorch or whatever the ML researchers can write themselves, and also which part of the algorithm is where all the FLOPs are. As things get more performance critical and run more often, I look at the code more and more closely. In the core of the hottest kernels, where most of the work happens, I might be placing individual instructions by hand, or going even below that and thinking about cache behavior or power characteristics.
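
      To make the "where are all the FLOPs" step concrete, here is a rough sketch that ranks the ops of one transformer-style layer by estimated FLOPs. The op names, shapes, and the per-element cost assumed for layernorm are all made up for illustration; a real workload would need a profiler to confirm the ranking.

```python
def matmul_flops(m, k, n):
    # An (m x k) @ (k x n) matmul does m*k*n multiply-adds, i.e. 2*m*k*n FLOPs.
    return 2 * m * k * n

TOKENS, HIDDEN = 4096, 4096  # hypothetical batch*sequence size and model width

# Hypothetical ops in one layer, with estimated FLOPs for each.
op_flops = {
    "qkv_proj": matmul_flops(TOKENS, HIDDEN, 3 * HIDDEN),
    "attn_out": matmul_flops(TOKENS, HIDDEN, HIDDEN),
    "mlp_up": matmul_flops(TOKENS, HIDDEN, 4 * HIDDEN),
    "mlp_down": matmul_flops(TOKENS, 4 * HIDDEN, HIDDEN),
    "layernorm": 8 * TOKENS * HIDDEN,  # assumed ~8 FLOPs per element
}

total = sum(op_flops.values())
# Fraction of the layer's FLOPs spent in each op, largest first.
ranked = sorted(
    ((name, flops / total) for name, flops in op_flops.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, share in ranked:
    print(f"{name:10s} {share:8.4%}")
```

      With these assumed shapes the matmuls account for essentially all of the FLOPs, so they are the candidates for hand-tuned kernels, while something like layernorm is at most worth fusing. FLOP counts ignore memory traffic, so this is only a first pass.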

      A good performance engineer is capable of doing this while also being able to find places where they can automate this process. And there are a lot of things you can automate: layouts, schedules, pipelines. There's a lot of work going into compilers and profilers for all kinds of accelerators. Some of these operate on the "assembly" but there are all kinds of assembly. Some of these tools do almost everything for you; some are a very thin layer over the code they generate. You can see this in the interview that was linked above: it's an assembly optimization task, but you will get better results (in the time provided, at least) if you do compiler-like things. IIRC the assembler already operates on named values, and in my submission I extended the instruction selection algorithm to pack bundles based on hazards.
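
      As an illustration of what "packing bundles based on hazards" can look like, here is a minimal greedy sketch for a made-up VLIW-style machine. It is not the interview task's assembler or the submission described above. The `Instr` type, the engine names, and the slot limits are hypothetical, and it assumes every named value is written exactly once, so only read-after-write hazards need tracking.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    op: str
    engine: str                  # hypothetical functional unit: "load" or "alu"
    dst: Optional[str]           # named value produced, if any
    srcs: Tuple[str, ...] = ()   # named values consumed

SLOTS = {"load": 1, "alu": 2}    # hypothetical per-bundle slot limits

def pack_bundles(program, slots=SLOTS):
    """Greedily place each instruction in the earliest bundle that has a free
    slot on its engine and comes after the bundles producing its sources."""
    bundles = []    # bundles[i] is the list of instructions issued together
    used = []       # used[i][engine] is the number of slots taken in bundle i
    ready_at = {}   # value name -> first bundle index allowed to read it
    for ins in program:
        # RAW hazard: a value can only be read in a bundle after its producer.
        i = max((ready_at.get(s, 0) for s in ins.srcs), default=0)
        while True:
            if i == len(bundles):
                bundles.append([])
                used.append({})
            if used[i].get(ins.engine, 0) < slots[ins.engine]:
                break
            i += 1  # structural hazard: engine slots full, try the next bundle
        bundles[i].append(ins)
        used[i][ins.engine] = used[i].get(ins.engine, 0) + 1
        if ins.dst is not None:
            ready_at[ins.dst] = i + 1
    return bundles

program = [
    Instr("load_a", "load", "a"),
    Instr("load_b", "load", "b"),
    Instr("add_c", "alu", "c", ("a", "b")),
    Instr("mul_d", "alu", "d", ("a", "a")),
    Instr("add_e", "alu", "e", ("c", "d")),
]
bundles = pack_bundles(program)
for idx, bundle in enumerate(bundles):
    print(idx, [ins.op for ins in bundle])
```

      Here `mul_d` depends only on `a`, so it is hoisted into the same bundle as `load_b`, and five instructions fit in four bundles. A real scheduler would also account for instruction latencies, register pressure, and write-after-read or write-after-write hazards when names are reused.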