
Comment by twothreeone

2 days ago

While shipping binary kernels may be a workaround for some users, it goes against what many people would consider "good etiquette" for various valid reasons, such as hackability, security, or providing free (as in liberty) software.

Shipping binary artifacts isn't inherently a bad thing - that's what (most) Linux distros do, after all! The important distinction is how that binary package was arrived at. If it's a mystery and the source isn't available, that's bad. If it's all in the open-source repo, part of the Python package build, and completely reproducible, that's great.

  • GP's solution is the one that I (and, I think, most people) ultimately wind up using. But it's nothing like the usual "oh, we'll just compile it" experience of a typical package.

    flash-attn in particular has its build so badly misconfigured, and is so heavy, that it will lock up a modern Zen 5 machine with 128GB of DDR5 if you don't re-nice ninja (assuming, of course, that you remembered it simply won't build without a pip-visible ninja). It can't build a wheel (at least not obviously) that works correctly on both Ampere and Hopper, and it declares its dependencies incorrectly, so it demands torch at build time even when torch is already in your pyproject.toml, and you end up breaking build isolation. (A sketch of the workaround is at the end of this comment.)

    So now you've got your gigabytes of fragile wheel that won't run on half your cards; fine, let's make a wheel registry. Oh, and everything in machine learning needs it: half of diffusers crashes at runtime without it. Runtime.

    The dirty little secret of these $50MM offers at AI companies is that far more people understand the math (which is actually pretty light compared to, say, graduate physics) than can build and run NVIDIA wheels at scale. The wizards Zuckerberg will fellate are the people who know some math and can run Torch on a mixed Hopper/Blackwell fleet.

    And this, I think, is Astral's endgame: pyx is going to fix this at scale, and they're going to abruptly become more troublesome to NVIDIA than George Hotz or GamersNexus.
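    For anyone stuck doing the source build anyway, the workaround looks roughly like this. It's a minimal sketch, not flash-attn's documented procedure: it assumes the MAX_JOBS cap that the build respects, a Unix box, and a torch already installed in the target environment, and the numbers are placeholders for your own machine.

        import os
        import subprocess
        import sys

        # Cap nvcc/ninja parallelism; without this the build will happily
        # eat every core and all of your RAM at once.
        env = dict(os.environ, MAX_JOBS="4")

        # Drop our priority (Unix only) so the compile doesn't lock up the
        # machine; pip, ninja, and nvcc inherit it.
        os.nice(10)

        # ninja has to be pip-visible, or the build falls back to an even
        # slower single-job path.
        subprocess.check_call([sys.executable, "-m", "pip", "install", "ninja"], env=env)

        # --no-build-isolation lets setup.py see the torch you already have,
        # instead of pulling a second (possibly mismatched) torch into an
        # isolated build environment.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "flash-attn", "--no-build-isolation"],
            env=env,
        )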

    • Dumb question from an outsider - why do you think this is so bad? Is it because so much of the ML-adjacent code is written by people with backgrounds in academia and data science instead of software engineering? Or is it just Python being bad at this?


It's not a workaround; it's the sanest way of shipping such software. As long as the builds are reproducible, there's nothing wrong with shipping binaries by default, especially when those binaries require non-trivial dependencies (the whole CUDA toolchain) to build.

There's a reason why, even among the most diehard Linux users, very few run Gentoo and compile their whole system from scratch.

  • I agree with you that binary distribution is a perfectly reasonable adjunct to source distribution, and sometimes even the more sensible one (toolchain size, etc.).

    In this instance the build is way nastier than building the NVIDIA toolchain (which Nix can do with a single line of configuration in most cases), and the binary artifacts are almost as broken as the source artifact because of NVIDIA tensor core generation shenanigans.

    The real answer here is to fucking fork flash-attn and fix it. And it's on my list, but first I'm working my way down the major C++ packages that all that stuff links against. `libmodern-cpp` should be ready for GitHub in two or three months. `hypermodern-ai` is still mostly a domain name and some scripts, but they're the scripts I use in production, so it's coming.

    • I thought about fixing Flash Attention too, so that I don't have to recompile it every time I update Python or PyTorch (it's the only special snowflake dependency that I need to handle manually), but at the end of the day it's not enough of a pain to justify the time investment.

      If I'm going to invest time here, I'd rather just write my own attention kernels and do things Flash Attention currently doesn't (8-bit and 4-bit attention variants similar to Sage Attention), and focus on supporting and optimizing primarily for GeForce and RTX Pro GPUs instead of datacenter GPUs, which are unobtainium for normal people.
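      In the meantime, for code I control, treating flash-attn as optional and falling back to torch's built-in SDPA at least keeps it from being a hard requirement. A rough sketch, assuming the usual flash_attn_func entry point and the (batch, seqlen, nheads, head_dim) layout it documents:

          import torch
          import torch.nn.functional as F

          try:
              # Only usable when the installed wheel matches this torch/CUDA/GPU combo.
              from flash_attn import flash_attn_func
              HAS_FLASH = True
          except ImportError:
              HAS_FLASH = False

          def attention(q, k, v, causal=False):
              """q, k, v: (batch, seqlen, nheads, head_dim); fp16/bf16 on CUDA for the fast path."""
              if HAS_FLASH and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
                  return flash_attn_func(q, k, v, causal=causal)
              # torch's scaled_dot_product_attention wants (batch, nheads, seqlen, head_dim).
              q, k, v = (t.transpose(1, 2) for t in (q, k, v))
              out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
              return out.transpose(1, 2)

      Slower on the fallback path, but it beats crashing at runtime on a card the wheel wasn't built for.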
