Comment by geerlingguy

20 hours ago

I came into the post thinking it would be running a VM through the slow tinygrad driver... but this is much, much better.

It'd be amazing if Apple would provide better support and allow more than that 1.5 GB window to make this easier. Arm overall has some quirks with PCIe devices, but at least on Linux it's gotten much easier, since most modern drivers now treat arm64 as a first-class citizen.

I don't know for sure, but I suspect what makes the tinygrad stuff slow isn't the macOS host driver itself. I think they're doing something very similar to what I'm doing: mapping the PCI BARs into userspace, and then driving the GPU from a bunch of Python code.
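For anyone unfamiliar with what "mapping the PCI BARs into userspace" looks like, here's a minimal sketch of the Linux flavor of that technique, using the sysfs resource files. The device address and register offset are made up for illustration, and I can't speak to how tinygrad or the macOS side does this exactly; on macOS you'd go through a host driver rather than sysfs.

```python
# Minimal sketch: mmap a PCI BAR into userspace via Linux sysfs.
# The PCI address (0000:01:00.0) and register offset are hypothetical;
# real register layouts come from the GPU vendor's headers.
import mmap
import os
import struct

BDF = "0000:01:00.0"  # hypothetical bus/device/function
BAR = 0               # BAR0 usually holds a GPU's register aperture

path = f"/sys/bus/pci/devices/{BDF}/resource{BAR}"
size = os.stat(path).st_size  # sysfs reports the BAR size as the file size

fd = os.open(path, os.O_RDWR | os.O_SYNC)
try:
    # Reads/writes through this mapping become MMIO accesses to the device.
    bar = mmap.mmap(fd, size)
    # Read a 32-bit little-endian register at a hypothetical offset 0x0.
    (reg,) = struct.unpack_from("<I", bar, 0x0)
    print(f"BAR{BAR} is {size:#x} bytes; reg[0x0] = {reg:#010x}")
    bar.close()
finally:
    os.close(fd)
```

(Needs root, and the driver that normally owns the device has to be unbound first; everything above the mmap is bookkeeping, the interesting part is that register access is just memory access.)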

This is only speculation, but I think the big thing that makes tinygrad slow is that its inference engine hasn't really been optimized for all these open LLM models; most of that work has probably gone toward optimizing the stack for George's self-driving hardware company. And since you can't just run the existing CUDA kernels on their engine, that makes things a lot tougher, engineering-wise.

I'm actually curious whether my project could share a macOS host driver with them. I think it would need some changes, but it seems like there's a lot of overlap.