Comment by sylware

2 days ago

Another area of AMD GPU R&D is _userland_ _hardware_ [ring] buffers for near-direct userland programming of the hardware.

They have started experimenting with this in Mesa and Linux ("user queues", a.k.a. "user hardware queues").

I don't know how they will work around the scarce VM IDs, but here we are talking about a near-zero driver. Obviously, they will have to simplify/clean up 3D pipeline programming a lot and be very sure of its robustness, basically to have it ready for "default" rendering/usage right away.

Userland will get from the kernel something along these lines: command/event hardware ring buffers, data DMA buffers, a memory page with the read/write pointers & doorbells for those ring buffers, and an event file descriptor for an event ring buffer. Basically, what the kernel currently has.
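
A minimal sketch of what that per-queue handout could look like from userland's side, assuming exactly the resources listed above; the struct and field names are hypothetical, not the actual amdgpu/Mesa "user queue" uAPI:

```c
/* Hypothetical per-queue resources handed out by the kernel at setup time.
 * Names and layout are illustrative only. */
#include <stddef.h>
#include <stdint.h>

struct user_queue {
    /* command ring buffer, GPU-visible, mapped into the process */
    uint32_t *cmd_ring;
    size_t    cmd_ring_dwords;

    /* mapped page holding the ring read/write pointers and the doorbell */
    volatile uint64_t *rptr;      /* hardware-advanced read pointer */
    volatile uint64_t *wptr;      /* userland-advanced write pointer */
    volatile uint32_t *doorbell;  /* written to kick the hardware */

    /* DMA-able data buffers for payloads (vertex data, descriptors, ...) */
    void  *data_buf;
    size_t data_buf_size;

    /* file descriptor the kernel signals for the event ring buffer */
    int event_fd;
};
```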

I wonder if it will provide a significant simplification over the current way, which is giving indirect command buffers to the kernel and dealing with 'sync objects'/barriers.

The NVidia driver also has userland submission (in fact it does not support kernel-mode submission at all). I don't think it leads to a significant simplification of the userland code: basically a driver has to keep track of the same things it would've submitted to an ioctl. If anything, there are some subtleties that require careful consideration.

The major upside is removing the context switch on submission. The idea is that an application only talks to the kernel for queue setup/teardown; everything else happens in userland.
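
As an illustration of that "no context switch" path, here is a hedged sketch of a userland submit: copy the packet into the ring, publish the new write pointer, ring the doorbell. The function, the barrier placement, and the packet handling are simplified assumptions (no free-space check against the read pointer, no real PM4 packet format), not a real driver entry point:

```c
/* Hypothetical kernel-free command submission. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static void submit(uint32_t *ring, size_t ring_dwords,
                   volatile uint64_t *wptr, volatile uint32_t *doorbell,
                   const uint32_t *pkt, size_t pkt_dwords)
{
    uint64_t w = *wptr;

    /* copy the command packet into the ring, wrapping at the end
     * (a real submit would first check free space against the read pointer) */
    for (size_t i = 0; i < pkt_dwords; i++)
        ring[(w + i) % ring_dwords] = pkt[i];

    /* make the packet visible before publishing the new write pointer */
    atomic_thread_fence(memory_order_release);
    *wptr = w + pkt_dwords;

    /* ring the doorbell so the hardware fetches the new commands */
    *doorbell = (uint32_t)(w + pkt_dwords);
}
```

The only kernel involvement left is the initial mapping of the ring, the pointer/doorbell page, and the data buffers at queue setup.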

  • Yep. The future of GPU hardware programming? The one we will have to "standard"-ize à la RISC-V for CPUs?

    The thing is the Vulkan "fences", namely the GPU-to-CPU notifications. Those are probably hardware interrupts which will have to be forwarded by the kernel to userland through an event ring buffer (probably a specific event file descriptor).

    There are alternatives though: userland could poll/spin on some CPU-mapped device memory content for the notification, or we could go one "expensive" step further, which would "efficiently" remove the kernel for good here but would lock a CPU core (should be fine nowadays with our many-core CPUs): something along the lines of a MONITOR machine instruction, basically a CPU core would halt until some memory content is written, with the possibility for another CPU core to un-halt it (namely, spurious un-halting is expected); see the sketch below.

    Does NVidia handle its GPU-to-CPU notifications without the kernel too?
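
    As a rough sketch of that "lock a CPU core" alternative: spin/halt on a CPU-visible fence value with UMONITOR/UMWAIT, the userland cousins of MONITOR/MWAIT (requires an x86 CPU with WAITPKG and compiling with -mwaitpkg). The fence location, the target value, and the TSC deadline are made-up assumptions for illustration:

    ```c
    /* Hypothetical kernel-free wait on a GPU-written fence value. */
    #include <immintrin.h>
    #include <x86intrin.h>
    #include <stdint.h>

    static void wait_fence(volatile uint64_t *fence, uint64_t wanted)
    {
        while (*fence < wanted) {
            /* arm the monitor on the fence cache line... */
            _umonitor((void *)fence);
            /* ...re-check to close the race, then halt until the line is
             * written, the TSC deadline passes, or a spurious wakeup occurs */
            if (*fence >= wanted)
                break;
            _umwait(0 /* C0.2: deeper, slower-to-wake state */,
                    __rdtsc() + 1000000 /* arbitrary deadline */);
        }
    }
    ```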

    • eewww... my bad, we would need a timeout on the CPU core locking to go back to the kernel.

      Well, polling? erk... I guess an event file descriptor is in order, and that NVidia is doing the same.
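
      For the event-file-descriptor route, a minimal sketch of the userland side, assuming the kernel hands out an eventfd-like fd for the event ring buffer (that fd and its read semantics are an assumption here, not a documented interface), with a timeout so a stuck queue goes back to the kernel instead of hanging:

      ```c
      /* Hypothetical wait for a GPU-to-CPU notification with a timeout. */
      #include <poll.h>
      #include <stdint.h>
      #include <unistd.h>

      static int wait_notification(int event_fd, int timeout_ms)
      {
          struct pollfd pfd = { .fd = event_fd, .events = POLLIN };

          int ret = poll(&pfd, 1, timeout_ms);
          if (ret <= 0)
              return ret;          /* 0 = timeout, -1 = error (see errno) */

          uint64_t count;
          /* for an eventfd(2)-style fd, the read drains the signal counter */
          if (read(event_fd, &count, sizeof(count)) != sizeof(count))
              return -1;
          return 1;                /* got a notification */
      }
      ```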