
Comment by jacquesm

9 days ago

I have a really neat idea to improve message-passing speed in QNX: simply use the paging mechanism to send the message. That means no copying of the data at all, just a couple of page table updates. You still have the double TSS load overhead (vs. a single TSS load in a monolithic kernel), but that is pretty quick.
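In rough C, the page-flip could look something like the sketch below. This is illustration only: the struct and helper names (struct task, pte_clear, pte_set, tlb_flush_range) are invented, and a real version needs proper TLB shootdown and error handling.

    /* Hypothetical kernel-side sketch: "send" a page-aligned message by
       moving its page table entries from sender to receiver instead of
       copying the bytes. */
    int ipc_send_pages(struct task *src, struct task *dst,
                       vaddr_t src_va, vaddr_t dst_va, size_t len)
    {
        if ((src_va | dst_va | len) & (PAGE_SIZE - 1))
            return -EINVAL;                  /* must be page-aligned */

        for (size_t off = 0; off < len; off += PAGE_SIZE) {
            pte_t pte = pte_clear(src->pgdir, src_va + off); /* unmap from sender */
            pte_set(dst->pgdir, dst_va + off, pte);          /* map into receiver */
        }
        tlb_flush_range(src, src_va, len);   /* invalidate stale sender TLB entries */
        return 0;
    }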

But you are right that there is a price for elegance. It becomes an easier choice to make when you factor in things like latency and long-term reliability / stability / correctness; those can weigh much more heavily than mere throughput.

This is sort of what Mach does with "out-of-line" messages: https://web.mit.edu/darwin/src/modules/xnu/osfmk/man/mach_ms... https://dmcyk.xyz/post/xnu_ipc_iii_ool_data/

(this is used under the hood on macOS: NSXPCConnection -> libxpc -> MIG -> Mach messages)
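For reference, sending out-of-line data through the raw Mach API looks roughly like this (error handling elided; see the links above for the full details):

    #include <mach/mach.h>
    #include <string.h>

    /* A "complex" Mach message carrying one out-of-line descriptor.
       The kernel remaps (copy-on-write) the buffer instead of copying it. */
    typedef struct {
        mach_msg_header_t         header;
        mach_msg_body_t           body;
        mach_msg_ool_descriptor_t ool;
    } ool_msg_t;

    kern_return_t send_ool(mach_port_t dest, void *buf, size_t len)
    {
        ool_msg_t msg;
        memset(&msg, 0, sizeof(msg));

        msg.header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0)
                             | MACH_MSGH_BITS_COMPLEX;
        msg.header.msgh_size        = sizeof(msg);
        msg.header.msgh_remote_port = dest;
        msg.header.msgh_local_port  = MACH_PORT_NULL;
        msg.body.msgh_descriptor_count = 1;

        msg.ool.address    = buf;
        msg.ool.size       = (mach_msg_size_t)len;
        msg.ool.deallocate = FALSE;                 /* keep our own mapping */
        msg.ool.copy       = MACH_MSG_VIRTUAL_COPY; /* remap, don't memcpy  */
        msg.ool.type       = MACH_MSG_OOL_DESCRIPTOR;

        return mach_msg(&msg.header, MACH_SEND_MSG, sizeof(msg), 0,
                        MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    }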

  • Mach has always been a very interesting project. It doesn't surprise me at all to see that they already have this, but at the same time I was not aware of it, so thank you. It also more or less confirms that this may well be an avenue worth pursuing.

    • I learned of the idea from some paper or other on Barrelfish, a research multikernel OS (its capability system is derived from seL4's). Barrelfish is underrated! Aside from its take on kernel architecture, it has interesting nuggets on other aspects of OS design, such as using declarative techniques for device management.

I haven't seen it implemented anywhere, but that sounds like the "pagetable displacement" approach described here: https://wiki.osdev.org/IPC_Data_Copying_methods#Pagetable_di...

The same idea occurred to me a while ago too, which is how I originally found that link :)

  • How performant is that in practice? I thought updating page mappings was a fairly expensive operation. Using a statically mapped circular buffer makes more sense to me, at least.

    Disclaimer: I don't actually know what I'm talking about, lol

    • To be clear, since the other replies to you don't seem to be mentioning it, the major costs of MMU page-based virtual memory are never about setting the page metadata. In any instance of remapping, TLB shootdowns and subsequent misses hurt. Page remapping is still very useful for large buffers, and other costs can be controlled based on intended usage, but smaller buffers should use other methods.

      (Of course I'm being vague about the cutoff for "large" and "smaller" buffers. Always benchmark!)
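      One way to find the cutoff on a given machine; a rough userspace proxy on Linux that times a 2 MB memcpy against an mremap of the same region (this stays within one address space, so it understates the cross-process shootdown cost):

        #define _GNU_SOURCE        /* for mremap() */
        #include <stdio.h>
        #include <string.h>
        #include <time.h>
        #include <sys/mman.h>

        #define SZ (2u << 20)      /* 2 MB */

        static double now_us(void) {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
        }

        int main(void) {
            char *src = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            char *dst = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            char *spare = mmap(NULL, SZ, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            memset(src, 1, SZ);    /* fault the pages in first */
            memset(dst, 1, SZ);

            double t0 = now_us();
            memcpy(dst, src, SZ);              /* move the bytes */
            double t1 = now_us();
            char *moved = mremap(src, SZ, SZ,  /* move the mapping */
                                 MREMAP_MAYMOVE | MREMAP_FIXED, spare);
            double t2 = now_us();

            printf("memcpy %.1f us, mremap %.1f us (%p)\n",
                   t1 - t0, t2 - t1, (void *)moved);
        }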

    • You can pretty reliably do it on the order of 1 us on a modern desktop processor. If you remap a level-2 mapping table entry of, say, 2 MB, that is a transfer speed on the order of 2 TB/s, or ~32x the RAM bandwidth of a single core, even if you only move that one entry. If you transfer several in one go, or use a level-3 entry of 1 GB, that would be 1 PB/s: ~16,000x single-core RAM bandwidth, or ~200x the full memory bandwidth of an entire H200 GPU.
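      Back-of-envelope check of those numbers (assuming ~1 us per remap, ~64 GB/s of single-core RAM bandwidth, and ~4.8 TB/s of H200 HBM bandwidth):

        2 MB / 1 us = 2 TB/s  ->  2,000 / 64     ≈ 31x single-core RAM
        1 GB / 1 us = 1 PB/s  ->  1,000,000 / 64 ≈ 16,000x single-core RAM
                                  1,000 / 4.8    ≈ 208x H200 HBM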

    • Pretty quick, far faster than an inter-process memory copy. The only way to be sure is to set it up and measure, but on a 486/33 I could do this ~200K times per second; on modern systems it should be a lot faster than that, more so if the process(es) do not use FP (less context to save and restore on the switch). But I never actually tried setting up, say, a /dev/null implementation that used this; it would be an interesting experiment.

Passing the PTE sounds great for big messages (send/recv).

For small messages (open), the userspace malloc will have packed several small buffers into a single page, so there's a chance you'd need to copy the message out to a fresh page first anyway; at that point the two plain copies might work out better.
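Quick way to see why: consecutive small allocations typically land on the same 4 KB page, so neither buffer can be donated wholesale without first copying it out:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        char *a = malloc(64), *b = malloc(64);
        /* With a typical allocator both live on the same 4 KB page. */
        printf("same page: %d\n",
               (int)(((uintptr_t)a >> 12) == ((uintptr_t)b >> 12)));
        return 0;
    }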

  • The throughput limitation is really only an issue for big messages; for smaller ones the per-message processing overhead will dominate.

The QNX call to do that is mmap().
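i.e., keep the bulk data in a shared mapping set up with shm_open()/mmap() and pass only a small handle through MsgSend(). A rough sketch (error handling elided; the handle struct is made up):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/neutrino.h>
    #include <unistd.h>

    /* Only the name and length travel through MsgSend();
       the payload lives in the shared object. */
    struct handle { char name[32]; size_t len; };

    int send_big(int coid, const void *data, size_t len)
    {
        struct handle h = { "/bigmsg", len };
        int fd = shm_open(h.name, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, (off_t)len);
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        memcpy(p, data, len);   /* one copy into the shared region */
        close(fd);
        return MsgSend(coid, &h, sizeof(h), NULL, 0);
    }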

  • Yes, I know. But I rolled my own QNX clone and figured it would be neat to do this transparently rather than requiring the application to code it up explicitly. That puts some constraints on where messages can be located, though, and that's an interesting problem to solve if you want to do it entirely without overhead.

    • I have a general distaste for transparent policies, which I always find fall short for some use case. In this case, the sender knows best what to do with its message. Moreover, for small buffers, page remapping won't be an optimization. I'd recommend exposing this as an alternative, explicit send interface instead, as sketched below.
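      Something like this, say (a hypothetical signature, not an existing API):

        #include <sys/types.h>

        /* Let the sender pick the transfer strategy explicitly. */
        ssize_t msg_send(int chan, const void *buf, size_t len, unsigned flags);

        #define MSG_COPY   0u   /* default: copy; wins for small buffers      */
        #define MSG_REMAP  1u   /* donate pages; wins for large, page-aligned */
                                /* buffers                                    */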

      The lower a transparent policy sits in the OS, the more it contorts the system. Even mechanisms necessarily constrain policy, if only slightly. I strongly believe that microkernels are only improved by adhering ever closer to true minimality. If backwards compatibility is important, put the policy in a library. Transparent policies are generally advisable only when user feedback shows a clear benefit.
