Comment by kragen

11 days ago

Latency is the killer, I think. A GC can be on the order of 100 instructions.

It’s a concurrent GC. Latency won’t kill you

I’ll admit that if you are in the business of counting instructions then other things in Fil-C will kill you. Most of the overhead is from pointer chasing.

See https://fil-c.org/invisicaps

  • "Concurrent" doesn't usually mean "bounded in worst-case execution time", especially on a uniprocessor. Does it in this case?

    InvisiCaps sound unbelievably amazing. Even CHERI hasn't managed to preserve pointer size.

    • > "Concurrent" doesn't usually mean "bounded in worst-case execution time", especially on a uniprocessor. Does it in this case?

      Meh. I was in the real time GC game for a while, when I was younger. Nobody agrees on what it really means to bound the worst case. If you're a flight software engineer, it means one thing. If you're a game developer, it means something else entirely. And if you're working on the audio stack specifically, it means yet another thing (somewhere in between game and flight).

      So let me put it this way, using the game-audio-flight framework:

      - Games: I bound worst case execution time, just assuming a fair enough OS scheduler, even on uniprocessor.

      - Audio: I bound worst case execution time if you have multiple cores.

      - Flight: I don't bound worst case execution time. Your plane crashes and everyone is dead

    • > "Concurrent" doesn't usually mean "bounded in worst-case execution time"

      Sure, though this is also true for ordinary serial code, with all the intricate interactions between the OS scheduler, different caches, filesystem, networking, etc.

  • For embedded use cases, it can definitely kill you. Small microcontrollers frequently execute a given instruction stream with constant IPC, and you regularly see simple for loops used for timing.
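The cycle-counted busy-wait the comment alludes to looks roughly like this. A minimal sketch, assuming a part where each loop iteration costs a fixed cycle count; the constant is a made-up placeholder that would have to be measured on the actual target:

```c
#include <stdint.h>

/* Hypothetical busy-wait delay for a small MCU with constant IPC.
   The volatile counter keeps the compiler from deleting the loop.
   CYCLES_PER_ITER is an assumption for illustration; on real hardware
   you measure it per chip and per compiler flags. */
#define CYCLES_PER_ITER 3

void delay_cycles(uint32_t cycles)
{
    volatile uint32_t n = cycles / CYCLES_PER_ITER;
    while (n--) {
        /* empty: the decrement and compare themselves burn the cycles */
    }
}
```

This only works because of the constant-IPC property: on a superscalar, out-of-order core the same loop's wall-clock time is far less predictable.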

  • There are tricks to improve the performance of pointer chasing on modern uarchs (cf. Go's Green Tea GC). You want to batch the address calculation/load, the dereference/load, and subsequent dependent ops like marking. Reorder buffers and load-store buffers are pretty big these days, so anything that breaks the addr→load→do dependency chain is a huge win, especially in the traversal loop itself.
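The batching idea above can be sketched as follows. This is an illustration of the technique, not Go's actual Green Tea code; the object layout and names are invented. Splitting the loop into a load phase and a dependent-work phase lets the CPU overlap the independent loads instead of serializing addr→load→mark per object:

```c
#include <stddef.h>

enum { BATCH = 8 };

/* Toy heap object: a mark bit and a single outgoing pointer. */
struct obj {
    unsigned char marked;
    struct obj *child;
};

/* Mark a batch of objects (n <= BATCH) and collect their unmarked
   children for the next round, returning how many were collected. */
static size_t mark_batch(struct obj **in, size_t n, struct obj **out)
{
    struct obj *kids[BATCH];
    size_t m = 0;

    /* Phase 1: issue all the loads; none depends on another, so the
       load-store buffers can keep several misses in flight at once. */
    for (size_t i = 0; i < n; i++)
        kids[i] = in[i]->child;

    /* Phase 2: the dependent work (marking, filtering) runs after the
       loads have been issued. */
    for (size_t i = 0; i < n; i++) {
        in[i]->marked = 1;
        if (kids[i] && !kids[i]->marked)
            out[m++] = kids[i];
    }
    return m;
}
```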

In the fast case, allocation can be vastly cheaper than malloc: usually just a pointer decrement and a compare. You need to ensure that the fast path never triggers a collection of the minor heap, which can be done if you're careful. I hate that this comparison is always made as if malloc/free were completely cost-free primitives.
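The "pointer decrement and compare" fast path looks roughly like this. A minimal sketch with invented names; the slow path here is a placeholder for whatever refills or collects the minor heap, and alignment handling is omitted:

```c
#include <stddef.h>
#include <stdint.h>

struct nursery {
    uintptr_t ptr;   /* current allocation pointer, grows downward */
    uintptr_t limit; /* bottom of the nursery */
};

/* Placeholder slow path: real code would collect or refill the minor
   heap and retry. Returning NULL keeps the sketch self-contained. */
static void *nursery_slow_alloc(struct nursery *n, size_t size)
{
    (void)n; (void)size;
    return NULL;
}

/* Fast path: one decrement, one compare. Assumes size is small
   relative to the nursery (no overflow handling in this sketch). */
static inline void *nursery_alloc(struct nursery *n, size_t size)
{
    uintptr_t p = n->ptr - size;  /* decrement */
    if (p < n->limit)             /* compare   */
        return nursery_slow_alloc(n, size);
    n->ptr = p;
    return (void *)p;
}
```

The point of the comparison in the comment: malloc's fast path typically involves size-class lookup, free-list manipulation, and often a lock or atomic, whereas this is a handful of instructions.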

  • I agree, and I've written an allocator in C that works that way. The fast path is about 5 clock cycles on common superscalar processors, which is about 7–10× faster than malloc: http://canonical.org/~kragen/sw/dev3/kmregion.h

    This is bottlenecked on memory access that is challenging to avoid in C. You could speed it up by at least 2× with some compiler support, and maybe even without it, but I haven't figured out how. Do you have any ideas?

    Typically, though, when you are trying to do WCET analysis, as you know, you try to avoid any dynamic allocation in the time-sensitive part of the program. After all, if completing a computation after a deadline would cause a motor to catch fire or something, you definitely don't want to abort the computation entirely with an out-of-memory exception!

    Some garbage collectors can satisfy this requirement just by not interfering with code that doesn't allocate, but typically not concurrent ones.
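The preallocation discipline described above reduces to a simple pattern: size everything at init, and make the time-sensitive loop touch only fixed storage with bounded work. A sketch with illustrative names, not any particular RTOS's API:

```c
#include <stddef.h>

#define MAX_SAMPLES 256

/* All storage is sized up front; nothing on the hot path allocates. */
struct controller {
    double samples[MAX_SAMPLES];
    size_t count;
};

/* Hot path: no malloc, no GC-visible allocation, constant-bounded
   work, and a deadline-safe error return instead of an OOM abort. */
static int record_sample(struct controller *c, double v)
{
    if (c->count >= MAX_SAMPLES)
        return -1;
    c->samples[c->count++] = v;
    return 0;
}
```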