Comment by alexozer

9 months ago

So am I identifying the bottlenecks that motivate this design correctly?

1. Go FFI is slow

2. Per-proto generated code specialization is slow, because of icache pressure

I know there's more to the optimization story here, but I guess these are the primary motivations for the VM over just better code generation or implementing a parser in non-Go?

I know that Java resisted improving their FFI for years because they preferred that the JIT get the extra resources, and that customers not bail out of Java every time they couldn’t figure out how to make it faster. There’s a case I recall from when HotSpot was still young, where the Java GUI team moved part of the graphics pipeline to the FFI in one release, HotSpot got faster in the next, and then they rolled back the changes because it was now faster without the FFI.

But eventually your compiler is good enough that the FFI is now your bottleneck, and you need to do something.

3. The use case is dynamic schemas and access is through the reflection API. Thus PGO has to be done at runtime...

I keep hearing that Go's C FFI is slow, why is that? How much slower is it in comparison to other languages?

  • Go's goroutines aren't plain C threads (blocking syscalls are magically made async), and Go's stack isn't a normal C stack (it's tiny and grown dynamically).

    A C function won't know how to behave in Go's runtime environment, so to call a C function Go needs to make itself look more like a C program, call the C function, and then restore its magic state.

    Other languages like C++, Rust, and Swift are similar enough to C that they can just call C functions directly. CPython is a C program, so it can too. Golang was brave enough to do fundamental things its own way, which isn't quite C-compatible.

    • > CPython is a C program

      Go (gc) was also a C program originally. It still had the same overhead back then as it does now. The implementation language is immaterial. How things are implemented is what is significant. Go (tinygo), being a different implementation, can call C functions as fast as C can.

      > ...so it can too.

      In my experience, the C FFI overhead in CPython is significantly higher than Go (gc). How are you managing to avoid it?


    • I wonder if they should be using something like libuv to handle this. Instead of flipping state back and forth, create a playground for the C code that looks more like what it expects.

  • Go's threading model involves a lot of tiny (but growable) stacks, and running C functions on them would almost immediately overflow the stack.

    Calling C safely is then slow because you have to allocate a larger stack, copy data around and mess with the GC.

  • > How much slower is it in comparison to other languages?

    It's about the same as most other languages that aren't specifically optimized for C calling. Considerably faster than Python.

    Which is funny as everyone on HN loves to extol the virtues of Python being a "C DSL" and never think twice about its overhead, but as soon as the word Go is mentioned it's like your computer is going to catch fire if you even try.

    Emotion-driven development is a bizarre world.

  • I asked ChatGPT to summarize (granted, my prompt might not be ideal). Here is the first point in detail; the others are in the link at the bottom:

         Calling C from Go (or vice versa) often requires switching from Go's lightweight goroutine model to a full OS thread model because:
           - Go's scheduler manages goroutines on M:N threads, but C doesn't cooperate with Go's scheduler.
           - If C code blocks (e.g., on I/O or mutex), Go must assume the worst and park the thread, spawning another to keep Go alive.
         * Cost: This means entering/exiting cgo is significantly more expensive than a normal Go call. There’s a syscall-like overhead.
    
    

    ... That was only the first issue; it then follows with "Go runtime can't see inside C to know whether it is allocating, blocking, spinning, etc.", then "Stack switching", "Thread Affinity and TLS", "Debug/Profiling support overhead", "Memory Ownership and GC barriers".

    All here - https://chatgpt.com/share/688172c3-9fa4-800a-9b8f-e1252b57d0...

    • Just to roll with your way: https://chatgpt.com/share/688177c9-ebc0-8011-88cc-9514d8e167...

      Please do not take the numbers below at face value. I still expect an actual reply to my initial comment.

      Per-call overhead:

        C (baseline)       - ~30 ns
        Rust (unsafe)      - ~30 ns
        C# (P/Invoke)      - ~30-50 ns
        LuaJIT             - ~30-50 ns
        Go (cgo)           - ~40-60 ns
        Java (22, FFM)     - ~40-70 ns
        Java (JNI)         - ~300-1000 ns
        Perl (XS)          - ~500-1000 ns
        Python (ctypes)    - ~10,000-30,000 ns
        Common Lisp (SBCL) - ~500-1500 ns
      

      Seems like Go is still fast compared to other garbage-collected languages, so I am not sure the criticism is fair to Go.
