Comment by koyote

18 hours ago

Are there any details on how the round trip and exchange of data (CPU<->GPU) is implemented, so that it doesn't become a big (partially hidden) performance hit?

e.g. this code seems like it would entirely run on the CPU?

    print!("Enter your name: ");
    let _ = std::io::stdout().flush();
    let mut name = String::new();
    std::io::stdin().read_line(&mut name).unwrap();

But what if we concatenated a number that was calculated on the GPU onto the string, or if we take a number as input:

    print!("Enter a number: ");
    [...] // string number has to be converted to a float and sent to the GPU
    // Some calculations with that number performed on the GPU
    println!("The result is: {}", the_result); // Number needs to be sent back to the CPU
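A host-side sketch of that round trip (the `gpu_square` stub is purely hypothetical and stands in for a kernel launch plus the two copies):

```rust
use std::io::Write;

// Hypothetical stand-in for the GPU side: a real implementation would be a
// kernel launch, preceded by a host-to-device copy of `x` and followed by a
// device-to-host copy of the result.
fn gpu_square(x: f32) -> f32 {
    x * x
}

// The full round trip: parse on the CPU, hand off to the "GPU", copy back.
fn round_trip(input: &str) -> f32 {
    let x: f32 = input.trim().parse().unwrap(); // CPU-side string -> float
    gpu_square(x) // GPU-side work; the result must travel back before printing
}

fn main() {
    print!("Enter a number: ");
    let _ = std::io::stdout().flush();

    let mut line = String::new();
    std::io::stdin().read_line(&mut line).unwrap();

    println!("The result is: {}", round_trip(&line));
}
```

Every crossing of the CPU/GPU boundary inside `round_trip` is a copy, which is exactly the partially hidden cost being asked about.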

Or maybe I am misunderstanding how this is supposed to work?

"We leverage APIs like CUDA streams to avoid blocking the GPU while the host processes requests.", so I'm guessing it would let the other GPU threads go about their lives while that one waits for the ACK from the CPU.

I once wrote a prototype async IO runtime for GLSL (https://github.com/kig/glslscript), it used a shared memory buffer and spinlocks. The GPU would write "hey do this" into the IO buffer, then go about doing other stuff until it needed the results, and spinlock to wait for the results to arrive from the CPU. I remember this being a total pain, as you need to be aware of how PCIe DMA works on some level: having your spinlock int written to doesn't mean that the rest of the memory write has finished.
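That handshake can be simulated on the CPU with two threads and Release/Acquire ordering; the names here are invented, and real PCIe/DMA traffic gives weaker guarantees than this model, which is the pain point described above:

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Simulated shared IO buffer: a payload slot plus a "spinlock" flag.
// (Names and layout are made up for illustration.)
struct IoBuffer {
    data: AtomicU64,
    ready: AtomicBool,
}

fn handshake() -> u64 {
    let buf = Arc::new(IoBuffer {
        data: AtomicU64::new(0),
        ready: AtomicBool::new(false),
    });

    // "CPU" side: write the result, *then* publish the flag with Release,
    // so the payload store is ordered before the flag store.
    let cpu = {
        let buf = Arc::clone(&buf);
        thread::spawn(move || {
            buf.data.store(42, Ordering::Relaxed);
            buf.ready.store(true, Ordering::Release);
        })
    };

    // "GPU" side: spin on the flag with Acquire, then read the payload.
    // Without the Release/Acquire pairing, seeing ready == true would not
    // guarantee the data write is visible -- the DMA problem described above.
    while !buf.ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let result = buf.data.load(Ordering::Relaxed);

    cpu.join().unwrap();
    result
}

fn main() {
    println!("got {}", handshake());
}
```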

Why are you assuming that this is intended to be performant, compared to code that properly segregates the CPU- and GPU-side? It seems clear to me that the latter will be a win.

  • I am not assuming it to be performant, but if you use this in earnest and the implementation is naive, you'll quickly have a bad time from all the data being copied back and forth.

    In the end, people program for GPUs not because it's more fun (it's not!), but because they can get more performance out of it for their specific task.