Comment by winwang

1 year ago

Personal opinion: it's the software (and software tooling).

The hardware is good enough (even if we're only talking 10x efficiency). Part of the issue seems slightly cultural, i.e. repeatedly dismissing the idea of traditional task parallelism (as opposed to massive SIMD/data parallelism) on GPUs. Obviously, you'd lose a lot of efficiency if you literally ran 1 thread per warp. But it could be useful for lightly data-parallel tasks (like typical CPU vectorization), or for using warp-wide semantics to implement something like a "software" microcode engine. Dumb example: implementing division as long division built out of multiplications and shifts.
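To make the "dumb example" concrete, here's a minimal sketch of software division via restoring long division, built only from shifts, compares, and subtracts (a multiply-by-approximate-reciprocal variant is the other common approach). The function name `soft_div_u32` is hypothetical; plain C is used for illustration, but the same routine could run per-thread or warp-wide on a GPU:

```c
#include <stdint.h>
#include <assert.h>

// Hypothetical software division: restoring long division using only
// shifts, compares, and subtracts -- the kind of routine a warp-wide
// "software microcode engine" could run where a hardware divide is
// slow or absent.
static uint32_t soft_div_u32(uint32_t n, uint32_t d) {
    assert(d != 0);
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  // bring down the next bit of n
        if (r >= d) {                   // does d fit into the partial remainder?
            r -= d;
            q |= 1u << i;               // set the corresponding quotient bit
        }
    }
    return q;
}
```

On real GPUs this loop has no data-dependent branching beyond a predicated subtract, so all lanes of a warp stay in lockstep, which is exactly why this style of routine maps well onto SIMT hardware.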

Other things a GPU gives you: insanely high memory bandwidth, a programmable cache (shared memory), and (relatively) great atomic operations.