Comment by KingOfCoders

8 months ago

The website claims "automatically achieves near-ideal speedup"

12x for 16x threads

51x for 16,000x threads

Can someone point me to a website where it explains that this is the "ideal speedup"? Is there a formula?
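For what it's worth, "ideal speedup" usually means linear speedup: S(p) = T(1) / T(p) = p on p threads, with parallel efficiency E = S(p) / p. A quick sketch of the standard textbook definitions (these formulas are general, not from Bend's site; the 12x-on-16-threads number is the one quoted above):

```python
def speedup(t_serial, t_parallel):
    """Speedup S(p) = T(1) / T(p); ideal (linear) speedup is S(p) == p."""
    return t_serial / t_parallel

def efficiency(s, p):
    """Parallel efficiency E = S(p) / p; 1.0 means ideal speedup."""
    return s / p

def amdahl(f, p):
    """Amdahl's law: best possible speedup on p workers when a
    fraction f of the work is inherently serial."""
    return 1 / (f + (1 - f) / p)

# The figure quoted above: 12x speedup on 16 threads.
print(efficiency(12, 16))   # 0.75, i.e. 75% of ideal
```

By those definitions, 12x on 16 threads is 75% of ideal, and 51x on 16,000 threads is nowhere near linear — which is presumably why the phrasing raised the question.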

Bend is intriguing --

1. Some potentially useful perspectives:

* Weak scaling vs strong scaling: https://www.kth.se/blogs/pdc/2018/11/scalability-strong-and-... ?

* Strong scaling, especially against a modern sequential baseline, seems to be where folks are noting the author still has work to do with respect to reaching ideal speedups on the metrics performance people actually care about

* There are parallel models of computation like PRAM for describing asymptotically idealized speedups of trickier aspects of parallel code like heap usage. Bend currently seems to do a lot of allocations that someone writing in most parallel systems wouldn't, and the slowdowns would show up in these models, e.g., as asymptotically many unnecessary heap/stack data movements. There are a lot of these models (NUMA, network topology, etc.), which are useful for being precise when making ideal-speedup claims. ("Assume everyone is a sphere and...")
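To make the strong-vs-weak distinction above concrete, here's a minimal sketch. The timings are made up for illustration, not Bend measurements:

```python
def strong_scaling_speedup(t1, tp):
    """Strong scaling: SAME total problem size on 1 vs p workers.
    t1 = time on 1 worker, tp = time on p workers; ideal is p."""
    return t1 / tp

def weak_scaling_efficiency(t1, tp):
    """Weak scaling: per-worker problem size held constant as workers
    are added; ideal is tp == t1, i.e. efficiency 1.0."""
    return t1 / tp

# Hypothetical timings in seconds (not Bend measurements):
print(strong_scaling_speedup(100.0, 8.3))   # ~12x on 16 workers -> sublinear
print(weak_scaling_efficiency(10.0, 12.5))  # 0.8 -> 80% weak-scaling efficiency
```

Weak-scaling numbers are much easier to make look good, which is why "near-ideal speedup" claims are worth pinning to one of the two definitions.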

2. The comparisons I'd really like to see are:

* cudf, heavy.ai: how does it compare to high-level Python dataframe and SQL systems that already run on GPUs? How is perf, and what programs do you want people to be able to write that they currently cannot?

* Halide and other more general-purpose languages that compile to GPUs, which seem closer to where Bend is going

FWIW, it's totally fine to compare to other languages.

Instead of showing it beating everything everywhere, or claiming ideal speedups with no comparisons, show where it is strong vs. weak compared to others on different tasks, especially progression across releases (bend1 vs 2 vs ...), and let folks decide. There is some subset of tasks you already care about, so separate those out and show the quality you get on them, so people know what the happy path looks like when you care. The rest becomes "if you need these, stay clear for now and check in again later; we know and are tracking them." Being clear that wall-clock time can be slow and performance per watt can be wasteful is OK: you are looking for early adopters, not OpenAI core engineers.

Those two figures are CPU vs GPU, respectively.

A GPU core (shading unit) is 100x weaker than a CPU core, thus the difference.

On the GPU, HVM's performance scales almost 16,000x with 16,000 cores. Thus the "near-ideal speedup".
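A back-of-envelope sketch of how those two claims fit together, using the comment's own rough numbers (the 100x per-core gap and 16,000 cores are the estimates from this thread, not measurements):

```python
def effective_speedup_vs_one_cpu_core(n_gpu_cores, gpu_core_rel_perf, scaling_eff):
    """Rough model: speedup over a SINGLE CPU core, where each GPU core
    runs at gpu_core_rel_perf of a CPU core's speed and the program
    achieves scaling_eff of linear scaling across the GPU cores."""
    return n_gpu_cores * gpu_core_rel_perf * scaling_eff

# 16,000 GPU cores, each ~1/100 the speed of a CPU core,
# near-linear (ideal) scaling on the GPU:
print(effective_speedup_vs_one_cpu_core(16_000, 1 / 100, 1.0))  # 160.0
```

So scaling can be near-ideal relative to one GPU thread while the wall-clock win over a CPU core stays far below the core count, which is the distinction being drawn here.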

Not everyone knows how GPUs work, so we should have been more clear about that!