
Comment by Twirrim

8 months ago

For what it's worth, I ported the sum example to pure python.

    def sum(depth, x):
        if depth == 0:
            return x
        else:
            fst = sum(depth-1, x*2+0) # adds the fst half
            snd = sum(depth-1, x*2+1) # adds the snd half
            return fst + snd

    print(sum(30, 0))

Under pypy3 it executes in 0m4.478s, single-threaded. Under Python 3.12, it executes in 1m42.148s, again single-threaded. I mention that because you include benchmark information:

    CPU, Apple M3 Max, 1 thread: 3.5 minutes
    CPU, Apple M3 Max, 16 threads: 10.26 seconds
    GPU, NVIDIA RTX 4090, 32k threads: 1.88 seconds

The Bend single-threaded version has been running for 42 minutes on my laptop, is consuming 6GB of memory, and still hasn't finished (12th Gen Intel(R) Core(TM) i7-1270P, Ubuntu 24.04). That seems to be an incredibly slow interpreter. Has this been tested or developed on anything other than Macs / aarch64?

I appreciate this is early days, but it's hard to get excited about what seems to be incredibly slow performance from a really simple example you give. If the simple stuff is slow, what does that mean for the complicated stuff?

If I get a chance tonight, I'll re-run it with `-s` argument, see if I get anything helpful.

Running for 42 minutes is most likely a bug. Yes, we haven't done much testing outside of M3 Max yet. I'm aware it is 2x slower on non-Apple CPUs. We'll work on that.

For the `sum` example, Bend has a huge disadvantage, because it is allocating 2 IC nodes for each numeric operation, while Python is not. This is obviously terribly inefficient. We'll avoid that soon (just like HVM1 did); it just hasn't been implemented in HVM2 yet.
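
As a rough back-of-envelope of what that costs for `sum(30, 0)` (a sketch; treating each `x*2+k` as two numeric ops and ignoring the `depth` arithmetic):

    # The call tree for sum(30, 0) has 2**30 - 1 internal calls. Each
    # performs 5 numeric ops on x (x*2, +0, x*2, +1, fst + snd), and
    # each op currently allocates 2 IC nodes.
    internal_calls = 2 ** 30 - 1
    ops_per_call = 5
    print(f"~{2 * ops_per_call * internal_calls:,} IC node allocations")
    # prints: ~10,737,418,230 IC node allocations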

Note most of the work behind Bend went into making the parallel evaluator correct. Running closures and unrestricted recursion on GPUs is extremely hard. We've just finished that part, so there has been basically zero effort on micro-optimizations so far. HVM2's codegen is still abysmal. (And I was very clear about that in the docs!)

That said, please try comparing the Bitonic Sort example, where both are doing the same number of allocations. I think it will give a much fairer idea of how Bend will perform in practice. HVM1 used to be 3x slower than GHC on a single core, which isn't bad. HVM2 should get to that point in the not-too-distant future.

Now, I totally acknowledge these "this is still bad but we promise it will get better!!" claims can be underwhelming, and I understand if you don't believe my words. But I actually believe that, with the foundation set, these micro-optimizations will be the easiest part, and performance will skyrocket from here. In any case, we'll keep working on making it better, and reporting the progress as milestones are reached.

  • > it is allocating 2 IC nodes for each numeric operation, while Python is not

    While that's true, Python would be using big integers (PyLongObject) for most of the computations, meaning every number gets allocated on the heap.
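
    A quick way to see this in CPython (a sketch; the exact small-int cache range is an implementation detail):

        # CPython caches small ints (-5..256); results outside that range
        # are freshly allocated PyLongObject values on the heap.
        n = 1000
        print((n + n) is (n + n))  # False: each 2000 is a new heap object
        m = 100
        print((m + m) is (m + m))  # True: 200 comes from the cache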

    If we use a Python implementation that would avoid this, like PyPy or Cython, the results change significantly:

        % cat sum.py 
        def sum(depth, x):
            if depth == 0:
                return x
            else:
                fst = sum(depth-1, x*2+0) # adds the fst half
                snd = sum(depth-1, x*2+1) # adds the snd half
            return fst + snd
    
        if __name__ == '__main__':
            print(sum(30, 0))
    
        % time pypy sum.py
        576460751766552576
        pypy sum.py  4.26s user 0.06s system 96% cpu 4.464 total
    

    That's on an M2 Pro. I also imagine the result in Bend would not be correct, since it only supports 24-bit integers, meaning it'd overflow quite quickly when summing up to 2^30, is that right?
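
    Incidentally, assuming wrap-around semantics, a 24-bit accumulator would end at the true sum modulo 2^24, which for this particular benchmark happens to be zero:

        # The true result is sum(range(2**30)) = 2**29 * (2**30 - 1).
        # Addition mod 2**24 is a homomorphism, so a wrapping 24-bit
        # accumulator ends at true_sum % 2**24, and 2**29 (hence the
        # whole product) is a multiple of 2**24.
        true_sum = 2 ** 29 * (2 ** 30 - 1)  # 576460751766552576
        print(true_sum % 2 ** 24)           # 0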

    [Edit: just noticed the previous comment had already mentioned pypy]

    > I'm aware it is 2x slower on non-Apple CPUs.

    Do you know why? As far as I can tell, HVM has no aarch64/Apple-specific code. Could it be because Apple Silicon has wider decode blocks?

    > can be underwhelming, and I understand if you don't believe on my words

    I don't think anyone wants to rain on your parade, but extraordinary claims require extraordinary evidence.

    The work you've done in Bend and HVM sounds impressive, but I feel the benchmarks need more evaluation/scrutiny. Since your main competitor would be Mojo and not Python, comparisons to Mojo would be nice as well.

    • The only claim I made is that it scales linearly with cores. Nothing else!

      I'm personally putting a LOT of effort into making our claims as accurate and truthful as possible, in every single place. Documentation, website, demos. I spent hours in meetings to make sure everything is correct. Yet, sometimes it feels that no matter how much effort I put in, people will just find ways to misinterpret it.

      We published the real benchmarks, checked and double checked. And then you complained some benchmarks are not so good. Which we acknowledged, and provided causes, and how we plan to address them. And then you said the benchmarks need more evaluation? How does that make sense in the context of them being underwhelming?

      We're not going to compare to Mojo or other languages, specifically because it generates hate.

      Our only claim is:

      HVM2 is the first version of our Interaction Combinator evaluator that runs with linear speedup on GPUs. Running closures on GPUs required a colossal amount of correctness work, and we're reporting this milestone. Moreover, we finally managed to compile a Python-like language to it. That is all that is being claimed, and nothing else. The codegen is still abysmal and single-core performance is bad - that's our next focus. If anything else was claimed, it wasn't us!


  • Bitonic sort runs in 0m2.035s. Transpiled to C and compiled, it takes 0m0.425s.

    That sum example, transpiled to C and compiled, takes 1m12.704s, so it looks like it's just the VM case that is having serious issues of some description!

I have no dog in this fight, but feel compelled to defend the authors here. Recursion does not test compute; rather, it tests the compiler's/interpreter's efficiency at standing up and tearing down the call stack.

Clearly this language is positioned at using the GPU for compute-heavy applications, and it's still in its early stages. Recursion is not the target application and should not be a relevant benchmark.

  • [flagged]

    • Okay, no. I know I called out performance in my post, but that was just from my observations. It surprised me to see something be that much slower than pure Python. If you show me a near-Python code example in a new language, as someone who mostly writes Python code, I'm going to go and write it in Python and see how it compares performance-wise.

      The authors never made any kind of false claims at all. You're reading a lot into both their README and my post.

      They've updated the README for a bit of clarity, but even re-reading the README as it was when I looked this morning (and even a few versions from before), it never claimed to be fast. The claims are all related to the features that it does have, around parallelisation.


"Thread" term in GPUs and CPUs mean different things, it's more like a SIMD lane in GPUs. A bit like ISPC can compile your code so there's eg 32 invocations of your function per CPU thread running on the same time (if you're using 16-bit datums on AVX512), and you could have 2048 executions going on after multiplying up 32 cores * 2 SMT threads/core * 32 compiler executions.

Python is really bad at recursion (part of why it's not appropriate for functional programming), so perhaps an unfair benchmark?

A Pythonic implementation would use loops and mutation.
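
For illustration, a minimal sketch of such a version (assuming the goal is the same result; the recursion just sums the leaves 0 .. 2^30 - 1 of the call tree):

    # The recursive sum(30, 0) enumerates the leaves 0 .. 2**30 - 1 of a
    # depth-30 binary tree, so an iterative version is a loop over that
    # range (or the closed form n * (n - 1) // 2).
    def sum_leaves(depth):
        total = 0
        for x in range(2 ** depth):
            total += x
        return total

    print(sum_leaves(30))  # 576460751766552576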

Why `+0`? Is this not a pointless no-op?

  • Yes, but when looking at the source, it's more obvious that this is a repeating pattern.

    "Hey, I'm accessing the 0th element here, just want to make that clear"

    Without the +0, that statement looks disconnected from the +1 even though conceptually it's the same.

    Say somebody adds some special marker/tombstone/whatever into element 0, and now all those additions need to be bumped up by one. Someone else may go and see the +1, +2, +3 and just change them to +2, +3, +4, etc., while completely missing the lone variable by itself, as it's visually dissimilar.

    I've usually seen it used in longer lists of statements. It also keeps everything lined up formatting-wise.
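
    For illustration, the kind of repeating pattern being described (hypothetical field offsets):

        # With "+ 0" the repeating offset pattern stays visually aligned,
        # and a later re-numbering touches every line uniformly.
        data = [0xFF, 12, 0b101]  # hypothetical packed record
        base = 0
        header = data[base + 0]
        length = data[base + 1]
        flags  = data[base + 2]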