← Back to context

Comment by LightMachine

8 months ago

Running on 42 minutes is mots likely a bug. Yes, we haven't done much testing outside of M3 Max yet. I'm aware it is 2x slower on non-Apple CPUs. We'll work on that.

For the `sum` example, Bend has a huge disadvantage, because it is allocating 2 IC nodes for each numeric operation, while Python is not. This is obviously terribly inefficient. We'll avoid that soon (just like HVM1 did it). It just wasn't implemented in HVM2 yet.

Note most of the work behind Bend went into making the parallel evaluator correct. Running closures and unrestricted recursion on GPUs is extremely hard. We've just finished that part, so, there was basically 0 effort into micro-optimizations. HVM2's codegen is still abysmal. (And I was very clear about it on the docs!)

That said, please try comparing the Bitonic Sort example, where both are doing the same amount of allocations. I think it will give a much fairer idea of how Bend will perform in practice. HVM1 used to be 3x slower than GHC in a single core, which isn't bad. HVM2 should get to that point not far in the future.

Now, I totally acknowledge these "this is still bad but we promise it will get better!!" can be underwhelming, and I understand if you don't believe on my words. But I actually believe that, with the foundation set, these micro optimizations will be the easiest part, and performance will skyrocket from here. In any case, we'll keep working on making it better, and reporting the progress as milestones are reached.

> it is allocating 2 IC nodes for each numeric operation, while Python is not

While that's true, Python would be using big integers (PyLongObject) for most of the computations, meaning every number gets allocated on the heap.

If we use a Python implementation that would avoid this, like PyPy or Cython, the results change significantly:

    % cat sum.py 
    def sum(depth, x):
        if depth == 0:
            return x
        else:
            fst = sum(depth-1, x*2+0) # adds the fst half
            snd = sum(depth-1, x*2+1) # adds the snd half
        return fst + snd

    if __name__ == '__main__':
        print(sum(30, 0))

    % time pypy sum.py
    576460751766552576
    pypy sum.py  4.26s user 0.06s system 96% cpu 4.464 total

That's on an M2 Pro. I also imagine the result in Bend would not be correct since it only supports 24 bit integers, meaning it'd overflow quite quickly when summing up to 2^30, is that right?

[Edit: just noticed the previous comment had already mentioned pypy]

> I'm aware it is 2x slower on non-Apple CPUs.

Do you know why? As far as I can tell, HVM has no aarch64/Apple-specific code. Could it be because Apple Silicon has wider decode blocks?

> can be underwhelming, and I understand if you don't believe on my words

I don't think anyone wants to rain on your parade, but extraordinary claims require extraordinary evidence.

The work you've done in Bend and HVM sounds impressive, but I feel the benchmarks need more evaluation/scrutiny. Since your main competitor would be Mojo and not Python, comparisons to Mojo would be nice as well.

  • The only claim I made is that it scales linearly with cores. Nothing else!

    I'm personally putting a LOT of effort to make our claims as accurate and truthful as possible, in every single place. Documentation, website, demos. I spent hours in meetings to make sure everything is correct. Yet, sometimes it feels that no matter how much effort I put, people will just find ways to misinterpret it.

    We published the real benchmarks, checked and double checked. And then you complained some benchmarks are not so good. Which we acknowledged, and provided causes, and how we plan to address them. And then you said the benchmarks need more evaluation? How does that make sense in the context of them being underwhelming?

    We're not going to compare to Mojo or other languages, specifically because it generates hate.

    Our only claim is:

    HVM2 is the first version of our Interaction Combinator evaluator that runs with linear speedup on GPUs. Running closures on GPUs required colossal amount of correctness work, and we're reporting this milestone. Moreover, we finally managed to compile a Python-like language to it. That is all that is being claimed, and nothing else. The codegen is still abysmal and single-core performance is bad - that's our next focus. If anything else was claimed, it wasn't us!

    • > I spent hours in meetings to make sure everything is correct. Yet, sometimes it feels that no matter how much effort I put, people will just find ways to misinterpret it.

      from reply below:

      > I apologize if I got defensive, it is just that I put so much effort on being truthful, double-checking, putting disclaimers everywhere about every possible misinterpretation.

      I just want to say: don't stop. There will always be some people who don't notice or acknowledge the effort to be precise and truthful. But others will. For me, this attitude elevates the project to something I will be watching.

    • That's true, you never mentioned Python or alternatives in your README, I guess I got Mandela'ed from the comments in Hacker News, so my bad on that.

      People are naturally going to compare the timings and function you cite to what's available to the community right now, though, that's the only way we can picture its performance in real-life tasks.

      > Mojo or other languages, specifically because it generates hate

      Mojo launched comparing itself to Python and didn't generate much hate, it seems, but I digress

      In any case, I hope Bend and HVM can continue to improve even further, it's always nice to see projects like those, specially from another Brazilian

      9 replies →

    • > I'm personally putting a LOT of effort to make our claims as accurate and truthful as possible, in every single place

      Thank you. I understand in such an early irritation of a language there are going to be lots of bugs.

      This seems like a very, very cool project and I really hope it or something like it is successful at making utilizing the GPU less cumbersome.

    • Perhaps you can add: "The codegen is still abysmal and single-core performance is bad - that's our next focus." as a disclaimer on the main page/videos/etc. This provides more context about what you claim and also very important what you don't (yet) claim.

      1 reply →

    • > I'm personally putting a LOT of effort to make our claims as accurate and truthful as possible, in every single place.

      I'm not informed enough to comment on the performance but I really like this attitude of not overselling your product but still claiming that you reached a milestone. That's a fine balance to strike and some people will misunderstand because we just do not assume that much nuance – and especially not truth – from marketing statements.

    • Identifying what's parallelizable is valuable in the world of language theory, but pure functional languages are as trivial as it gets, so that research isn't exactly ground-breaking.

      And you're just not fast enough for anyone doing HPC, where the problem is not identifying what can be parallelized, but figuring out to make the most of the hardware, i.e. the codegen.

      1 reply →

    • Naive question: do you expect the linear scaling to hold with those optimisations to single core performance, or would performance diverge from linear there pending further research advancements?

    • I think you were being absolutely precise, but I want to give a tiny bit of constructive feedback anyway:

      In my experience, to not be misunderstood it is more important to understand the state of mind/frame of reference of your audience, than to be utterly precise.

      The problem is, if you have been working on something for a while, it is extremely hard to understand how the world looks to someone who has never seen it (1).

      The second problem is that when you hit a site like hacker News your audience is impossibly diverse, and there isn't any one state of mind.

      When I present research, it always takes many iterations of reflecting on both points to get to a good place.

      (1) https://xkcd.com/2501/

    • I think the issue is that there is the implicit claim that this is faster than some alternative. Otherwise what's the point?

      If you add some disclaimer like "Note: Bend is currently focused on correctness and scaling. On an absolute scale it may still be slower than single threaded Python. We plan to improve the absolute performance soon." then you won't see these comments.

      Also this defensive tone does not come off well:

      > We published the real benchmarks, checked and double checked. And then you complained some benchmarks are not so good. Which we acknowledged, and provided causes, and how we plan to address them. And then you said the benchmarks need more evaluation? How does that make sense in the context of them being underwhelming?

      8 replies →

Bitonic sort runs in 0m2.035s. Transpiled to c and compiled it takes 0m0.425s.

that sum example, transpiled to C and compiled takes 1m12.704s, so it looks like it's just the VM case that is having serious issues of some description!