Comment by rao-v

13 hours ago

I still don’t understand why we lack a language that will take uncomplicated, computation-heavy code and turn it into SIMD / multi-threaded / multiprocess / GPU code with minimal additional syntax.

Surely this is the sort of thing compiler / language design nerds dream about?

It doesn’t have to guarantee efficiency or provide cutting edge performance in any context … it should just exist!

My understanding is that we can make such a language … but it hasn’t caught the fancy of someone who could do it.

Still a bit early, but I'm working on kiwi, a k-dialect that can lower to Apple MLX.

It currently supports CPU and GPU on macOS, and CPU on Linux.

https://kiwilang.com

https://github.com/kiwi-array-lang/kiwi

Kiwi runs computations on small dense arrays in its own runtime; when they are larger, it lowers to MLX on the CPU, and eventually it will lower to MLX on the GPU when that is worth it.

As a user you don't have to change any code; you just write k.

I'm sure there are other languages designed to take advantage of modern GPUs.

But even with just SIMD you can get quite far with array-oriented code, and many array language implementations will make use of it (BQN, ngn/growler/k, goal; ktye's k has a version with SIMD support, …).
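To make the array-oriented point concrete, here is a minimal NumPy sketch (NumPy standing in for any array runtime; the function names are invented for illustration, this is not kiwi's API). The user writes one whole-array expression, and the runtime's precompiled kernels — with their SIMD inner loops — do the element-level work, no autovectorizer required:

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Scalar style: an interpreted Python loop, one element at a time.
def scale_shift_loop(a):
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = 2.0 * a[i] + 1.0
    return out

# Array style: a single whole-array expression. The runtime dispatches
# it to compiled C kernels whose inner loops can use SIMD.
def scale_shift_array(a):
    return 2.0 * a + 1.0

# Same result either way; only how the work is expressed changes.
assert np.allclose(scale_shift_loop(x[:1000]), scale_shift_array(x[:1000]))
```

This is the array-language bargain in miniature: because the user already expressed the computation as a whole-array operation, the implementation is free to vectorize it without having to prove anything about loop dependencies.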

  • Thanks for sharing, this is neat!

    I’ve yet to find a language that does SIMD / multithreading / GPU with minimal tweaks, let alone multiprocessing.

Both ahead-of-time and JIT compilers often autovectorize tight loops. The problem is that many hot loops are not simple loops; in particular, a lot of source code is written with sequential dependencies that can’t be modeled in SIMD code. And outside of exploiting undefined behavior in C/C++, most compilers will refuse to autovectorize when doing so would very slightly change the behavior of your code (e.g. by reassociating floating-point operations) in a very hard to understand way.
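A toy illustration of that sequential-dependency point, sketched in NumPy rather than any particular compiler's IR (the recurrence function is a made-up example, not from the thread). The elementwise computation has fully independent iterations, while a first-order recurrence carries a value across iterations and cannot be split into independent SIMD lanes as written:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 16)

# Vectorizable: each output depends only on its own input element,
# so iterations are independent and map cleanly onto SIMD lanes.
y_elementwise = x * x + 1.0

# Not vectorizable as-is: y[i] depends on y[i-1], a loop-carried
# dependency. Running the iterations in parallel lanes would read
# y[i-1] before it had been written.
def first_order_recurrence(x, a=0.5):
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = a * y[i - 1] + x[i]
    return y

y_recurrence = first_order_recurrence(x)
```

(Recurrences like this can sometimes be parallelized with a restructured scan algorithm, but that is a different algorithm, not something a compiler can safely substitute without changing floating-point results.)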

  • Surely a high-level language can own the contract of making sane choices about when to autovectorize and when not to (or just autovectorize inefficiently; that is fine too!)

    • That’s like saying “surely a high level language can solve the halting problem.”

      Yes, it can, but only by eliminating the features that make it Turing complete. It’s relatively easy to vectorize a map over a closure that can’t mutate anything, but once you have nontrivial control flow, the compiler can’t make those kinds of assumptions.

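The easy/hard split described above is visible in JAX (`jax.vmap` is the real API; the functions `f` and `g` are invented examples). Mapping a pure closure vectorizes mechanically, while data-dependent Python control flow has to be rewritten branch-free before the compiler can batch it:

```python
import jax
import jax.numpy as jnp

# Easy case: a pure function with no mutation. vmap turns the scalar
# function into a batched (vectorized) one mechanically.
def f(x):
    return 3.0 * x + 1.0

xs = jnp.arange(4.0)
ys = jax.vmap(f)(xs)            # [1., 4., 7., 10.]

# Hard case: data-dependent control flow. Under vmap each lane may
# take a different branch, so a plain Python `if x > 0:` cannot be
# traced; it must be rewritten branch-free, e.g. with jnp.where.
def g(x):
    return jnp.where(x > 0, jnp.sqrt(jnp.abs(x)), 0.0)

zs = jax.vmap(g)(jnp.array([-1.0, 0.0, 4.0]))   # [0., 0., 2.]
```

The rewrite of `g` is exactly the kind of transformation a compiler cannot do for you automatically in general: it evaluates both branches for every lane, which is only valid because neither branch has side effects.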

Intel's ispc (Implicit SPMD Program Compiler) is a compiler for a C superset language that targets CPU SIMD and GPUs.

  • A beautiful find! It’s, what, 12+ years old at this point?

    Definitely the closest thing so far (it doesn’t do multiprocessing), but it does seem to do SIMD / multithreading and GPU auto-parallelization!

    Any idea why it’s so little known?

If you're happy with NumPy's API, then surely JAX is exactly what you're looking for.
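For concreteness, a small sketch of what that looks like (standard `jax.jit` / `jax.numpy` APIs; the toy loss function is invented for illustration). The math is written NumPy-style, `jit` hands the whole function to the XLA compiler, and on a machine with a supported GPU the same code runs there unchanged:

```python
import jax
import jax.numpy as jnp

# Plain math-y code written against the NumPy-like jnp API.
def step(w, x, y):
    pred = jnp.tanh(x @ w)            # forward pass
    loss = jnp.mean((pred - y) ** 2)  # mean squared error
    return loss

# jit compiles the whole function via XLA; with a GPU build of
# jaxlib installed, the identical source runs on the GPU.
fast_step = jax.jit(step)

w = jnp.ones((3,))
x = jnp.eye(3)
y = jnp.zeros((3,))
loss = fast_step(w, x, y)             # ≈ tanh(1)^2
```

The appeal is that the acceleration decision lives in the runtime (which backend JAX was installed with), not in the source code.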

  • JAX can’t do what Numba can do, for example. I just want one way to write simple math-y code like you normally would and have it automagically converted to run via one of the above approaches.

    That’s what compilers and high level languages are supposed to be for!

>I still don’t understand why we lack a language that will take uncomplicated computation heavy code and turn it into SIMD / multi thread / multiprocessing / GPU code with minimal additional syntax.

It already (partly) exists: the D language. By default it's garbage collected (GC), but it can also be programmed without GC, or in a hybrid style. It's modern, backward compatible with C, and included in GCC.

The linear algebra system in D, Mir GLAS, is a standalone BLAS implementation written directly in D. It was already shown to be faster than widely used conventional BLAS libraries like OpenBLAS back in 2016, almost ten years ago [2]!

This popular OpenBLAS [1] includes the Fortran-based LAPACK (yes, you read that right: Fortran), and it is used by almost all data-processing languages today: Matlab, Julia, Rust, and also Mojo.

Interestingly, there is a very-early-stage standalone BLAS implementation written directly in Mojo, namely mojoBLAS, similar in spirit to Mir GLAS; it started very recently [3].

>Surely this is the sort of thing compiler / language design nerds dream about?

You can say that again.

Especially on the GC side of a programming language, since this SIMD / multi-threading / multiprocessing / GPU machinery can be abstracted away there.

Actually, someone recently proposed VGC, a virtualized garbage collector for Python written in C++, for heterogeneous GC [4],[5]. However, the current evaluation excludes JIT compilation, AOT optimization, SIMD acceleration, and GPU offloading.

[1] OpenBLAS:

https://en.wikipedia.org/wiki/OpenBLAS

[2] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen:

http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...

[3] mojoBLAS:

https://github.com/shivasankarka/mojoBLAS

[4] Virtual Garbage Collector (VGC): A Zone-Based Garbage Collection Architecture for Python's Parallel Runtime:

https://arxiv.org/abs/2512.23768

[5] VGC-for-arxiv:

https://github.com/Abdullahlab-n/VGC-for-arxiv

  • I don't think Mojo depends on OpenBLAS or any other BLAS implementation. I remember that they took a lot of pride in the early days in how linalg primitives like matmul, written completely in Mojo, were faster than MKL, OpenBLAS, and other implementations.

  • Delightful, thank you! I'd love to see a version of D that auto-vectorizes to Vulkan or something.