Comment by nialv7

3 months ago

I think that's right: there is no better way than just adding barriers. On Apple hardware it can probably use the hardware TSO memory-ordering mode, but on ordinary ARM64 there's probably nothing else it can do.
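To make "adding barriers" concrete, here is a rough sketch in C11 of what a translator could emit for guest loads and stores. It assumes acquire loads and release stores are a close enough approximation of x86 ordering on ARMv8; a real translator also has to deal with locked instructions, explicit fences, and unaligned or mixed-size accesses. The names `guest_store32`, `guest_load32`, `producer`, and `consumer` are made up for illustration and do not come from any particular emulator.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: an emulated 32-bit guest store.
 * A release store keeps earlier loads and stores ordered before it;
 * ARM64 compilers typically lower this to an STLR instruction. */
static inline void guest_store32(_Atomic uint32_t *addr, uint32_t value) {
    assert(addr != 0);
    atomic_store_explicit(addr, value, memory_order_release);
}

/* Hypothetical helper: an emulated 32-bit guest load.
 * An acquire load keeps later loads and stores ordered after it;
 * ARM64 compilers typically lower this to LDAR (or LDAPR). */
static inline uint32_t guest_load32(_Atomic uint32_t *addr) {
    assert(addr != 0);
    return atomic_load_explicit(addr, memory_order_acquire);
}

/* Message-passing pattern that x86 code often relies on implicitly. */
static _Atomic uint32_t data;
static _Atomic uint32_t flag;

/* Publish a value: write the payload first, then raise the flag. */
void producer(uint32_t value) {
    guest_store32(&data, value);
    guest_store32(&flag, 1);
}

/* Returns 1 and fills *out if the flag is set, otherwise returns 0.
 * With plain (relaxed) accesses on ARM64, the flag could be observed
 * as set while the payload is still stale. */
int consumer(uint32_t *out) {
    if (guest_load32(&flag) == 0) {
        return 0;
    }
    *out = guest_load32(&data);
    return 1;
}
```

The cost being discussed comes from doing this for every guest memory access, since the translator generally cannot tell which accesses are shared between threads.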

There’s one trick: run all of those threads on a single CPU. But that may be slower than using barriers across multiple CPUs, unless the code relies heavily on library code that can be emulated directly and run separately on other CPUs.
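For what it's worth, the single-CPU trick could look something like this on Linux. This is only a sketch: it assumes the glibc `sched_getaffinity`/`sched_setaffinity` interface, and `pin_to_one_cpu` is a hypothetical helper name. Threads created afterwards inherit the mask, so they would all be time-sliced on the same core.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to the first CPU
 * it is currently allowed to run on.
 * Returns the chosen CPU number, or -1 on failure. */
int pin_to_one_cpu(void) {
    cpu_set_t allowed;

    /* pid 0 means "the calling thread". */
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0) {
        return -1;
    }

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            assert(CPU_COUNT(&one) == 1);
            /* Shrink the affinity mask to this one CPU. */
            return sched_setaffinity(0, sizeof one, &one) == 0 ? cpu : -1;
        }
    }
    return -1; /* no allowed CPU found */
}
```

Whether this beats barriers depends on how much parallelism the guest program actually needs, which is the trade-off mentioned above.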