Comment by gdwatson
1 day ago
Superscalar processors (which include all mainstream ones these days) do this within a single core, provided there are no data dependencies between the assignment statements. They have multiple arithmetic logic units, and they can start a second operation while the first is executing.
But yeah, I agree that we were promised a lot more automatic multithreading than we got. History has proven that we should be wary of any promises that depend on a Sufficiently Smart Compiler.
Eh, in this case not splitting them up to compute them in parallel is the smartest thing to do. Locking overhead alone is going to dwarf every other cost involved in that computation.
Yeah, I think the dream was more like, “The compiler looks at a map or filter operation and figures out whether it’s worth the overhead to parallelize it automatically.” And that turns out to be pretty hard, with potentially painful (and nondeterministic!) consequences for failure.
Maybe it would have been easier if CPU performance didn’t end up outstripping memory performance so much, or if cache coherency between cores weren’t so difficult.
Spawning threads or using a thread pool implicitly would be pretty bad - it would be difficult to reason about performance if the compiler was to make these choices for you.
I think it has shaken out the way it has, is because compile time optimizations to this extent require knowing runtime constraints/data at compile time. Which for non-trivial situations is impossible, as the code will be run with too many different types of input data, with too many different cache sizes, etc.
The CPU has better visibility into the actual runtime situation, so can do runtime optimization better.
In some ways, it’s like a bytecode/JVM type situation.
2 replies →
I think you’re fixating on the very specific example. Imagine if instead of 2 + 2 it was multiplying arrays of large matrices. The compiler or runtime would be smart enough to figure out if it’s worth dispatching the parallelism or not for you. Basically auto vectorisation but for parallelism
Notably - in most cases, there is no way the compiler can know which of these scenarios are going to happen at compile time.
At runtime, the CPU can figure it out though, eh?
6 replies →