Comment by markhahn

2 days ago

understood: big facilities get shared; sharing requires arbitration and queueing.

an interesting angle on 50 threads and 256 GB: your data is probably pretty cool (cache-friendly). if your threads are merely HT, that's only 25 real cores, and might be only a single socket, implying probably <100 GB/s of memory bandwidth. so a best-case touch-all-memory pass (256 GB at ~100 GB/s) would take several seconds. for non-sequential access patterns, effective rates would be much lower, leaving cores even less busy.
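to put rough numbers on that, a back-of-envelope sketch in Python (the bandwidth and latency figures are assumed round numbers, not measurements):

```python
# best case: one sequential pass over all of memory
mem_gb = 256            # total memory to touch
bw_gbps = 100           # optimistic single-socket sequential bandwidth, GB/s
seq_seconds = mem_gb / bw_gbps
print(seq_seconds)      # a few seconds even in the ideal case

# worst case: fully random access, roughly one 64 B cache line per ~100 ns miss
line_bytes = 64
miss_seconds = 100e-9
rand_seconds = (mem_gb * 1e9 / line_bytes) * miss_seconds
print(rand_seconds)     # orders of magnitude slower than sequential
```

the two-orders-of-magnitude gap between the sequential and random estimates is the whole argument for cache-friendly layout.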

so cache-friendliness is really the determining factor in this context. I wonder how much these packages are oriented towards cache tuning. it affects basic strategy, such as how filtering is implemented in an expression graph...
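to make that last point concrete, here is a toy Python sketch of two filtering strategies an expression graph could choose between. the function names and the chunk size are invented for illustration, not taken from any particular package:

```python
data = list(range(1_000_000))

def filter_whole_pass(xs):
    # strategy A: materialize a full boolean mask, then gather.
    # two complete passes over the data -> roughly double the memory traffic.
    mask = [x % 3 == 0 for x in xs]
    return [x for x, m in zip(xs, mask) if m]

def filter_chunked(xs, chunk=64_000):
    # strategy B: fuse predicate evaluation and gather per cache-sized chunk,
    # so each element is filtered while it is still hot in cache.
    out = []
    for i in range(0, len(xs), chunk):
        out.extend(x for x in xs[i:i + chunk] if x % 3 == 0)
    return out

# both strategies compute the same result; they differ only in access pattern
assert filter_whole_pass(data) == filter_chunked(data)
```

on bandwidth-bound data the fused, chunked form is the cache-friendly one, which is exactly the kind of basic strategy choice the comment is pointing at.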