Comment by markhahn
2 days ago
understood: big facilities get shared; sharing requires arbitration and queueing.
an interesting angle on 50 threads and 256G: your data is probably pretty cool (cache-friendly). if your threads are merely HT, that's only 25 real cores, and might be only a single socket, implying probably <100 GB/s of memory bandwidth. so a best-case touch-all-memory pass over 256 GB would take several seconds; for non-sequential access patterns, effective rates would be much lower, leaving cores even less busy.
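the arithmetic above can be sketched in a few lines; the 256 GB / 100 GB/s figures are the thread's assumptions, and the 64 B line / 8 B useful ratio for random access is an illustrative guess, not a measurement:

```python
# Back-of-envelope: time to touch all resident memory once.
mem_gb = 256        # resident data, GB (assumed from the thread)
bw_gb_s = 100       # optimistic single-socket bandwidth, GB/s (assumed)

best_case_s = mem_gb / bw_gb_s   # sequential streaming, best case

# Non-sequential access: each miss pulls a 64 B cache line but might
# use only ~8 B of it, cutting effective bandwidth roughly 8x.
line_bytes, useful_bytes = 64, 8
random_s = best_case_s * (line_bytes / useful_bytes)

print(f"best case: {best_case_s:.2f} s, random-ish: {random_s:.2f} s")
# -> best case: 2.56 s, random-ish: 20.48 s
```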
so cache-friendliness is really the determining factor in this context. I wonder how much these packages are oriented toward cache tuning; it affects basic strategy, such as how filtering is implemented in an expression graph...
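as a hypothetical illustration of how that strategy choice might look (these function names are made up, not from any particular package): a filter node can evaluate its predicate row-at-a-time, hopping between heterogeneous records, or column-at-a-time, streaming one contiguous column to build a selection mask, which is the cache-friendlier layout:

```python
# Two ways a query engine might implement a filter node.
# All names here are illustrative sketches, not a real package's API.

n = 100_000
rows = [(i, i * 2.0) for i in range(n)]       # row-oriented storage
col_a = [r[0] for r in rows]                  # column-oriented storage
col_b = [r[1] for r in rows]

def filter_rows(rows, pred):
    """Row-at-a-time: predicate touches each full record (poor locality)."""
    return [r for r in rows if pred(r)]

def filter_columns(col_a, col_b, threshold):
    """Column-at-a-time: sequential scan of one column builds a mask,
    then survivors are gathered from each column."""
    mask = [a > threshold for a in col_a]
    return ([a for a, m in zip(col_a, mask) if m],
            [b for b, m in zip(col_b, mask) if m])

sel_a, sel_b = filter_columns(col_a, col_b, 99_990)
print(len(sel_a))   # -> 9  (values 99_991 .. 99_999 survive)
```

pure-Python lists box every element, so this only sketches the access pattern; a real engine would use contiguous arrays where the columnar scan's sequential prefetching actually pays off.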