Comment by umvi
4 hours ago
Is using virtualization the only good way of taking a 288-core box and splitting it up into multiple parallel workloads? One time I rented a 384-core AMD EPYC baremetal VM in GCP and I could not for the life of me get parallelized workloads to scale just using bare-metal Linux. I wanted to run a bunch of CPU inference jobs in parallel (with each one getting 16 cores), but the scaling was atrocious: the more parallel jobs I tried to add, the slower all of them ran. When I checked htop the CPU was very underutilized, so my theory was that there was a memory bottleneck happening somewhere with ONNX/torch (something to do with NUMA nodes?). Anyway, I wasn't able to test using proxmox or vmware on there to split up cpu/memory resources; we decided instead to just buy a bunch of smaller-core-count AMD Ryzen 1Us instead, which scaled way better with my naive approach.
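For anyone wanting to test the NUMA theory without a hypervisor, one option is to pin each job to its own disjoint block of cores. Below is a minimal sketch, not a verified fix for this workload. It assumes Linux, and it assumes core IDs are numbered contiguously within each NUMA node (check `lscpu` before trusting that). `core_blocks` and `pin_to_block` are hypothetical helper names, not part of any library.

```python
import os


def core_blocks(total_cores, cores_per_job):
    """Split core IDs 0..total_cores-1 into disjoint, contiguous blocks.

    Assumption: contiguous core IDs share a NUMA node on this machine.
    Leftover cores that do not fill a whole block are dropped.
    """
    if cores_per_job <= 0:
        raise ValueError("cores_per_job must be positive")
    return [
        set(range(start, start + cores_per_job))
        for start in range(0, total_cores - cores_per_job + 1, cores_per_job)
    ]


def pin_to_block(block):
    """Restrict the calling process to the given core IDs (Linux only)."""
    os.sched_setaffinity(0, block)
    # Cap OpenMP-style thread pools at the block size. This only takes effect
    # if it is set before the inference library initializes its threads.
    os.environ["OMP_NUM_THREADS"] = str(len(block))


if __name__ == "__main__":
    blocks = core_blocks(384, 16)
    print(len(blocks), "jobs of 16 cores each")
```

Each worker process would call `pin_to_block(blocks[i])` before importing the inference library. This only controls where threads run, not where memory is allocated; `numactl --cpunodebind=N --membind=N` is the usual command-line way to bind both.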
They are used for VMs because that load is pretty spiky and usually not that memory-heavy. For running just a single app, lower-core-count but higher-clocked CPUs are usually a better fit.
>Anyway, I wasn't able to test using proxmox or vmware on there to split up cpu/memory resources; we decided instead to just buy a bunch of smaller-core-count AMD Ryzen 1Us instead, which scaled way better with my naive approach.
If that was a single 384-thread CPU (192 cores times 2 for hyperthreading), you are getting "only" 12 DDR5 channels, so one RAM channel is shared by 16c/32t.
So just a plain 16-core desktop Ryzen, with its 2 channels, will have double the memory bandwidth per core.
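The arithmetic behind that, as a quick sketch. It assumes 192 physical cores on the EPYC (the guess above, not a confirmed spec of the rented box) and that each DDR5 channel delivers roughly the same bandwidth on both platforms, which is a simplification. `cores_per_channel` is just an illustrative helper.

```python
def cores_per_channel(cores, channels):
    """Number of physical cores sharing one memory channel."""
    return cores / channels


# Assumed configurations (see the caveats above).
epyc = cores_per_channel(cores=192, channels=12)
ryzen = cores_per_channel(cores=16, channels=2)

# With equal per-channel bandwidth, bandwidth per core scales as 1 / cores_per_channel.
ryzen_advantage = epyc / ryzen

print(f"EPYC:  {epyc:.0f} cores per channel")
print(f"Ryzen: {ryzen:.0f} cores per channel")
print(f"Ryzen bandwidth per core is about {ryzen_advantage:.0f}x the EPYC's")
```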
How did the speed of one or two jobs on the EPYC compare to the Ryzen?
And 384 actual cores or 384 hyperthreading cores?
Inference is so memory bandwidth heavy that my expectations are low. An EPYC getting 12 memory channels instead of 2 only goes so far when it has 24x as many cores.