Comment by lazylizard
4 hours ago
I am not an SRE, merely a sysadmin.
And somehow I have the impression that GPUs on Slurm/PBS could not be simpler.
You can use a VM for the head node; you don't even really need the clustering, if you can accept taking 20 minutes to restore a VM. And the rest of the hardware is homogeneous: you set up one node right and the rest are identical.
And it's a cluster with a job queue, so one node going down is not the end of the world.
OK, if you have PCIe GPUs, sometimes you have to re-seat them and it's a pain. Otherwise, if your H200s or disks fail, you just replace them, under warranty or not.