Comment by lazylizard

4 hours ago

i am not sre, merely sysadmin.

and somehow i have this impression that gpus on slurm/pbs could not be simpler.

u can use a vm for the head node, dont even need the clustering really..if u can accept taking 20min to restore a vm.. and the rest of the hardware are homogeneous - you setup 1 right and the rest are identical.

and its a cluster with a job queue.. 1 node going down is not the end of the world..

ok if u have pcie GPUs sometimes u have to re-seat them and its a pain. otherwise if ur h200 or disks fail u just replace them, under warranty or not...

0 comments

lazylizard

No comments yet

Contribute on Hacker News ↗