Comment by convolvatron

2 days ago

in larger systems the utility of sharing a single cpu/gpu complex between independent authorization domains kind of goes away. if you have 10,000 units of allocation, it never makes sense to try to share one of those until you have more than 10,000 jobs, and even then.

so it seems a lot more feasible to control access and sharing between those units and write of off the intranode case as a lost cause

2 comments

convolvatron

Joel_Mckay 2 days ago

In such arrangements, one has essentially enforced high-latency similar context isolation using encrypted/VLAN network fabric, and pushed coordination/permissions into back-plane supervisory subsystems. Still creating a monolithic permission domain vulnerability within the entire n<10000 node cluster partition.

Likely doesn't help OS users either way. Best regards =3

convolvatron 2 days ago

you kinda missed my point. already in the cluster the important filesystem is the distributed one. the important job management system is the distributed one. the local OS just effectively supports the single process that we really care about. so the distributed context is where we add capabilities and actually manage access and resources. that is the real OS.