Comment by ATsch
4 years ago
There are really two problems here:
1. Contemporary mainstream OSes have not risen to the challenge of dealing appropriately with the multi-CPU, multi-address-space nature of modern computers. The proportion of the computer that the "OS" actually runs on has been shrinking for a long time, and there have only been a few efforts to fix that (e.g. HarmonyOS, nrk, RTKit).
2. Hardware vendors, faced with proprietary or non-malleable OSes and incentives to keep as much magic in the firmware as possible, have moved forward by essentially sandboxing the user OS behind a compatibility shim. And because it works well enough, OS developers do not feel the need to adjust to the hardware, continuing the cycle.
There is one notable recent exception: adapting filesystems to SMR/zoned devices. However, this only exists on Linux, so desktop PC component vendors do not care. (Quite the opposite: they disable the feature on desktop hardware for market segmentation.)
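For a sense of what that looks like on the Linux side (purely illustrative; the device names are placeholders), the kernel exposes a block device's zone model through sysfs, which is what zoned-aware filesystems and tools key off of:

    # Sketch: check whether a block device is a conventional, host-aware,
    # or host-managed (SMR/zoned) drive via Linux sysfs.
    # Device names are placeholders; the attribute exists on kernels >= 4.10.
    from pathlib import Path

    def zone_model(dev: str) -> str:
        attr = Path(f"/sys/block/{dev}/queue/zoned")
        return attr.read_text().strip() if attr.exists() else "unknown"

    if __name__ == "__main__":
        for dev in ("sda", "sdb", "nvme0n1"):
            print(dev, "->", zone_model(dev))

Drive-managed SMR disks, the kind sold into desktops, simply report "none" here, which is part of why the feature stays invisible outside Linux server setups.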
Are there solutions to this in the high-performance computing space, where random access to massive datasets is frequent enough that the “sandboxing” overhead adds up?
HPC systems generally use LustreFS, where separate sets of servers handle metadata and objects (files). These servers use multiple tiers of drives: metadata servers are SSD-backed, while object servers run on SSD-accelerated spinning-disk boxes holding a mountain of 10TB+ drives.
When this structure is fed into an FDR/EDR/HDR InfiniBand network, the result is a blazingly fast storage system that can absorb a massive number of random accesses from a very large number of servers simultaneously. The whole structure won't even shiver.
Lustre can also pull other tricks for smaller files to accelerate access and reduce the overhead even further.
In this model, the storage boxes are somewhat sandboxed, but the filesystem as a whole is mounted via its own client, so the OS sits very close to the model Lustre provides.
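To make that concrete (the MGS address, filesystem name, and paths below are made-up placeholders, not a real cluster), attaching a compute node through the Lustre client and setting a wide stripe looks roughly like this, sketched in Python around the usual CLI tools:

    # Sketch: mount a Lustre filesystem on a compute node and stripe a
    # directory across all OSTs. All names and addresses are hypothetical.
    import os
    import subprocess

    MGS_NID = "10.0.0.1@o2ib"     # management server NID, reached over InfiniBand
    FSNAME = "scratch"            # hypothetical filesystem name
    MOUNTPOINT = "/mnt/scratch"

    # Mount through the Lustre client itself, not an NFS-style re-export,
    # so the OS speaks the native protocol to the MDS/OSS servers.
    subprocess.run(
        ["mount", "-t", "lustre", f"{MGS_NID}:/{FSNAME}", MOUNTPOINT],
        check=True,
    )

    # Stripe new files in this directory across all available OSTs (-c -1),
    # spreading large files over many object servers and spindles.
    os.makedirs(f"{MOUNTPOINT}/bigfiles", exist_ok=True)
    subprocess.run(
        ["lfs", "setstripe", "-c", "-1", f"{MOUNTPOINT}/bigfiles"],
        check=True,
    )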
On the GPU servers, if you're going to provide big NVMe scratch spaces (à la NVIDIA DGX systems), you soft-RAID the internal NVMe disks with mdadm.
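For illustration (device names, chunk size, and mount point are placeholders), that scratch array is commonly just a striped RAID-0 over the local NVMe devices, since the data is disposable and throughput is the only goal:

    # Sketch: assemble a disposable RAID-0 NVMe scratch array with mdadm,
    # then put a filesystem on it. No redundancy: scratch data is regenerable.
    import os
    import subprocess

    NVME_DEVS = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1"]

    subprocess.run(
        ["mdadm", "--create", "/dev/md0",
         "--level=0",
         f"--raid-devices={len(NVME_DEVS)}",
         "--chunk=512",            # KiB; tune for the workload
         *NVME_DEVS],
        check=True,
    )

    # Format and mount as local scratch.
    subprocess.run(["mkfs.xfs", "/dev/md0"], check=True)
    os.makedirs("/scratch", exist_ok=True)
    subprocess.run(["mount", "/dev/md0", "/scratch"], check=True)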
In both models, saturation happens at the hardware level (disks, network, etc.); processors and other soft components don't impose a meaningful bottleneck even under high load.
Additionally, in the HPC space, power loss is not a major factor: backup power systems exist, and rerunning the last few minutes of a half-completed job is common, so on either side you are unlikely to encounter the fallout of "I clicked save, why didn't it save?"
> HPC systems generally use LustreFS
Or IBM's GPFS / Spectrum Scale. Same deal, really, although GPFS is a more complete package.