
Comment by bayindirh

4 years ago

This is absolutely correct. The cattle vs. pets analogy [0] applies perfectly there. On the other hand, HPC systems are far from unprotected. Storage systems generally disable write caches on spinning drives automatically and keep all in-flight data in either battery-backed or flash-based caches, so FS-level corruption is kept to a minimum.

Also, yes, many longer jobs checkpoint and restart where they left off, but that's not always possible, unfortunately.
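
For anyone unfamiliar with the pattern, here is a minimal sketch of checkpoint/restart in Python. It is only an illustration, not how any particular HPC code does it: `run_job`, `save_checkpoint`, the JSON state layout and the `stop_after` knob (which simulates a job being killed) are all made-up names for this example. Real applications usually checkpoint far less often and write much larger state, often through MPI-IO or a library.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to a temp file, then atomically rename it over path.

    The rename means a crash mid-write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to stable storage before renaming
    os.replace(tmp_path, path)


def run_job(total_steps, checkpoint_path, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing after every step.

    stop_after (hypothetical, for demonstration) simulates the job being
    killed after that many steps in the current run; None means run to the end.
    Returns the final sum, or None if the run was interrupted.
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(checkpoint_path):
        # Restart: resume from the last saved state instead of step 0.
        with open(checkpoint_path) as f:
            state = json.load(f)

    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated interruption
        state["acc"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        save_checkpoint(state, checkpoint_path)
    return state["acc"]
```

Calling `run_job(10, "ckpt.json", stop_after=4)` and then `run_job(10, "ckpt.json")` finishes the work in the second call without redoing the first four steps. The "not always possible" part is exactly where this breaks down: it only works when the job's state is small and well-defined enough to serialize at a consistent point.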

[0]: https://blog.engineyard.com/pets-vs-cattle