Comment by NGRhodes

4 years ago

In my experience, hours and days of jobs need to be rerun; our researchers do a poor job of checkpointing. Of all the issues we have with Lustre, data loss has never been one whilst I have been on the team.

> In my experience, hours and days of jobs need to be rerun; our researchers do a poor job of checkpointing.

Enable pre-emption in your queues by default and that'll change: after a job has been scheduled and run for 1-2 hours, it can be kicked out and a new one run instead once the first job's priority decays a bit.

* https://slurm.schedmd.com/preempt.html
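
A minimal sketch of what partition-priority preemption might look like in `slurm.conf` (the partition names, node lists, and time limits here are made up for illustration; see the linked docs for the full option set):

```
# Preempt based on partition PriorityTier; requeue victims rather than cancel them
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Low-tier partition: jobs here may be preempted and requeued
PartitionName=scavenge PriorityTier=1 PreemptMode=REQUEUE Nodes=ALL Default=YES MaxTime=7-00:00:00

# High-tier partition: jobs here can preempt scavenge jobs but are never preempted
PartitionName=priority PriorityTier=10 PreemptMode=OFF Nodes=ALL MaxTime=2-00:00:00
```

With `PreemptMode=REQUEUE`, a preempted job goes back into the queue and restarts from the beginning unless it checkpoints, which is exactly the incentive being described.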

You can add incentives:

> When would I want to use preemption? When would I not want to use it?

> When a job is designated as a preemptee, we increase the job's priority, and increase several limits, including the maximum number of running processors or jobs per user, and the maximum running time per job. Note that these increased limits only apply to the preemptable job. This allows preemptable jobs to potentially run on more resources, and for longer times, than normal jobs.

* https://rc.byu.edu/documentation/pbs/preemption

  • > Enable pre-emption in your queues by default and that'll change.

    We run preemptive queues, and no, not all jobs are compatible with that. Especially code the researchers developed themselves.

    My own code doesn't support checkpointing either. Currently it's blazing fast, but bigger jobs might need that support, and it would take many more moving parts in the pipeline to make it possible.
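
    For simple step-based jobs, basic checkpointing needs fewer moving parts than it might seem. A minimal sketch in Python (the file name, state shape, and `run` loop are all hypothetical stand-ins, not anyone's actual pipeline):

    ```python
    import os
    import pickle

    CKPT = "state.pkl"  # hypothetical checkpoint file name

    def load_state():
        """Resume from a checkpoint if one exists, else start fresh."""
        if os.path.exists(CKPT):
            with open(CKPT, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "total": 0}

    def save_state(state):
        """Write the checkpoint atomically so a kill mid-write can't corrupt it."""
        tmp = CKPT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CKPT)  # atomic rename on POSIX

    def run(n_steps=10, ckpt_every=3):
        """Do n_steps units of 'work', checkpointing every ckpt_every steps."""
        state = load_state()
        while state["step"] < n_steps:
            state["total"] += state["step"]   # stand-in for real work
            state["step"] += 1
            if state["step"] % ckpt_every == 0:
                save_state(state)
        save_state(state)
        return state["total"]
    ```

    If the job is preempted and requeued, the next run picks up from the last checkpoint instead of step zero. The atomic `os.replace` matters under preemption: being killed mid-write must never leave a half-written checkpoint as the only copy.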