Comment by georaa

5 hours ago

Scheduling is easy. The hard part is everything between "started" and "done": a task needs human approval at step 3, fails at step 5 (retry from step 4 or from scratch?), or runs for 6 hours and something restarts partway through. How do they handle tasks that span multiple inference calls? Is there checkpointing, or does the task start over?
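To make the question concrete, here is a minimal sketch of what step-level checkpointing could look like. Everything in it is hypothetical: `run_with_checkpoints`, the JSON checkpoint file, and the step functions are illustrative names, not how any particular product works. It assumes each step is a function that takes a JSON-serializable state dict and returns the updated one.

```python
import json
import os


def run_with_checkpoints(steps, path, state=None):
    """Run steps in order, saving progress to `path` after each one.

    If a checkpoint file already exists, finished steps are skipped and
    the saved state is reused, so a failure at step N resumes at step N.
    (Hypothetical helper for illustration only.)
    """
    state = dict(state or {})
    next_step = 0
    if os.path.exists(path):
        with open(path) as f:
            saved = json.load(f)
        next_step, state = saved["next_step"], saved["state"]

    for i in range(next_step, len(steps)):
        # If this raises, the checkpoint on disk still points at step i.
        state = steps[i](state)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_step": i + 1, "state": state}, f)
        os.replace(tmp, path)
    return state
```

With this shape, "retry or start over" becomes a matter of calling the function again with the same `path`. One caveat: a step can finish and the process can still die before the checkpoint is written, so that step runs again on restart, which means steps need to be safe to repeat. A human-approval gate could be modeled as a step that raises until the approval is recorded, but that is a design choice rather than something this sketch settles.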
