Comment by georaa
5 hours ago
Scheduling is easy. The hard part is everything between "started" and "done" - task needs human approval at step 3, fails at step 5 (retry from 4 or from scratch?), takes 6 hours and something restarts. How do they handle tasks that span multiple inference calls? Is there checkpointing or does it start over?
No comments yet
Contribute on Hacker News ↗