Comment by gnat

2 days ago

I find the best comments here to be ones where people use their knowledge and experience to discuss the relative strengths and weaknesses of the technology in the post. I see a bunch of short single-sentence comments here that add no value.

For my part, I see this pattern repeatedly at different places. The raw tools in the platforms are too codey and the third-party frameworks like Temporal seem overkill, so you build a scheduler and need to solve the problems OP did: only run once, know if it errored, etc.
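
For what it's worth, the two problems named above (only run once, know if it errored) can be sketched roughly like this. It is a minimal sketch, not OP's implementation: it uses Python's sqlite3 with a conditional UPDATE as a stand-in for the SELECT ... FOR UPDATE you would likely reach for on Postgres or MySQL, and the jobs table, its columns, and the function names (create_schema, claim_next, run_job) are all made up for illustration.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table; status is one of: pending, running, done, failed.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS jobs (
            id     INTEGER PRIMARY KEY,
            name   TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            error  TEXT
        )
        """
    )
    conn.commit()


def claim_next(conn: sqlite3.Connection):
    """Claim the oldest pending job. Returns (id, name), or None if nothing was claimed."""
    with conn:  # commits on success, rolls back on exception
        row = conn.execute(
            "SELECT id, name FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in the WHERE clause is what makes the claim safe:
        # if another worker got there first, rowcount is 0 and we back off.
        claimed = conn.execute(
            "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        return row if claimed.rowcount == 1 else None


def run_job(conn: sqlite3.Connection, job_id: int, action) -> bool:
    """Run the action and record the outcome so failures are visible later."""
    try:
        action()
    except Exception as exc:
        with conn:
            conn.execute(
                "UPDATE jobs SET status = 'failed', error = ? WHERE id = ?",
                (str(exc), job_id),
            )
        return False
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    return True
```

A worker loop would call claim_next, pass the job to run_job, and sleep when claim_next returns None. Retries, timeouts for jobs stuck in 'running', and schedules are deliberately left out, which is exactly where the creep starts.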

But it's amazing how "it's firing off a basic action!" becomes a script, then becomes a script composed of reusable actions that can pick up where they left off in case of errors ... Over time your "it's just enough for us!" feature creeps towards the framework's functionality.
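
To make that creep concrete, here is a rough sketch of the "script composed of reusable actions that can pick up where it left off" stage. Everything in it is hypothetical (the run_workflow name, the checkpoint dict and its "done" and "error" keys); a real version would persist the checkpoint somewhere durable rather than keep it in memory.

```python
from typing import Any, Callable, Dict, List, Tuple

# A step is a (name, action) pair; names must be unique within a workflow.
Step = Tuple[str, Callable[[], None]]


def run_workflow(steps: List[Step], checkpoint: Dict[str, Any]) -> bool:
    """Run steps in order, skipping any already recorded in the checkpoint.

    Returns True if every step has completed, False if a step raised.
    On failure the error is stored in the checkpoint, and a later call
    with the same checkpoint resumes at the step that failed.
    """
    done = checkpoint.setdefault("done", [])
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier run, do not repeat it
        try:
            action()
        except Exception as exc:
            checkpoint["error"] = f"{name}: {exc}"
            return False
        done.append(name)
    checkpoint.pop("error", None)  # clear any error left by a previous run
    return True
```

Calling run_workflow again with the same checkpoint after a failure re-runs only the failed step and whatever comes after it. Add durable storage, retries with backoff, and per-step timeouts, and you are most of the way to reimplementing a workflow engine.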

I'd be curious to know how long the OP's solution stays simple before it gives in to feature-creep demands. (Long may complexity be fought off, though! Every day you can live without the complexity of full workflows is a blessing.)

Maybe I'm just lucky to work at a place with good tools, but in my experience Temporal isn't super heavyweight to use compared to building your own even-very-simple scheduler.

And it's worth it because now you have Temporal, which is the bee's knees as far as I'm concerned. I will gladly sing the praises of any tool that saves me from getting paged, and Temporal does that in spades.

  • Temporal is awful. Difficult to test, difficult to decouple from your domain code. At least that’s what I have seen in organizations. OP’s solution is rather understandable: with a couple of interfaces, you make the code easily testable.

  • Seconding Temporal. Plus it gives you more freedom to write jobs in different languages... not that you would or should in most cases, but there are definitely good reasons to.

    • Don’t do it on-prem unless you want to spend six figures monthly on Cassandra database nodes for pretty shit performance, face constant SaaS upselling, and then discover how hard it is to migrate off of it.

      Write your own scheduler.

      Oracle is cheaper in the long run.

Cloud companies also provide globe-scale cronjobs that work a lot like a Unix cronjob. Arguably less mental overhead than adopting a separate framework.

And such a service provides reliability guarantees.

If I have to run a reliable periodic job, my go-to is a Kubernetes CronJob, which is like a baby version of a cloud cronjob. I'd be reluctant to adopt some sort of task queue framework because of the complexity of the mental model plus the complexity of keeping one more thing running reliably. K8s is already running reliably, so I might as well use that.

The pragmatic answer is Jenkins. Always has been.

  • Jenkins is a place where you can be safe for a long time; however, it starts to break down at scale. I see it time after time for these batch workflow jobs. At the start, jobs run in seconds and everyone is happy.

    Over time, jobs start taking long enough that you need to split them. Separate jobs are assigned slices of the original batch. Eventually, there are so many slices that you make a Jenkins job whose sole responsibility is firing off these individual jobs.

    Then you start hitting the real pain points in Jenkins. Jobs are poorly allocated across your nodes/agents, often overloading CPU/memory on machines, and you struggle to manage the ungodly interface that is the Jenkins REST endpoint. You install many Jenkins add-ons to try to address the scheduling problems, and end up with a team dedicated to managing this Jenkins infrastructure.

    The scaling struggles continue to pile up and you end up needing separate Jenkins instances to handle the load. Any attempt at replacing the Jenkins infrastructure grinds to a standstill, as the number of random scripts found in Jenkinsfiles has created an insurmountable vendor lock-in.

    You read a post about a select-for-update job scheduler and reflect on simpler times. You cry as you refactor your Jenkins Groovy DSL.

  • Ugh no. It was good enough for its time, but times have moved on.

    The danger is that it's so easy to start with and it's decent for small and simple applications. Once your jobs start growing, both in number of contributors and in workload, the problems start. The DSL is difficult to debug, plugins are buggy, and the brittle master node becomes your most precious pet that needs constant supervision to keep it from grinding the whole system to a halt. By the time you realize this, you have a hard time getting out of the lock-in.

  • Jenkins is terrible for just about everything. Cron has real problems, but at least you can version control the crontab. Jenkins is fat, hard to work with since you'll just have one shared instance, and everything is buried in special objects hidden behind a very unergonomic and undiscoverable web GUI.

    • ? You can (and should) version control your Jenkins config as well, including the pipeline code.