Comment by svara
21 hours ago
Word on HN is that you're either paying more money than you expected for temporal's managed solution or taking on substantial ops burden ultimately running their very heavy system yourself.
I wouldn't know, I've not done either, but I'd like to learn more from your or others' experience.
I told an agent to set it up for me for some local stuff. It is written in Go. It has a painless path to run on a local SQLite DB. My agents use it to organize and coordinate workflows. It handles retries and long horizon tasks fine. As far as I can tell for the core workflows and tasks pieces it’s great. MIT license. Like anything it isn’t free to manage but it offers a lot in return. High reliability systems are hard. Temporal only solves some of it. It is far better than rolling it yourself.
I think a genuine problem right now is that people are building agentic workflows and learning the hard way that highly reliable agentic workflows are hard. Agents are unreliable: they are not deterministic, and the backing APIs have pretty high error rates. Temporal has solved that pain for me and made it easy to diagnose problems.
I don’t have anything really large scale running. But big enough that it takes billions of tokens and high reliability to finish.
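To make the "handles retries" point concrete, here is a minimal sketch of retrying a flaky call with exponential backoff. It is plain Python, not Temporal's API; `retry`, `FlakyAPI`, and every parameter name are hypothetical and only illustrate the general idea.

```python
import time


def retry(fn, attempts=5, base_delay=0.0, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out, backing off exponentially."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # a real system would retry only transient errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


class FlakyAPI:
    """Hypothetical stand-in for an unreliable backing API: fails N times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("transient error")
        return "ok"
```

For example, `retry(FlakyAPI(2))` succeeds on the third call. A workflow engine additionally persists the attempt state, so retries survive a process restart; this sketch does not.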
What's an example of things you have your agents do that use workflows and a SQLite DB?
Autonomous C to Rust. Automated penetration testing and vuln validation.
Could you expand on the "substantial ops burden"? Let's say you're using a managed Postgres instance as the underlying data store, how substantial is the ops burden in that case? I understand that temporal is actually a set of 4 or so microservices on top of a data store, but if you're already running a distributed system backed by k8s or something like that, it doesn't seem like it adds significant incremental ops on top of that. But I could be wrong.
I run my own temporal service in my k8s cluster; this setup is the backbone for almost all my applications. For simplicity, I opted for the postgres backend. You still need to run the 4 (?) other services (history, matching, frontend, ui, maybe others, definitely others if you want observability with prometheus/grafana, and a tad more complexity if you want tailscale to get in there and poke around).
They ship Helm charts so reality is somewhere between "helm deploy" and "substantial ops burden". I don't have to touch it very frequently, but that is not to say I don't have to touch it. There are occasional releases and there have been times where (probably due to my inexperience with helm) I botched an upgrade and lost some data. And I've been on this journey for years; when I first started, they didn't have a Python SDK and it was one of my (many) excuses to learn Go. But anyway to your point, yes, if you're comfortable with k8s and Helm then you shouldn't have much of a problem running hundreds of thousands of workflows; if you want to really push the throughput and optimize cost you probably need to get creative with the individual services and look into cassandra (maybe? idk).
As a dev I would tell you it's an ops burden.
My devops coworker just shrugs, pumps out some yaml and helm and away it goes.
It really depends on your experience and tolerance for a lot of things.
Usually maintenance burden doesn't start to make itself known until you get off the happy path or something breaks. Sometimes it can be a long while before that happens, sometimes it happens right away.
I think it depends a lot on the operational maturity of the company. Some places are running the LGTM observability stack, sentry for error reporting, 24/7 on call rotations, playbooks for all alerts, etc. Those organizations will have fewer issues running systems like temporal because the operational framework is already there.
Other orgs have never heard of alerts or error reporting and naturally will not catch issues until they are catastrophic (for example services that crash frequently in the background go unnoticed until the crash frequency causes a catastrophic failure). In my experience a lot of issues are pretty simple such as running out of memory, CPU throttling, crashes caused by simple bugs (nil panics). If you have good observability you can catch those issues early.
For example: people rag on Ceph that their cluster somehow got into a broken state, but that really only occurs when abuse of the ceph cluster has gone on long enough that the cluster finally reaches the tipping point where it is unrecoverable. If you set ceph up, follow the correct replication rules so components are spread across failure domains, and use the metrics and alerts that are distributed with ceph, it is actually quite hard to break the cluster.
In my experience with a relatively modest number of concurrent workflows (think hundreds) you'll be pushing several thousand transactions per second through that postgres instance.
As best I can tell it doesn't do any batching of its writes/reads, and it's update heavy in places rather than append-only (I suspect their cloud version might do some of these things).
It's pretty close to "let's make every function call serialise its parameters/return value, go through a postgres table and several network hops".
That said it can be very useful, but it's a heavy tool that's best suited for high value/risk workflows where you're earning enough from the execution that you can afford the overhead (for example an Uber trip with several dollars of service fees is probably a good fit, unsurprisingly since its roots are from Uber).
Very heavy indeed. People will confuse the durability that Temporal provides with all the other properties a distributed system needs. They will then think that Temporal will solve all their problems.
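The "every function call goes through a table" description can be sketched as follows. This is an illustration of the general durable-execution idea under stated assumptions, not Temporal's actual implementation: it uses SQLite via Python's standard library, and the `steps` table, `open_store`, and `durable_step` are hypothetical names.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open a SQLite DB with a hypothetical table holding one row per completed step."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        "workflow_id TEXT, step TEXT, args TEXT, result TEXT, "
        "PRIMARY KEY (workflow_id, step))"
    )
    return conn


def durable_step(conn, workflow_id, step, fn, *args):
    """Run fn(*args) at most once per (workflow_id, step); replay the stored result after."""
    row = conn.execute(
        "SELECT result FROM steps WHERE workflow_id = ? AND step = ?",
        (workflow_id, step),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # already done: skip the call, return the recorded value
    result = fn(*args)
    # Serialise parameters and return value, then write them before moving on.
    conn.execute(
        "INSERT INTO steps VALUES (?, ?, ?, ?)",
        (workflow_id, step, json.dumps(args), json.dumps(result)),
    )
    conn.commit()
    return result
```

Every step costs at least one read and one write plus serialisation, which is where the overhead described above comes from; the payoff is that a restarted workflow can skip steps that already finished.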
Their managed solution is pricey and especially the linear scaling with how much you use it is very meh. It's comparable with AWS lambda which also isn't cheap. However it's minor on a typical cloud bill.
Self-hosting is very easy in my experience. I did it for 2 years, but management wanted to move to Temporal Cloud. They have a helm chart which just works, including upgrades. This does assume you have the whole k8s shebang set up and working in your company. I never had to touch it outside upgrades, which took maybe 30m including validation.
Use oban and call it a day: https://oban.pro/