Comment by temporal_thr123
2 days ago
I run a large on-prem temporal setup - throwaway acct as they will likely out me.
Temporal is, in my opinion, having run it in prod for over a year, poorly designed, slow, and ridiculously heavy infra-wise.
If you're doing anything non-trivial (say, 200+ events/workflow) and you need to run only a couple hundred of them concurrently all day, you're going to spend millions on infra, and it's still going to absolutely suck.
Try running their own benchmarks, the numbers are pathetic.
Their sales team is also absolutely appalling and desperate.
From a Developer standpoint, the SDK is quite nice though.
Don't get trapped into nexus, and if the sales team call you make sure legal is in the room.
Since I'm in a ranting mode -- here's a good example: you're limited to _ONE_ IO per shard in the history service:
https://github.com/temporalio/temporal/blob/e22e6304b3c4a409...
https://github.com/temporalio/temporal/blob/e22e6304b3c4a409...
Temporal does a crazy amount of database operations and all of these are behind that mutex.
Oh, and you can't change the shard count on existing clusters.
Great stuff.
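To make the "one IO per shard" complaint concrete, here is a toy model in Python. It is not Temporal's code (the links above are truncated, so the real implementation is not reproduced here); `Shard`, `db_call`, and `run` are hypothetical names, and `time.sleep` stands in for a database round trip. It only sketches what happens if every database call for a shard sits behind a single lock.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Toy shard: every simulated database call takes one shard-wide lock."""

    def __init__(self):
        self._io_lock = threading.Lock()       # hypothetical stand-in for the mutex
        self._stats_lock = threading.Lock()    # protects the counters below
        self.in_flight = 0
        self.max_in_flight = 0

    def db_call(self, latency_s):
        with self._io_lock:
            with self._stats_lock:
                self.in_flight += 1
                self.max_in_flight = max(self.max_in_flight, self.in_flight)
            time.sleep(latency_s)              # simulated database round trip
            with self._stats_lock:
                self.in_flight -= 1


def run(num_shards, num_calls, latency_s=0.005, threads=16):
    """Spread calls round-robin over shards; return (elapsed, peak IO per shard)."""
    shards = [Shard() for _ in range(num_shards)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for i in range(num_calls):
            pool.submit(shards[i % num_shards].db_call, latency_s)
    elapsed = time.perf_counter() - start
    return elapsed, max(s.max_in_flight for s in shards)


if __name__ == "__main__":
    for n in (1, 4):
        elapsed, peak = run(num_shards=n, num_calls=20)
        print(f"{n} shard(s): {elapsed:.3f}s, peak concurrent IO per shard = {peak}")
```

In this model the only way to get more parallel IO is more shards, which is why a fixed shard count would hurt.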
Honest question: Can you use Temporal Cloud? Have you evaluated Temporal Cloud pricing?
Ballparking: 200 events/workflow, 200 workflows per day, and assuming 1 event = 1 cloud action[1], that is 1.2M or so actions per month. The $100/month plan includes 1M actions each month, and even the pay-as-you-go pricing when you exceed that is $50 per 1M actions[2].
Temporal Cloud seems extremely cheap for your use case, even if I'm off by a factor of 10. Is there a catch? You still need infra to run your Temporal workers, and I assume there are storage and other costs, but I assume action usage is the majority of it.
1. Not sure exactly what constitutes an "Action". At a glance, seems like most events have a corresponding action(?) and a subset of those actions are actually billable(?)
2. https://docs.temporal.io/cloud/pricing#payg-action-pricing
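The ballpark above, written out as a short Python sketch. The fee, included actions, and overage rate are the figures quoted in this comment, not verified pricing, and "1 event = 1 action" is the same assumption the comment makes. The function names are made up for illustration.

```python
def monthly_actions(events_per_workflow, workflows_per_day, days=30):
    """Actions per month, assuming 1 event = 1 billable action."""
    return events_per_workflow * workflows_per_day * days


def monthly_cost_usd(actions, base_fee=100.0, included=1_000_000, per_million=50.0):
    """Base plan fee plus overage, using the figures quoted in the comment."""
    overage = max(0, actions - included)
    return base_fee + overage * per_million / 1_000_000


actions = monthly_actions(events_per_workflow=200, workflows_per_day=200)
print(actions, monthly_cost_usd(actions))            # the ballpark case
print(actions * 10, monthly_cost_usd(actions * 10))  # "off by a factor of 10"
```

Under those assumptions the bill stays in the hundreds of dollars per month even at ten times the volume.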
I was not clear; I did not mean 200 a day. It's tens of thousands of concurrently running workflows, sometimes into the hundreds of thousands, each with 200 events. We run many hundreds of thousands of these a day.
Temporal was a bad fit for us, and we regret it deeply.
Ah. So multiple billion actions per month, and probably multiple million dollars per year on their cloud, if they can even support that load (plus the vendor lock-in, etc.). Makes sense.
what would you use instead?
> If you're doing anything non-trivial (say, 200+ events/workflow) and you need to run only a couple hundred of them concurrently all day, you're going to spend millions on infra, and it's still going to absolutely suck.
Where are the “millions” on infra going? It’s a handful of services and a Postgres?
> Their sales team is also absolutely appalling and desperate.
You said “on-prem”. It’s open source; why are you dealing with their sales team?
> If you're doing anything non-trivial (say, 200+ events/workflow) and you need to run only a couple hundred of them concurrently all day…
If “millions” were required to obtain such tiny scale, I’d agree there’d be a massive problem. No one would use Temporal; it would be a complete waste of resource. If this were true.
We also hit scaling problems with temporal.
Postgres doesn't scale at all for our workload, so you're into Cassandra.
For a medium-sized deployment, you're looking at 200+ vCPUs, and then let's say standard dev/uat/prod. So now you're at 600 CPUs. Now you need two geographic regions, dev can stay in one place, so now you're at 800. Want a failover cluster for prod? Have another 200 CPUs.
And 200 CPUs is a medium deployment, assuming something like 36 CPUs per Cassandra node, then say 4-8 per instance of matching, worker, history, frontend. Then all your other components around it: ingress controller, service mesh, etc.
There's a million a year easy, for a small deployment.
Our prod one is 4x this size.
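The running totals in this comment can be tallied like so. A minimal sketch: the 200 vCPUs per cluster is the commenter's figure, and counting the second region as one extra cluster is an assumption made here only because it reproduces the stated 800.

```python
VCPUS_PER_CLUSTER = 200  # the commenter's "medium deployment" figure

# (label, number of extra clusters added at this step)
steps = [
    ("dev/uat/prod", 3),
    ("second region", 1),   # assumption: one extra cluster, matching the 800
    ("prod failover", 1),
]

running = 0
totals = []
for label, clusters in steps:
    running += clusters * VCPUS_PER_CLUSTER
    totals.append((label, running))

for label, total in totals:
    print(f"{label}: {total} vCPUs")
```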
Not a couple hundred in one day, a couple hundred being started, concurrently, every second in a day. Each with ~200 events.
We need a 12-node Cassandra cluster for this, with 64-CPU nodes. So no, it's not a couple of services and a Postgres.
Sales team, as we are an enterprise, and they want to extract money from us.
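For scale, the numbers in this comment work out as follows. A back-of-the-envelope sketch using only the figures stated above (200 starts per second, ~200 events each, 12 nodes of 64 CPUs); the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

starts_per_second = 200      # workflows started per second
events_per_workflow = 200    # approximate events per workflow
cassandra_nodes = 12
cpus_per_node = 64

workflows_per_day = starts_per_second * SECONDS_PER_DAY
events_per_second = starts_per_second * events_per_workflow
events_per_day = workflows_per_day * events_per_workflow
cassandra_cpus = cassandra_nodes * cpus_per_node

print(f"{workflows_per_day:,} workflows/day")
print(f"{events_per_second:,} events/s, {events_per_day:,} events/day")
print(f"{cassandra_cpus} Cassandra CPUs")
```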
We’re all enterprise.
If you have 200 WF’s/sec each with 200 events, it sounds to me that you have a sizeable amount of work flowing through this system. 17 million workflows per day? Can I call these transactions?
Do these transactions add value to your business? Do you need durable execution for all these workloads?
Temporal is just a tool; and like any tool it can be misused. For the classic “book a hotel + airline, handle the partial failures” case, 17 million bookings a day would imply you should be thrilled with Temporal.
If you are using it to perform WAF in a firewall, you would be less thrilled. The scale you are describing, and the fact that you aren’t super excited about the incredible amount of money pouring in, makes me question whether the use cases fit the tool.
The same with any "open-source" enterprise ($$$) software. It sucks to run yourself. Docs on running/errors are non-existent. Their helm charts are broken. Instead of degraded performance, it just fails.
Yeah, they've had so much VC cash pumped in lately they really need to pump the SAAS side of the business.
With all due respect – if that’s the attitude, you have no business running anything on-prem. And that’s fine, there’s a reason the various cloud providers are the go-to for many businesses.
Agree. Have worked in a codebase using Temporal, and it is pretty much a nightmare. I don't know about the infra side, but from the developer side, all the abstractions they bring to the table are poorly designed. Wouldn't recommend.
Biggest design bug imo is the workers need to register for the workflows they support, but will happily pull tasks from unrelated workflows if they're on the same queue. No way to put failed tasks back into the queue again either.
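A toy Python model of the behaviour described here, to show why it bites. This is not Temporal SDK code; `Worker`, `poll`, and the workflow names are hypothetical, and the model simply assumes what the comment says: polling is by queue, not by registered workflow type, and a task that cannot be handled is not put back.

```python
from collections import deque


class Worker:
    """Toy worker that registers workflow types but polls a shared queue."""

    def __init__(self, name, registered):
        self.name = name
        self.registered = set(registered)

    def poll(self, queue):
        """Take the next task regardless of its type, as the comment describes."""
        if not queue:
            return None
        task = queue.popleft()
        if task in self.registered:
            return f"{self.name} ran {task}"
        # In this model the task is lost to this worker and never re-queued.
        return f"{self.name} cannot run {task}"


queue = deque(["OrderWorkflow", "BillingWorkflow"])
worker = Worker("w1", registered=["OrderWorkflow"])
results = [worker.poll(queue), worker.poll(queue)]
print(results)
```

In this model the usual workaround is one queue per set of workers that all register the same workflows.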
> if the sales team call you make sure legal is in the room.
What's the deal? It couldn't harm just listening to sales, could it?
I presume legal would be involved before anything is signed in any case?
I think critical parts of openai run on temporal