Comment by mike_hearn
2 days ago
Using the database for queues is more than fine, it's often essential to correctness. In many use cases for queues you need to atomically update the database with respect to popping from the queue, and if they're separate systems you end up needing either XA or brittle and unreliable custom idempotency logic. I've seen this go wrong before and it's not nice; the common outcome is business-visible data corruption that can have financial impact.
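For anyone who hasn't seen the pattern, here's a rough sketch of what "pop and update atomically" looks like in Postgres, with a hypothetical jobs table and psycopg2 (the same idea works in any driver); FOR UPDATE SKIP LOCKED is the usual trick:

    import psycopg2  # sketch only; table and column names are made up

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:  # 'with conn' commits, or rolls back on error
        # claim and remove one job, skipping rows other workers have locked
        cur.execute("""
            DELETE FROM jobs
            WHERE id = (
                SELECT id FROM jobs
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            )
            RETURNING id, payload
        """)
        row = cur.fetchone()
        if row:
            job_id, payload = row
            # the business write commits atomically with the queue pop;
            # if it raises, the DELETE rolls back and the job gets retried
            cur.execute("INSERT INTO processed_events (job_id, payload) "
                        "VALUES (%s, %s)", (job_id, payload))

No XA, no hand-rolled idempotency bookkeeping: the pop and the write succeed or fail together.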
This seems like another case where Postgres gets free marketing due to companies hitting its technical limits. I get why they choose to make lemonade in these cases with an eng blog post, but this is a way too common pattern on HN. Some startup builds on Postgres then spends half their eng budget at the most critical growth time firefighting around its limits instead of scaling their business. OpenAI had a similar blog post a couple of months ago, where they revealed they were probably spending more than a quarter of a million dollars a month on an Azure managed Postgres that had stopped scaling, so they were having to slowly abandon it; I made the same comment there [1].
Postgres is a great DB for what you pay, but IMHO well-capitalized blitzscaling startups shouldn't be using it. If you buy a database - and realistically most Postgres users do anyway, as they're paying for a cloud managed DB - then you might as well buy a commercial DB with an integrated queue engine. I have a financial COI because I have a part-time job at Oracle in the research division (on non-DB stuff), so keep that in mind, but they should just migrate to an Oracle Database. It has a queue engine called TxEQ, which is implemented on top of database tables with some C code for efficient blocking polls. It scales horizontally by just adding database nodes whilst retaining ACID transactions, and you can get hosted versions in all the major clouds. I'm using it in a project at the moment and it's been working well. In particular, the ability to dequeue a message in the same transaction that does other database writes is very useful, as is the exposed lock manager.
Beyond scaling horizontally the nice thing about TxEQ/AQ is that it's a full message queue broker with all the normal features you'd expect. Delayed messages, exception queues, queue browsing, multi-consumer etc. LISTEN/NOTIFY is barely a queue at all, really.
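To make that concrete, the dequeue-in-the-same-transaction flow looks roughly like this from Python with the python-oracledb driver (queue and table names here are invented; treat this as a sketch and check the AQ/TxEQ docs, not as gospel):

    import oracledb  # python-oracledb; AQ/TxEQ support varies by driver version

    conn = oracledb.connect(user="app", password="...", dsn="mydb_high")
    queue = conn.queue("ORDERS_QUEUE")   # hypothetical TxEQ queue
    queue.deqoptions.wait = 30           # blocking poll, in seconds

    msg = queue.deqone()                 # the dequeue joins the open transaction
    if msg:
        cur = conn.cursor()
        # this write and the message removal commit (or roll back) together
        cur.execute("UPDATE orders SET status = 'PAID' WHERE id = :1",
                    [int(msg.payload)])
        conn.commit()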
For startups like this, the amount of time, money and morale they are losing with all these constant stories of firefights just doesn't make sense to me. It doesn't have to be Oracle; there are other DBs that can do this too. But "We discovered X about Postgres" is an eng blog cliché by this point. You're paying $$$ to a cloud and GPU vendor anyway, just buy a database and get back to work!
Using queues in an atomic, transactional way was a core principle for building https://pgflow.dev - having the whole workflow state updated transactionally alongside the work on the in-db queue really simplifies a lot of things: debugging is easier, an audit log is easy, and reporting, stats etc. are one SQL query away.
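As a sketch of the idea (not pgflow's actual schema - the tables here are invented), one transaction can move the step state, the audit trail and the next unit of work together:

    # psycopg2 sketch; 'steps', 'audit_log' and 'queue' are hypothetical tables
    def complete_step(conn, run_id, step, next_step):
        with conn, conn.cursor() as cur:  # one transaction for all three writes
            cur.execute("UPDATE steps SET status = 'done', finished_at = now() "
                        "WHERE run_id = %s AND step = %s", (run_id, step))
            cur.execute("INSERT INTO audit_log (run_id, step, event) "
                        "VALUES (%s, %s, 'completed')", (run_id, step))
            cur.execute("INSERT INTO queue (run_id, step) VALUES (%s, %s)",
                        (run_id, next_step))
        # if any statement fails, none of it happened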
Nice! I'm also using queues as part of a workflow engine.
Oh really? Would love to check it out and borrow some ideas! :)
Looks interesting- but why the Supabase dependency? That’s a much tighter requirement than a vanilla PostgreSQL extension or something like PostgREST
Valid point!
So pgflow is really platform-agnostic; Postgres is its only fundamental dependency. All components are modular and ready to be adapted to other runtimes.
It's just that Supabase is what I use, and I figured it would be my first platform, but the abstraction needed to port to others is there!
>> but they should just migrate to an Oracle Database
No big tech companies or unicorn type startups are using Oracle. Is your claim that they are all wrong?
>> Some startup builds on Postgres then spends half their eng budget at the most critical growth time firefighting around its limits instead of scaling their business
This is why I suggest starting with some kind of normal queue / stream mechanism, and a columnar DB if needed. It isn't even harder than using one DB for everything, particularly if you'd otherwise be relying on niche features.
Big tech companies do use it. Apple was advertising for Oracle DBA roles just last month. And consider that Amazon had to staff a massive multi-year project to migrate off it to their own in-house DB, which they only did because they had become a competitor.
W.r.t. unicorn-type startups: yes, my argument is that they are all wrong and should be using a different database. There's competitive advantage to be had there.
It actually depends on the workload.
Sending webhooks, as an example, often has zero need to go back and update the database, but I've seen that exact example take down several different managed databases (i.e., not just Postgres).
Yes, that's true, but a good implementation will want to surface consistently failing deliveries to the recipient via some dashboard. So at some point a message on the exception queue will want to update the DB.
> "We discovered X about Postgres" is a eng blog cliché by this point.
It really is, and it's often surprising to me how basic some of the issues being discovered are. Like Figma, who waited a shocking amount of time to add PgBouncer and read replicas [0]. This is such a well-trod path that it's baffling to me why you wouldn't do it once it's clear you have a winning product. At the very least, PgBouncer (or PgCat, or any other connection pooler / proxying service) - it adds negligible cost per month (compared to DB read replicas) to run a couple of containers with a load balancer.
Re: Oracle, as much as I despise the company for its litigious practices, I’ll hand it to you that the features your DB has are astonishing. RAC is absolutely incredible (on paper - I’ve never used it).
[0]: https://www.figma.com/blog/how-figma-scaled-to-multiple-data...
If you need a transaction that spans a queue and a normal SQL DB or similar, I believe you are doing something very wrong.
Sure, you need transactional behaviour around processing things in the queue itself: mark as "taken out" but not yet removed, then remove on success, or "place back in" (or into a failed-messages inbox) on timeout. That can be _very_ important for queue systems.
But the moment the "fail safe if something dies while processing a message" logic becomes directly coupled to DB transactions, you have created something very brittle and cumbersome.
To be fair that might still be the best solution for some situations.
But the better solution is to make sure you treat a queue as a message-passing system and handle messages as messages, with the appropriate delivery semantics. And if you can't, because e.g. your idempotency logic is super unreliable, then there is indeed a problem - but it's not the missing cross-system transactions, it's how you write that logic (missing _tooling_, strict code guidelines people actually comply with, interface regression checks, tests - including prop/fuzz tests, regression tests, integration/e2e tests etc., not just "dumb" unit tests).
> just migrate to an Oracle Database.
In my experience Oracle DB is very powerful but also very cumbersome in a lot of ways, and if you need things only they can provide, you most likely already fucked up big time somewhere else in your design/architecture. Sure, if you are at that point, Oracle can easily be the cheaper solution. But preferably you never end up there.
As a side note, there are also a lot of decent extensions which can provide similar capabilities for PG, but they tend to have the issue that they aren't part of managed PG offerings, and self-managing PG (or most other reasonably powerful DBs) can be a huge pain - and then, yes, Oracle can be a solution.
Still, the number of startups I know of that had an overall good experience with Oracle is basically zero. (But there are some pretty big companies/projects I know of which have had an overall good experience with it.)
> constant stories of firefights
If you mean stories on HN, then that isn't a meaningful metric: you will only hear about the "interesting" stories, which are mostly either firefighting or "using PG for everything is great", and rarely the majority of in-between stories and boring, silent successes. If it's about stories from your own career and from asking dev friends about their experiences, then it is more meaningful - but still a bubble (like this answer of mine is, without question, in a bubble).
Generally I think people really overestimate how representative HN is. I don't know about the US, but outside of it _huge_ parts of the IT industry are not represented on HN in any meaningful way. I would say in my country HN is representative of _at most_ 1/4 of the industry - but that 1/4 contains many of the most motivated software developers, and very few of the "it's my work, not my calling", bread-and-butter devs, who as far as I can tell make up 1/3 to over 1/2 of devs in most countries.
>> To be fair that might still be the best solution for some situations.
It is arguable. Let's say your team knows Postgres well from a relational standpoint. Now they need to do something with messages and require some type of notification / messaging. There is a learning curve here either way; I'd argue they should spend it on more standard approaches, which are not harder to start with. Of course, if you know that your site / service will only be used by yourself and your grandmother, do whatever you want (just use a text file, or better yet, just call her instead).
>But the moment the "fail safe if something dies while processing a message" logic becomes directly coupled to DB transactions, you have created something very brittle and cumbersome.
The standard workflow for processing something from a queue is to keep track of all the messages you have already processed in the transactional database and simply request the remaining unprocessed messages. Often this is as simple as storing the last successfully processed message ID in the database and updating it in the same transaction that processed the message. If an error occurs you roll the transaction back, which also rolls back the last message ID. The consumer will automatically re-request the failed message on the next attempt, giving you out-of-the-box idempotency for at-least-once messaging.
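Concretely, something like this (psycopg2 sketch; consumer_offsets and messages are hypothetical tables, and the handler stands in for your processing logic):

    def consume_one(conn, handler, consumer="billing"):
        # the offset and the handler's effects commit (or roll back) together
        with conn, conn.cursor() as cur:
            cur.execute("SELECT last_id FROM consumer_offsets "
                        "WHERE consumer = %s FOR UPDATE", (consumer,))
            (last_id,) = cur.fetchone()
            cur.execute("SELECT id, payload FROM messages "
                        "WHERE id > %s ORDER BY id LIMIT 1", (last_id,))
            row = cur.fetchone()
            if row:
                msg_id, payload = row
                handler(payload)  # an exception here rolls back the offset too
                cur.execute("UPDATE consumer_offsets SET last_id = %s "
                            "WHERE consumer = %s", (msg_id, consumer))
        # a crash before commit just means the message is re-read next time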
My approach is to have started/completed fields, where started records the system/process/timestamp of when an item was claimed... setting it is how a worker tags and takes the next item. It also allows for sweep and retry.
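Roughly like this (a sketch against a hypothetical tasks table; the 10-minute cutoff is arbitrary):

    # assuming an open psycopg2 cursor; commit handling omitted for brevity
    def claim_next(cur, worker_name):
        # claim the next unstarted item, skipping rows other workers hold
        cur.execute("""
            UPDATE tasks SET started_at = now(), worker = %s
            WHERE id = (
                SELECT id FROM tasks
                WHERE started_at IS NULL
                ORDER BY id
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            )
            RETURNING id, payload
        """, (worker_name,))
        return cur.fetchone()  # None when the queue is empty

    def sweep(cur):
        # anything started long ago but never completed goes back in the pool
        cur.execute("""
            UPDATE tasks SET started_at = NULL, worker = NULL
            WHERE completed_at IS NULL
              AND started_at < now() - interval '10 minutes'
        """)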
That said, I tend to reach for Redis/RabbitMQ or Kafka relatively early, depending on my specific needs and what's already in use. My main use of a DBMS queue historically has been sending/tracking emails when the email service I was using was having hiccups.