
Comment by ownagefool

1 day ago

Largely agree. Functionality wise if you don't have many jobs, using the database as the queue is fine.

However, I've been in several situations where scaling the queue brings down the database, and therefore the app, and am thus of the opinion you probably shouldn't couple these systems too tightly.

There are pros and cons, of course.

Using the database for queues is more than fine; it's often essential to correctness. In many queueing use cases you need to update the database atomically with respect to popping from the queue, and if they're separate systems you end up needing either XA or brittle, unreliable custom idempotency logic. I've seen this go wrong before, and it isn't pretty: the common outcome is business-visible data corruption that can have financial impact.
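A minimal sketch of that "dequeue and update in one transaction" property, using SQLite from the Python standard library as a stand-in for the app database (the `jobs`/`accounts` tables and the payload format are illustrative assumptions, not from the post):

```python
import sqlite3

# Autocommit mode so we control transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO jobs (payload) VALUES ('credit:1:100')")
conn.execute("INSERT INTO accounts VALUES (1, 0)")

def pop_and_apply(conn):
    """Pop one job and apply its side effect in a single transaction."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, payload FROM jobs ORDER BY id LIMIT 1").fetchone()
        if row is None:
            conn.execute("ROLLBACK")
            return False
        job_id, payload = row
        _, acct, amount = payload.split(":")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (int(amount), int(acct)))
        conn.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
        conn.execute("COMMIT")  # dequeue and business write succeed or fail together
        return True
    except Exception:
        conn.execute("ROLLBACK")  # job stays queued; no partial effect
        raise

pop_and_apply(conn)
```

If the process dies between the `DELETE` and the `UPDATE`, neither happens: the job reappears on restart. With a separate broker, that crash window is exactly where the XA or idempotency machinery has to live.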

This seems like another case where Postgres gets free marketing from companies hitting its technical limits. I get why they make lemonade out of these incidents with an eng blog post, but it's a way-too-common pattern on HN: some startup builds on Postgres, then spends half its eng budget at the most critical growth stage firefighting around its limits instead of scaling the business. OpenAI had a similar blog post a couple of months ago, in which they revealed they were probably spending more than a quarter of a million dollars a month on an Azure managed Postgres that had stopped scaling, so they were slowly having to abandon it; I made the same comment there [1].

Postgres is a great DB for what you pay, but IMHO well-capitalized blitzscaling startups shouldn't be using it. If you buy a database - and realistically most Postgres users do anyway, as they're paying for a cloud-managed DB - then you might as well buy a commercial DB with an integrated queue engine. I have a financial COI because I have a part-time job there in the research division (on non-DB stuff), so keep that in mind, but they should just migrate to an Oracle Database. It has a queue engine called TxEQ, which is implemented on top of database tables with some C code for efficient blocking polls. It scales horizontally by just adding database nodes while retaining ACID transactions, and hosted versions are available in all the major clouds. I'm using it in a project at the moment and it's been working well. In particular, the ability to dequeue a message into the same transaction that does other database writes is very useful, as is the exposed lock manager.

Beyond scaling horizontally the nice thing about TxEQ/AQ is that it's a full message queue broker with all the normal features you'd expect. Delayed messages, exception queues, queue browsing, multi-consumer etc. LISTEN/NOTIFY is barely a queue at all, really.

For startups like this, the amount of time, money and morale they are losing with all these constant firefighting stories just doesn't make sense to me. It doesn't have to be Oracle; there are other DBs that can do this too. But "We discovered X about Postgres" is an eng blog cliché by this point. You're paying $$$ to a cloud and GPU vendor anyway, just buy a database and get back to work!

[1] https://news.ycombinator.com/item?id=44074506

  • Using queues in an atomic, transactional way was a core principle for building https://pgflow.dev - having the whole workflow state transactionally updated alongside the work on the in-DB queue really simplifies a lot of things: debugging is easier, audit logging is easy, and reporting, stats etc. are one SQL query away.

  • >> but they should just migrate to an Oracle Database

    No big tech companies or unicorn type startups are using Oracle. Is your claim that they are all wrong?

    >> Some startup builds on Postgres then spends half their eng budget at the most critical growth time firefighting around its limits instead of scaling their business

    This is why I suggest starting with some kind of normal queue / stream mechanism and columnar DB if needed. It isn't even harder than using one DB, particularly if you are using niche features.

  • It actually depends on the workload.

    Sending webhooks, as an example, often has zero need to go back and update the database, but I've seen that exact example take down several different managed databases (i.e., not just Postgres).

    • Yes, that's true, but a good implementation will want to surface consistent delivery failures to the recipient via some dashboard. So at some point a message on the exception queue will want to update the DB.

  • > "We discovered X about Postgres" is a eng blog cliché by this point.

    It really is, and it’s often surprising to me how basic some of the issues being discovered are. Like Figma, who waited a shocking amount of time to add [0] PgBouncer and read replicas. This is such a well-trod path that it’s baffling to me why you wouldn’t take it once it’s clear you have a winning product. At the very least, PgBouncer (or PgCat, or any other connection pooler / proxying service) - running a couple of containers with a load balancer adds negligible cost per month in comparison to DB read replicas.

    Re: Oracle, as much as I despise the company for its litigious practices, I’ll hand it to you that the features your DB has are astonishing. RAC is absolutely incredible (on paper - I’ve never used it).

    [0]: https://www.figma.com/blog/how-figma-scaled-to-multiple-data...

  • If you need a transaction spanning a queue and a normal SQL DB or similar, I believe you are doing something very wrong.

    Sure, you need transactions around processing things in a queue: mark a message as "taken out" without removing it yet, then either remove it on success or "place it back" (or into a failed-messages inbox) on timeout or similar. That can be _very_ important for queue systems.
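    That take-out/requeue-on-timeout pattern can be sketched without coupling to any business transaction - here with SQLite as a stand-in broker (the `taken_at` column, timeout value, and table name are illustrative assumptions):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE queue (
    id INTEGER PRIMARY KEY,
    payload TEXT,
    taken_at REAL  -- NULL = available; set when a worker takes the message
)""")
conn.execute("INSERT INTO queue (payload) VALUES ('msg-1')")
conn.commit()

VISIBILITY_TIMEOUT = 30.0  # seconds; illustrative value

def take(conn, now):
    """Mark one available (or timed-out) message as taken and return it."""
    row = conn.execute(
        "SELECT id, payload FROM queue "
        "WHERE taken_at IS NULL OR taken_at < ? LIMIT 1",
        (now - VISIBILITY_TIMEOUT,)).fetchone()
    if row:
        conn.execute("UPDATE queue SET taken_at = ? WHERE id = ?", (now, row[0]))
        conn.commit()
    return row

def ack(conn, msg_id):
    """Remove a successfully processed message."""
    conn.execute("DELETE FROM queue WHERE id = ?", (msg_id,))
    conn.commit()

msg = take(conn, time.time())        # worker takes the message
# ...if the worker dies here, taken_at eventually expires and the
# message becomes visible again, so another worker can retry it
msg2 = take(conn, time.time())       # hidden within the timeout: returns None
msg3 = take(conn, time.time() + 60)  # visible again after the timeout
```

    The broker state here is self-contained: whether the worker's own processing succeeds or fails is communicated only through `ack` or the timeout, not through a shared transaction.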

    But the moment the "fail safe if something dies while processing a message" logic becomes directly coupled to DB transactions, you have created something very brittle and cumbersome.

    To be fair that might still be the best solution for some situations.

    But the better solution is to treat a queue as a message-passing system and handle messages as messages, with the appropriate delivery semantics. And if you can't, because e.g. your idempotency logic is super unreliable, then there is indeed a problem - but it's not the missing cross-system transactions, it's how you write that logic (missing _tooling_, strict code guidelines people actually comply with, interface regression checks, tests - including property/fuzz tests, regression tests, integration/e2e tests etc., not just "dumb" unit tests).

    > just migrate to an Oracle Database.

    In my experience Oracle DB is very powerful but also very cumbersome in a lot of ways, and if you need things only it can provide, you have most likely already fucked up big time somewhere else in your design/architecture. Sure, at that point Oracle can easily be the cheaper solution. But preferably you never end up there.

    As a side note, there are also a lot of decent plugins which can provide similar capabilities for PG, but they tend to have the issue that they aren't part of managed PG offerings, and self-managing PG (or most other reasonably powerful DBs) can be a huge pain - and then, yes, Oracle can be a solution.

    Still, the number of startups I know of that had an overall good experience with Oracle is essentially zero. (But there are some pretty big companies/projects I know of which have had an overall good experience with it.)

    > constant stories of firefights

    If you mean stories on HN, then that isn't a meaningful metric: you only hear about the "interesting" stories, which are mostly either firefighting or "using PG for everything is great", and rarely the majority of in-between stories and boring, silent successes. If it's about stories from your own career and from asking dev friends about their experience, then it is more meaningful - but still a bubble (as this answer of mine is, without question, in a bubble).

    Generally I think people really overestimate how representative HN is. I don't know about the US, but outside of it _huge_ parts of the IT industry are not represented on HN in any meaningful way. I would say in my country HN is _at most_ representative of 1/4 of the industry, and while that 1/4 contains many of the very, very motivated software developers, it contains very few of the "it's my work, not my calling", bread-and-butter software devs - who, as far as I can tell, are often 1/3 to over 1/2 of devs in most countries.

    • >> To be fair that might still be the best solution for some situations.

      It is arguable. Let's say your team knows Postgres well from a relational standpoint. Now they need to handle messages and require some type of notification/messaging. There is a learning curve either way, and I'd argue they should spend it on more standard approaches, which are not harder to start with. Of course, if you know that your site/service will only ever be used by yourself and your grandmother, do whatever you want (just use a text file - or better yet, just call her instead).

    • >But the moment the "fail save if something dies while processing a message" becomes a directly coupled with DB transactions you have created something very brittle and cumbersome.

      The standard workflow for processing something from a queue is to keep track of the messages you have already processed in the transactional database and request only the remaining unprocessed ones. Often this is as simple as storing the last successfully processed message ID in the database and updating it in the same transaction that processes the message. If an error occurs you roll the transaction back, which also rolls back the last message ID. The consumer will automatically re-request the failed message on the next attempt, giving you out-of-the-box idempotency for at-least-once messaging.
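      A sketch of that last-processed-ID pattern, again with SQLite standing in for the transactional database (the `consumer_state`/`results` tables, the `"worker"` name, and the simulated failure are illustrative assumptions):

```python
import sqlite3

# Autocommit mode so BEGIN/COMMIT/ROLLBACK are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE consumer_state (name TEXT PRIMARY KEY, last_id INTEGER);
CREATE TABLE results (msg_id INTEGER PRIMARY KEY, value TEXT);
INSERT INTO consumer_state VALUES ('worker', 0);
""")

# Incoming stream: note the duplicate delivery of message 2 and a
# transient failure on the first attempt at message 3.
incoming = [(1, "a"), (2, "b"), (2, "b"), (3, "fail"), (3, "c")]

def handle(conn, msg_id, value):
    """Process one message; progress marker commits with the work."""
    conn.execute("BEGIN")
    try:
        last = conn.execute(
            "SELECT last_id FROM consumer_state WHERE name = 'worker'"
        ).fetchone()[0]
        if msg_id <= last:
            conn.execute("ROLLBACK")  # duplicate delivery: nothing to do
            return
        if value == "fail":
            raise RuntimeError("simulated processing error")
        conn.execute("INSERT INTO results VALUES (?, ?)", (msg_id, value))
        conn.execute("UPDATE consumer_state SET last_id = ? WHERE name = 'worker'",
                     (msg_id,))
        conn.execute("COMMIT")   # work and progress marker commit atomically
    except Exception:
        conn.execute("ROLLBACK")  # last_id rolls back too: message is retried

for msg_id, value in incoming:
    handle(conn, msg_id, value)
```

      The duplicate of message 2 is a no-op and the failed attempt at message 3 leaves `last_id` untouched, so the retry succeeds: effectively-once processing from at-least-once delivery, with no cross-system transaction.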
