Comment

14 days ago

Great application of first principles. I think it's totally reasonable, even at most production loads. (Example: my last workplace had a service that constantly roared at 30k events per second, and our DLQs would have at most on the order of hundreds of messages in them.) We would get paged if a message sat in the queue for more than an hour.
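That paging rule maps directly onto SQS's `ApproximateAgeOfOldestMessage` CloudWatch metric. Here is a minimal sketch of how such an alarm might be defined; the queue name, SNS topic ARN, and the helper itself are hypothetical, not something from the original setup.

```python
def dlq_age_alarm(queue_name, sns_topic_arn, max_age_seconds=3600):
    """Build kwargs for cloudwatch.put_metric_alarm(): fire when the
    oldest message in the DLQ has been sitting longer than an hour."""
    return {
        "AlarmName": f"{queue_name}-dlq-message-age",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,                # evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": max_age_seconds,  # seconds; 3600 = one hour
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g. an SNS topic wired to paging
    }

# Usage (requires AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **dlq_age_alarm("orders-dlq", "arn:aws:sns:us-east-1:123456789012:pager"))
```

Building the kwargs as a plain dict keeps the alarm definition testable without touching AWS.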

The idea is that if your DLQ has consistently high volume, there is something wrong with your upstream data or your data-handling logic, not the architecture.

What did you use for the DLQ monitoring? And how did you fix the issues?

  • We used AWS strictly for everything and always preferred AWS-managed services, so we always used SQS (and its built-in DLQ functionality). It made it easy to configure throttling, alerting, buffering, concurrency, retries, etc., and you could easily use the UI to inspect the messages in a pinch.

    As far as fixing actual critical issues: usually the message in the DLQ carried a trace that was revealing enough, though not always trivially so.

    The philosophy was either:

    1. fix the issue, or
    2. swallow the issue (rarer)

    but either way, make sure the message never comes back to the DLQ again.
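The SQS-native DLQ wiring described above comes down to a redrive policy on the main queue: once a message has been received more than `maxReceiveCount` times without being deleted, SQS moves it to the DLQ automatically. A sketch, with a hypothetical helper and placeholder ARN:

```python
import json

def redrive_policy(dlq_arn, max_receives=5):
    """Value for the main queue's RedrivePolicy attribute: after
    max_receives failed receives, SQS dead-letters the message."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receives,  # retry budget before dead-lettering
    })

# Usage (requires AWS credentials):
# import boto3
# sqs = boto3.client("sqs")
# sqs.set_queue_attributes(
#     QueueUrl=main_queue_url,
#     Attributes={"RedrivePolicy": redrive_policy(
#         "arn:aws:sqs:us-east-1:123456789012:orders-dlq")},
# )
```

The retry budget is the knob that decides how noisy the DLQ gets: too low and transient failures dead-letter messages that would have succeeded, too high and genuinely broken messages churn before surfacing.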