Comment by with

1 month ago

We strictly used AWS for everything and always preferred AWS-managed services, so we always used SQS with its built-in DLQ functionality. It made it easy to configure throttling, alerting, buffering, concurrency, retries, etc., and in a pinch you could use the UI to inspect the messages.
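For anyone who hasn't wired this up: a minimal sketch of attaching a DLQ to a source queue through its `RedrivePolicy` attribute. The helper names, the queue URL/ARN, and the default of 5 receives are assumptions for illustration, not what we actually ran. `sqs_client` is expected to be something like `boto3.client("sqs")`.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the queue attributes that point a source queue at its DLQ.

    After a message has been received max_receive_count times without
    being deleted, SQS moves it to the queue identified by dlq_arn.
    """
    return {
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dlq_arn,
                "maxReceiveCount": max_receive_count,
            }
        )
    }


def attach_dlq(sqs_client, queue_url: str, dlq_arn: str, max_receive_count: int = 5) -> None:
    """Apply the redrive policy to an existing queue (hypothetical helper)."""
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=redrive_attributes(dlq_arn, max_receive_count),
    )
```

Usage would look roughly like `attach_dlq(boto3.client("sqs"), queue_url, dlq_arn, max_receive_count=3)`, with both values taken from your own account.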

As for fixing actual critical issues: the message inside the DLQ usually carried a trace that was revealing enough, although the fix itself was not always trivial.

The philosophy was to either (1) fix the issue or (2) swallow the issue (rarer), but either way make sure that message never comes back to the DLQ again.
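Roughly what "fix or swallow" can look like in a consumer, as a sketch only. `IgnorableError` and `handle` are hypothetical names, and the assumption is that the caller deletes the message from the queue whenever `handle` returns normally. Anything not explicitly swallowed propagates, so the message stays undeleted, gets redelivered, and eventually lands in the DLQ once it exceeds the receive limit.

```python
import logging

logger = logging.getLogger("worker")


class IgnorableError(Exception):
    """Hypothetical marker for failures we have decided to swallow."""


def handle(message_body, process):
    """Process one message and report what happened.

    Returns "processed" or "swallowed"; in both cases the caller should
    delete the message. Any other exception is re-raised untouched so the
    queue retries the message and, after enough failures, dead-letters it.
    """
    try:
        process(message_body)
    except IgnorableError as exc:
        # Known, accepted failure: log it and acknowledge the message
        # so it never shows up in the DLQ again.
        logger.warning("swallowing known issue: %s", exc)
        return "swallowed"
    return "processed"
```

The point of the explicit exception type is that swallowing stays a deliberate, reviewable decision rather than a blanket `except Exception`.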