Comment by rbranson
9 hours ago
Biggest thing to watch out for with this approach is that you will inevitably have some failure or bug that will 10x, 100x, or 1000x the rate of dead messages and overload your DLQ database. You need a circuit breaker or rate limit on it.
This is the same risk with any DLQ.
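A minimal sketch of the kind of guard I mean, in Python: a token bucket in front of the DLQ write that sheds load once dead messages explode. dlq_insert and the numbers are placeholders, not anyone's real setup:

    import logging
    import time

    log = logging.getLogger("dlq")

    class TokenBucket:
        # allow at most `rate` DLQ writes/sec, with a burst allowance
        def __init__(self, rate, burst):
            self.rate, self.capacity = rate, burst
            self.tokens, self.last = float(burst), time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    limiter = TokenBucket(rate=50, burst=200)  # arbitrary numbers

    def dlq_insert(msg):
        ...  # stand-in for the actual DLQ database write

    def dead_letter(msg):
        if limiter.allow():
            dlq_insert(msg)
        else:
            # shed load instead of melting the DLQ database; put this
            # on a dashboard so the drops are visible
            log.warning("DLQ write shed by rate limiter")

A real circuit breaker goes a step further: after N consecutive DLQ failures it trips open and stops touching the database entirely for a cooldown period.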
The idea behind a DLQ is that it will retry (with some backoff) eventually, and if a message fails enough times, it stays there. You need monitoring to observe the messages that can't escape the DLQ. Ideally, nothing should ever stay in the DLQ, and if something does, it's something that should be fixed.
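Concretely, the reprocessing pass usually looks something like this sketch: exponential backoff with jitter, a cap, and a park threshold past which someone gets paged. handle and the DLQ helpers here are placeholders:

    import random
    import time

    BASE, CAP, MAX_ATTEMPTS = 2.0, 3600.0, 8  # tune for your workload

    def backoff(attempt):
        # exponential backoff with full jitter, capped at an hour
        return random.uniform(0, min(CAP, BASE * 2 ** attempt))

    def handle(payload): ...          # placeholder: the original handler
    def delete_from_dlq(row_id): ...  # placeholder
    def mark_parked(row_id): ...      # placeholder: alert on parked rows
    def reschedule(row_id, when): ... # placeholder

    def reprocess(row):
        try:
            handle(row.payload)
            delete_from_dlq(row.id)   # escaped the DLQ: the happy path
        except Exception:
            if row.attempts + 1 >= MAX_ATTEMPTS:
                mark_parked(row.id)   # stays put until a human fixes it
            else:
                reschedule(row.id, time.time() + backoff(row.attempts))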
I worked on an app that sent an internal email with the stack trace whenever an unhandled exception occurred. Worked great until the day an OOM in a tight loop on a box in Asia sent a few hundred emails per second and saturated the company WAN backbone and the whole team's mailboxes. Good times.
This! The only thing worse than your main queue backing up is dropping items on their way into the DLQ because it can't stay up.
If you can’t deliver to the DLQ, then what? Then you’re missing messages either way. Who cares whether you lose them this way or the other?
Not necessarily. If you can't deliver the message anywhere, you don't ACK it, and the sender can choose what to do (retry, back off, etc.).
Sure, it's unavailability of course, but it's not data loss.
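With RabbitMQ/pika, for example, that looks roughly like this; process() and the queue names are placeholders:

    import pika

    def process(body):
        ...  # placeholder for the real handler

    def on_message(ch, method, properties, body):
        try:
            process(body)
            ch.basic_ack(method.delivery_tag)
        except Exception:
            try:
                ch.basic_publish("", "work.dlq", body)  # try the DLQ
                ch.basic_ack(method.delivery_tag)
            except pika.exceptions.AMQPError:
                # DLQ unreachable: don't ack, let the broker redeliver.
                # Unavailability, not data loss. (If the channel itself
                # died, unacked messages are redelivered on reconnect.)
                ch.basic_nack(method.delivery_tag, requeue=True)

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.basic_consume("work", on_message)
    ch.start_consuming()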
If you are reading from Kafka (for example) and you can't do anything with a message (broken JSON, say) and you can't put it into a DLQ, you have no other option but to skip it or stop on it, no?
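Concretely, with kafka-python the choice looks something like this sketch (topic names and handle() are made up): try the DLQ topic, and if that write fails, stop without committing the offset rather than silently skip:

    import json

    from kafka import KafkaConsumer, KafkaProducer
    from kafka.errors import KafkaError

    def handle(payload):
        ...  # placeholder for real processing

    consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                             group_id="orders-worker",
                             enable_auto_commit=False)
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    for msg in consumer:
        try:
            payload = json.loads(msg.value)
        except ValueError:
            # broken JSON: can't process it, try to park it in the DLQ
            try:
                producer.send("orders.dlq", msg.value).get(timeout=10)
            except KafkaError:
                # ...and the DLQ is unreachable too: stop here without
                # committing, so we resume at this offset instead of
                # skipping the message
                break
        else:
            handle(payload)
        consumer.commit()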
The point is to not take the whole server down with it. Keeps the other applications working.
Sure, but you still need to design around this problem. If you don't, everything turning out fine is just a happy accident.
It will happen eventually in any system.
No need to look down on PG just because it makes this more approachable and is no longer a specialized skill.