Comment by rbranson
9 hours ago
Biggest thing to watch out for with this approach is that you will inevitably have some failure or bug that will 10x, 100x, or 1000x the rate of dead messages and overload your DLQ database. You need a circuit breaker or rate limit on it.
This is the same risk with any DLQ.
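A minimal sketch of the kind of guard I mean, in Python: a token bucket in front of the DLQ write that sheds load once dead messages explode. dlq_insert and the numbers are placeholders, not anyone's real setup:

    import logging
    import time

    log = logging.getLogger("dlq")

    class TokenBucket:
        # allow at most `rate` DLQ writes/sec, with a burst allowance
        def __init__(self, rate, burst):
            self.rate, self.capacity = rate, burst
            self.tokens, self.last = float(burst), time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    limiter = TokenBucket(rate=50, burst=200)  # arbitrary numbers

    def dlq_insert(msg):
        ...  # stand-in for the actual DLQ database write

    def dead_letter(msg):
        if limiter.allow():
            dlq_insert(msg)
        else:
            # shed load instead of melting the DLQ database; put this
            # on a dashboard so the drops are visible
            log.warning("DLQ write shed by rate limiter")

A real circuit breaker goes a step further: after N consecutive DLQ failures it trips open and stops touching the database entirely for a cooldown period.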
The idea behind a DLQ is that it will retry (with some backoff) eventually, and if a message fails enough times, it stays there. You need monitoring to observe the messages that can't escape the DLQ. Ideally, nothing should ever stay in the DLQ, and if something does, it's something that should be fixed.
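Concretely, the reprocessing pass usually looks something like this sketch: exponential backoff with jitter, a cap, and a park threshold past which someone gets paged. handle and the DLQ helpers here are placeholders:

    import random
    import time

    BASE, CAP, MAX_ATTEMPTS = 2.0, 3600.0, 8  # tune for your workload

    def backoff(attempt):
        # exponential backoff with full jitter, capped at an hour
        return random.uniform(0, min(CAP, BASE * 2 ** attempt))

    def handle(payload): ...          # placeholder: the original handler
    def delete_from_dlq(row_id): ...  # placeholder
    def mark_parked(row_id): ...      # placeholder: alert on parked rows
    def reschedule(row_id, when): ... # placeholder

    def reprocess(row):
        try:
            handle(row.payload)
            delete_from_dlq(row.id)   # escaped the DLQ: the happy path
        except Exception:
            if row.attempts + 1 >= MAX_ATTEMPTS:
                mark_parked(row.id)   # stays put until a human fixes it
            else:
                reschedule(row.id, time.time() + backoff(row.attempts))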
I worked on an app that sent an internal email with the stack trace whenever an unhandled exception occurred. Worked great until the day an OOM in a tight loop on a box in Asia sent a few hundred emails per second and saturated the company WAN backbone and the whole team's mailboxes. Good times.
This! The only thing worse than your main queue backing up is dropping items on their way into the DLQ because it can't stay up.
If you can’t deliver to the DLQ, then what? Then you’re missing messages either way. Who cares whether you lose them this way or the other?
Not necessarily. If you can't deliver the message anywhere, you don't ACK it, and the sender can choose what to do (retry, back off, etc.).
Sure, it's unavailability of course, but it's not data loss.
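With RabbitMQ/pika, for example, that looks roughly like this; process() and the queue names are placeholders:

    import pika

    def process(body):
        ...  # placeholder for the real handler

    def on_message(ch, method, properties, body):
        try:
            process(body)
            ch.basic_ack(method.delivery_tag)
        except Exception:
            try:
                ch.basic_publish("", "work.dlq", body)  # try the DLQ
                ch.basic_ack(method.delivery_tag)
            except pika.exceptions.AMQPError:
                # DLQ unreachable: don't ack, let the broker redeliver.
                # Unavailability, not data loss. (If the channel itself
                # died, unacked messages are redelivered on reconnect.)
                ch.basic_nack(method.delivery_tag, requeue=True)

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.basic_consume("work", on_message)
    ch.start_consuming()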
If you are reading from Kafka (for example) and you can't do anything with a message (broken JSON, say) and you can't put it into a DLQ, you have no other option but to skip it or stop on it, no?
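Concretely, with kafka-python the choice looks something like this sketch (topic names and handle() are made up): try the DLQ topic, and if that write fails, stop without committing the offset rather than silently skip:

    import json

    from kafka import KafkaConsumer, KafkaProducer
    from kafka.errors import KafkaError

    def handle(payload):
        ...  # placeholder for real processing

    consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092",
                             group_id="orders-worker",
                             enable_auto_commit=False)
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    for msg in consumer:
        try:
            payload = json.loads(msg.value)
        except ValueError:
            # broken JSON: can't process it, try to park it in the DLQ
            try:
                producer.send("orders.dlq", msg.value).get(timeout=10)
            except KafkaError:
                # ...and the DLQ is unreachable too: stop here without
                # committing, so we resume at this offset instead of
                # skipping the message
                break
        else:
            handle(payload)
        consumer.commit()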
The point is to not take the whole server down with it. Keeps the other applications working.
Sure, but you still need to design around this problem. If you don't, everything turning out fine is just a happy accident.
It will happen eventually in any system.
No need to look down on PG just because it makes this more approachable and is no longer a specialized skill.