Comment by konart
11 hours ago
If you are reading from Kafka (for example) and you can't do anything with a message (broken JSON, for example) and you can't put it into a DLQ - you have no other option but to skip it or stop on it, no?
Your place of last resort with Kafka is simply to replay the message back to the same Kafka topic, since you know it's up. In a simple single-consumer setup, just throw a retry count on the message and increment it to get monitoring/alerting/etc. Multi-consumer? Put an enqueue-source tag on it and only process the messages tagged for you. This won't scale to infinity, but it scales really, really far for really, really cheap.
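For the single-consumer case, here's a minimal sketch of that replay-with-retry-count idea, assuming the confluent-kafka Python client; the topic name, header name, and retry limit are all made up for illustration:

```python
# Sketch: replay an unprocessable message to the same topic with an
# incremented retry-count header instead of needing a separate DLQ.
# Assumes confluent-kafka; "events", "retry-count", and MAX_RETRIES are
# illustrative, not anything the commenter specified.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-service",
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["events"])

MAX_RETRIES = 5  # past this, alert a human instead of looping forever


def retry_count(msg):
    # Kafka headers arrive as a list of (key, bytes) tuples, or None.
    for key, value in (msg.headers() or []):
        if key == "retry-count":
            return int(value)
    return 0


def process(payload):
    ...  # your business logic; raises on failure


while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        process(msg.value())
    except Exception:
        n = retry_count(msg) + 1
        if n <= MAX_RETRIES:
            # Replay to the same topic; monitoring can watch this header.
            producer.produce(msg.topic(), value=msg.value(), key=msg.key(),
                             headers=[("retry-count", str(n).encode())])
            producer.flush()
        # else: emit a metric / page someone rather than retrying forever
    consumer.commit(msg)
```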
Generally yes, but if you use e.g. the parallel consumer, you can potentially keep processing in that partition to avoid head-of-line blocking. There are downsides to leaving a very old record unprocessed: the consumer group's offset can't advance past it, so the consumer instead keeps track of the individual offsets it has completed beyond it. You don't want to be in that state indefinitely, but you hope your DLQ eventually succeeds.
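A toy illustration (not the parallel consumer's actual code) of the bookkeeping this implies: the group position can only be committed up to the first gap, so everything completed beyond a stuck record has to be remembered individually:

```python
# Toy illustration of why one stuck record pins the consumer group's offset:
# the committable position can only advance to the first gap, so every
# completed offset beyond the stuck one must be tracked individually.
def committable_offset(last_committed: int, completed: set) -> int:
    """Highest contiguous offset we can safely report as done."""
    next_offset = last_committed
    while next_offset in completed:
        next_offset += 1
    return next_offset

completed = {101, 102, 104, 105, 106}      # 103 is stuck / still retrying
print(committable_offset(101, completed))  # -> 103: can't move past the gap
```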
But if your DLQ is overloaded, you probably want to slow down or stop, since sending a large fraction of your traffic to the DLQ is counterproductive. E.g. if you are sending 100% of messages to the DLQ due to a bug, you should stop processing, fix the bug, and then resume from your normal queue.
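One way to express "stop if too much is going to the DLQ" is a simple circuit breaker around the consume loop; this is just a sketch, and the window size and threshold are arbitrary placeholders:

```python
# Sketch of a DLQ circuit breaker: if too large a fraction of recent messages
# is being dead-lettered, stop consuming so the bug can be fixed and the
# normal topic resumed. Window size and threshold are arbitrary placeholders.
from collections import deque

class DlqCircuitBreaker:
    def __init__(self, window=1000, max_dlq_fraction=0.5):
        self.outcomes = deque(maxlen=window)   # True = message went to DLQ
        self.max_dlq_fraction = max_dlq_fraction

    def record(self, went_to_dlq):
        self.outcomes.append(went_to_dlq)

    def should_stop(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False                       # not enough data yet
        return sum(self.outcomes) / len(self.outcomes) > self.max_dlq_fraction

# In the consume loop, roughly:
#   breaker.record(went_to_dlq)
#   if breaker.should_stop():
#       consumer.pause(consumer.assignment())  # or exit and page someone
```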
Sorry, but what's stopping the DLQ from being a different topic on the same Kafka cluster? I get that the consumer(s) might be dead, preventing them from moving the message to the DLQ topic, but if that's the case then no messages are being consumed at all.
If the problem is that the consumers themselves cannot write to the DLQ, then that feels like either Kafka is dying (no more writes allowed) or the consumers have been misconfigured.
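For reference, the same-cluster approach usually amounts to producing the failed record to a sibling topic with the failure reason attached; a hedged sketch with the confluent-kafka client, where the ".dlq" suffix and header names are purely illustrative:

```python
# Sketch of "the DLQ is just another topic on the same cluster": produce the
# failed record to a sibling topic with the failure reason attached, then
# commit the original offset. The ".dlq" suffix and header names are made up.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def send_to_dlq(msg, error):
    producer.produce(
        f"{msg.topic()}.dlq",
        value=msg.value(),
        key=msg.key(),
        headers=[
            ("dlq-error", str(error).encode()),
            ("dlq-source-partition", str(msg.partition()).encode()),
            ("dlq-source-offset", str(msg.offset()).encode()),
        ],
    )
    producer.flush()
```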
Edit: In fact there seems to be a self-inflicted problem being created here - having the DLQ on a different system, whether it be another instance of Kafka, or Postgres, or what have you, is really just creating another point of failure.
> Edit: In fact there seems to be a self-inflicted problem being created here - having the DLQ on a different system, whether it be another instance of Kafka, or Postgres, or what have you, is really just creating another point of failure.
There's a balance. Do you want your Kafka cluster provisioned for double your normal event intake rate, just in case a worst-case failure to produce elsewhere causes 100% of events to get DLQ'd? (That would double your writes to the shared cluster, which could itself cause failures to produce to the original topic.)
In that sort of system, failing to produce to the original topic is probably what you want to avoid most. As long as your retention period isn't shorter than your time to recover from an incident like that, priority 1 is often "make sure the events are recorded so they can be processed later."
IMO a good architecture here cleanly separates transient failures (don't DLQ; retry with backoff and don't advance the consumer group) from "permanently cannot process" (DLQ only these), unlike in the linked article. That greatly reduces the odds of "everything is being DLQ'd!" causing cascading failures by overloading seldom-stressed parts of the system. It makes it much easier to keep your DLQ in one place, and you can solve some of the visibility problems from the article with a consumer that puts summary info elsewhere or similar. There's still a chance a bug results in everything being wrongly rejected, but it makes you potentially much more robust against transient downstream deps having a high blast radius. (One nasty case: if different messages have wildly different sets of downstream deps, do you want some of them blocking all the others? IMO they should be partitioned so you can still make progress on the rest.)
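A sketch of that transient/permanent split, with made-up exception classes standing in for whatever classification you'd actually do (say, timeouts as transient, schema/validation failures as permanent); only permanent failures get dead-lettered, and transient ones retry without committing:

```python
# Sketch of separating "retry, don't advance" from "permanently cannot
# process". TransientError / PermanentError are illustrative; in practice
# you'd classify timeouts and 5xx-style failures as transient, and
# schema/validation failures as permanent.
import time

class TransientError(Exception):
    pass   # e.g. a downstream dependency timed out

class PermanentError(Exception):
    pass   # e.g. the message can never be processed

def handle(msg, process, send_to_dlq, max_backoff=60.0):
    backoff = 1.0
    while True:
        try:
            process(msg)
            return                   # success: safe to commit
        except PermanentError as e:
            send_to_dlq(msg, e)      # DLQ only these
            return                   # then commit past it
        except TransientError:
            # Retry in place with backoff; the offset is not committed,
            # so the consumer group does not advance past this record.
            time.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```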
I think you're right to point out that an overused DLQ could potentially cripple the whole event broker, but I don't think having a second system that could fall over for the same reason AND a host of other reasons is a good plan. FTR I think doubling Kafka's provisioned capacity is a simpler, easier, cheaper, and more reliable approach.
BUT, you are 100% right to point to what I think is the proper solution, and that is to treat the DLQ with some respect, not as a bit bucket where things get dumped because the wind isn't blowing in the right direction.