Comment by xyzzy_plugh

Not necessarily. If you can't deliver the message somewhere, you don't ACK it, and the sender can choose what to do (retry, back off, etc.).

Sure, that's unavailability, of course, but it's not data loss.

If you are reading from Kafka (for example) and you can't do anything with a message (broken JSON, say) and you can't put it into a DLQ, you have no other option but to skip it or stop on it, no?

  • Your place of last resort with Kafka is simply to replay the message back onto the same Kafka topic, since you know it's up. In a simple single-consumer setup, just put a retry count on the message and increment it, which also gets you monitoring/alerting/etc. Multiple consumers? Put an enqueue-source tag on the message and only process the messages tagged for you. This won't scale to infinity, but it scales really, really far for really, really cheap.
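
    A minimal sketch of the retry-count idea, assuming plain String-serialized records; the "retry-count" header name and MAX_RETRIES cutoff are made up for illustration, not a Kafka convention:

      // Re-enqueue a failed record onto the same topic, carrying an
      // incremented retry count in a header so you can alert on it.
      import java.nio.charset.StandardCharsets;
      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;
      import org.apache.kafka.common.header.Header;

      public class Requeue {
          static final String RETRY_HEADER = "retry-count"; // hypothetical name
          static final int MAX_RETRIES = 5;

          static void requeue(KafkaProducer<String, String> producer,
                              ConsumerRecord<String, String> failed) {
              Header h = failed.headers().lastHeader(RETRY_HEADER);
              int retries = (h == null)
                  ? 0 : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));
              if (retries >= MAX_RETRIES) {
                  // Out of retries: alert here and decide to skip or stop.
                  throw new IllegalStateException("retries exhausted at offset " + failed.offset());
              }
              ProducerRecord<String, String> replay =
                  new ProducerRecord<>(failed.topic(), failed.key(), failed.value());
              replay.headers().add(RETRY_HEADER,
                  String.valueOf(retries + 1).getBytes(StandardCharsets.UTF_8));
              producer.send(replay); // back onto the same topic we consumed from
          }
      }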

  • Generally yes, but if you use e.g. the parallel consumer, you can potentially keep processing that partition to avoid head-of-line blocking. There are downsides to sitting on a very old unprocessed record: the consumer group's committed offset can't advance past it, so the consumer instead has to track the individual offsets it has completed beyond that point. You don't want to be in that state indefinitely, but the hope is that the DLQ write eventually succeeds.

    But if your DLQ is overloaded, you probably want to slow down or stop, since sending a large fraction of your traffic to the DLQ is counterproductive. E.g., if a bug is sending 100% of messages to the DLQ, you should stop processing, fix the bug, and then resume from your normal queue.
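
    A sketch of that back-off with a plain consumer (not the parallel consumer): pause the assigned partitions while the DLQ looks unhealthy and resume once it recovers. dlqHealthy() is a placeholder for whatever signal you have (producer error rate, a circuit breaker), not a Kafka API:

      import java.time.Duration;
      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.clients.consumer.KafkaConsumer;

      public class DlqBackpressure {
          // Placeholder: wire this to your DLQ producer's health signal.
          static boolean dlqHealthy() { return true; }

          static void process(ConsumerRecord<String, String> r) { /* app logic */ }

          static void pollLoop(KafkaConsumer<String, String> consumer) {
              boolean paused = false;
              while (true) {
                  if (!dlqHealthy() && !paused) {
                      // Stop fetching, but keep calling poll() so we stay in the group.
                      consumer.pause(consumer.assignment());
                      paused = true;
                  } else if (dlqHealthy() && paused) {
                      consumer.resume(consumer.assignment());
                      paused = false;
                  }
                  consumer.poll(Duration.ofMillis(500)).forEach(DlqBackpressure::process);
              }
          }
      }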

  • Sorry, but what's stopping the DLQ from being a different topic on the same Kafka cluster? I get that the consumer(s) might be dead, preventing them from moving the message to the DLQ topic, but if that's the case then no messages are being consumed at all.

    If the problem is that the consumers themselves cannot write to the DLQ, then that feels like either Kafka is dying (no more writes allowed) or the consumers have been misconfigured.

    Edit: In fact, there seems to be a self-inflicted problem being created here: having the DLQ on a different system, whether it's another instance of Kafka, or Postgres, or what have you, is really just creating another point of failure.
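
    For what it's worth, a sketch of that same-cluster DLQ: produce the unprocessable record to a sibling topic, tagged with where it came from. The "<topic>.dlq" naming and the header names here are conventions chosen for illustration, not a standard:

      import java.nio.charset.StandardCharsets;
      import org.apache.kafka.clients.consumer.ConsumerRecord;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;

      public class DeadLetter {
          static void toDlq(KafkaProducer<String, String> producer,
                            ConsumerRecord<String, String> failed, Exception cause) {
              ProducerRecord<String, String> dead =
                  new ProducerRecord<>(failed.topic() + ".dlq", failed.key(), failed.value());
              dead.headers()
                  .add("source-topic", failed.topic().getBytes(StandardCharsets.UTF_8))
                  .add("source-partition", Integer.toString(failed.partition()).getBytes(StandardCharsets.UTF_8))
                  .add("source-offset", Long.toString(failed.offset()).getBytes(StandardCharsets.UTF_8))
                  .add("error", String.valueOf(cause.getMessage()).getBytes(StandardCharsets.UTF_8));
              producer.send(dead); // same cluster, so no extra point of failure
          }
      }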