I have a server-side app written in Go that produces Kafka events. It runs perfectly for days, producing ~1.6k msg/sec, and then sporadically hits a problem where all Kafka message sending stops and the server app needs a manual restart before Kafka messages resume sending.
I've included a screenshot of the metric graphs when the incident started. To annotate what I see happening:
For seven days, the app runs perfectly. For every message queued, there is a delivery event notification sent to `kafkaProducer.Events()`; you can see that num queued = num delivered.

10:39: The issue starts. The delivery notification count quickly drops to zero. Kafka messages keep getting queued, but the callbacks stop.

10:52: `kafkaProducer.ProduceChannel()` fills up, and attempts to queue new messages into the Go channel block the goroutine. At this point the app will never send another Kafka message until it is manually restarted.

17:55: I manually restarted the application. Kafka message queueing/delivery resumes, and kafka_produce_attempts drops back to zero.
The one and only place my Go code sends Kafka messages is here:
```go
recordChannelGauge.Inc()               // gauge of goroutines currently blocked on the send
kafkaProducer.ProduceChannel() <- &msg // hand the message to the client's buffered channel
recordChannelGauge.Dec()
```
In the metric screenshot, note that `recordChannelGauge` normally stays at zero, because sending the message to the Kafka `ProduceChannel()` doesn't block and each `Inc()` is immediately followed by a matching `Dec()`. However, when the `ProduceChannel()` is filled up, the goroutine blocks on the send: `recordChannelGauge` stays at 1, and the goroutine never unblocks until the app is manually restarted.
FYI, my environment details:
- Go server binary built with golang 1.10.x
- Latest version of `github.com/confluentinc/confluent-kafka-go/kafka`. This library doesn't publish version tags; it tracks the latest git commit, which as of this writing is two months old, so I'm sure I'm using the latest version.
- Server OS: Ubuntu 16.04.5
- librdkafka1 version librdka0.11.6~1confluent5.0.1-
I suspect this is due to some internal problem in the confluentinc go client, where it doesn't handle some error scenario appropriately.
Also, I see no relevant log output around the time of the problem. I do see sporadic broker-disconnect and request-timeout errors in the logs before the problem happened, but they don't seem serious; these messages appeared every few hours or so for days without serious consequence.
```
Nov 26 06:52:04 01 appserver.linux[6550]: %4|1543215124.447|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 06:52:10 01 appserver.linux[6550]: %4|1543215130.448|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:46:57 01 appserver.linux[6550]: 2018/11/26 08:46:57 Ignored event: kafka-broker-2:9092/bootstrap: Disconnected (after 600000ms in state UP)
Nov 26 08:47:02 01 appserver.linux[6550]: %4|1543222022.803|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:47:09 01 appserver.linux[6550]: %4|1543222029.807|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
```