douyanyan1123 2018-11-26 20:33

Go Kafka `ProduceChannel()` fills up and hangs

I have a server-side app written in Go producing Kafka events. It runs perfectly for days, producing ~1.6k msg/sec, and then hits a sporadic problem, where all Kafka message sending stops, and the server app needs to be manually restarted for Kafka messages to resume sending.

I've included screenshots of the metric graphs from when the incident started. To annotate what I see happening:

  • For seven days, the app runs perfectly. For every message queued, there is a delivery event notification sent to kafkaProducer.Events() (see the sketch after this list). You can see that num queued = num delivered.

  • 10:39: The issue starts. The delivery notification count quickly drops to zero. Kafka messages keep getting queued, but the callbacks stop.

  • 10:52: kafkaProducer.ProduceChannel() is filled up, and attempts to queue new messages into the Go channel block the goroutine. At this point the app will never send another Kafka message until it is manually restarted.

  • 17:55: I manually restarted the application. Kafka message queueing/delivery resumes, and kafka_produce_attempts drops back to zero.
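
For context, here's a minimal sketch of the delivery-event loop this setup implies (illustrative only, not my actual code; deliveredCounter is a placeholder name, and it assumes the github.com/confluentinc/confluent-kafka-go/kafka and standard log imports):

    go func() {
        for e := range kafkaProducer.Events() {
            switch ev := e.(type) {
            case *kafka.Message:
                // Per-message delivery report from the client.
                if ev.TopicPartition.Error != nil {
                    log.Printf("delivery failed: %v", ev.TopicPartition.Error)
                } else {
                    deliveredCounter.Inc() // this is what "num delivered" counts
                }
            case kafka.Error:
                // Client-level errors, e.g. broker disconnects.
                log.Printf("kafka error: %v", ev)
            }
        }
    }()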

The one and only place my Go code sends Kafka messages is here:

    recordChannelGauge.Inc()               // about to enqueue
    kafkaProducer.ProduceChannel() <- &msg // blocks if the channel is full
    recordChannelGauge.Dec()               // enqueued

In the metric screenshot, note that recordChannelGauge normally stays at zero: sending a message to ProduceChannel() doesn't block, so each Inc() is immediately followed by a matching Dec(). However, when ProduceChannel() fills up, the goroutine blocks, recordChannelGauge stays at 1, and it never unblocks until the app is manually restarted.
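
For what it's worth, a guarded send like this (hypothetical; my code does the plain blocking send shown above, and time is the standard library package) would at least surface the stall instead of silently hanging the goroutine:

    select {
    case kafkaProducer.ProduceChannel() <- &msg:
        // Enqueued as usual.
    case <-time.After(10 * time.Second):
        // Channel has stayed full; alert instead of blocking forever.
        log.Println("ProduceChannel full for 10s; producer appears stalled")
    }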

FYI, my environment details:

  • Go server binary built with golang 1.10.x
  • Latest version of github.com/confluentinc/confluent-kafka-go/kafka. This library doesn't tag release versions; it's pinned to the latest git commit, which as of this writing is two months old, so I'm sure I'm on the latest version.
  • Server OS Ubuntu 16.04.5
  • librdkafka1 version 0.11.6~1confluent5.0.1-

I suspect this is due to some internal problem in the confluent-kafka-go client, where it doesn't handle some error scenario appropriately.
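
If it helps with diagnosis, I'm thinking of polling the client's internal queue depth as well (hypothetical addition, not in my current code): Producer.Len() reports the number of messages and requests still queued inside the client, plus unread delivery reports.

    go func() {
        for range time.Tick(30 * time.Second) {
            // Len() = messages/requests queued in the client plus
            // delivery reports not yet read from Events().
            log.Printf("kafka producer internal queue length: %d", kafkaProducer.Len())
        }
    }()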

Also, I see no relevant log output around the time of the problem. I do see sporadic broker Disconnected and request timeout errors in the logs from before the problem happened, but they don't look serious: these messages had been appearing every few hours for days without consequence.

    Nov 26 06:52:04 01 appserver.linux[6550]: %4|1543215124.447|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
    Nov 26 06:52:10 01 appserver.linux[6550]: %4|1543215130.448|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
    Nov 26 08:46:57 01 appserver.linux[6550]: 2018/11/26 08:46:57 Ignored event: kafka-broker-2:9092/bootstrap: Disconnected (after 600000ms in state UP)
    Nov 26 08:47:02 01 appserver.linux[6550]: %4|1543222022.803|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
    Nov 26 08:47:09 01 appserver.linux[6550]: %4|1543222029.807|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests

[Screenshot: metrics zoomed in to the problem occurrence]

[Screenshot: metrics zoomed out to show before and after]
