I have a server-side app written in Go that produces Kafka events. It runs perfectly for days, producing ~1.6k msg/sec, and then sporadically hits a problem where all Kafka message sending stops and the server app needs a manual restart before Kafka messages resume sending.
I've included a screenshot of the metric graphs when the incident started. To annotate what I see happening:
For seven days, the app runs perfectly. For every message queued, there is a delivery event notification sent to `kafkaProducer.Events()`; you can see that num queued = num delivered.

10:39: The issue starts. The delivery notification count quickly drops to zero. Kafka messages keep getting queued, but the callbacks stop.

10:52: `kafkaProducer.ProduceChannel()` fills up, and attempts to queue new messages into the Go channel block the goroutine. At this point the app will never send another Kafka message until it is manually restarted.

17:55: I manually restarted the application. Kafka message queueing/delivery resumes, and kafka_produce_attempts drops back to zero.
The one and only place my Go code sends Kafka messages is here:
```go
recordChannelGauge.Inc()               // gauge of goroutines currently blocked on the send
kafkaProducer.ProduceChannel() <- &msg // hand the message to the client's buffered channel
recordChannelGauge.Dec()
```
In the metric screenshot, note that `recordChannelGauge` normally stays at zero, because sending the message to the Kafka `ProduceChannel()` doesn't block and each `Inc()` is immediately followed by a matching `Dec()`. However, when the `ProduceChannel()` is filled up, the goroutine blocks on the send: `recordChannelGauge` stays at 1, and the goroutine never unblocks until the app is manually restarted.
FYI, my environment details:
- Go server binary built with golang 1.10.x
- Latest version of `github.com/confluentinc/confluent-kafka-go/kafka`. This library doesn't publish version tags; it tracks the latest git commit, which as of this writing is two months old, so I'm sure I'm using the latest version.
- Server OS: Ubuntu 16.04.5
- librdkafka1 version librdka0.11.6~1confluent5.0.1-
I suspect this is due to some internal problem in the confluentinc go client, where it doesn't handle some error scenario appropriately.
Also, I see no relevant log output around the time of the problem. I do see sporadic broker-disconnect and request-timeout errors in the logs before the problem happened, but they don't seem serious; these messages appeared every few hours or so for days without serious consequence.
```
Nov 26 06:52:04 01 appserver.linux[6550]: %4|1543215124.447|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 06:52:10 01 appserver.linux[6550]: %4|1543215130.448|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-3:9092/bootstrap]: kafka-broker-3:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:46:57 01 appserver.linux[6550]: 2018/11/26 08:46:57 Ignored event: kafka-broker-2:9092/bootstrap: Disconnected (after 600000ms in state UP)
Nov 26 08:47:02 01 appserver.linux[6550]: %4|1543222022.803|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
Nov 26 08:47:09 01 appserver.linux[6550]: %4|1543222029.807|REQTMOUT|rdkafka#producer-1| [thrd:kafka-broker-2:9092/bootstrap]: kafka-broker-2:9092/bootstrap: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
```