Summary
I have several single-node Kubernetes clusters that become unstable after accumulating ~300 completed jobs.
In one cluster, for example, there are 303 completed jobs:
```
root@xxxx:/home/xxxx# kubectl get jobs | wc -l
303
```
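(Strictly speaking, `wc -l` counts the header line as well; `kubectl get jobs --no-headers | wc -l` would give the exact job count.)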
Observations
What I observe:

- The `kubelet` logs are filled with error messages like this:

  ```
  kubelet[877]: E0219 09:06:14.637045 877 reflector.go:134] object-"default"/"job-162273560": Failed to list *v1.ConfigMap: Get https://172.13.13.13:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Djob-162273560&limit=500&resourceVersion=0: http2: no cached connection was available
  ```

- The node status is not being updated, with a similar error message (a quick check for this is sketched after this list):

  ```
  kubelet[877]: E0219 09:32:57.379751 877 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: Get https://172.13.13.13:6443/api/v1/nodes?fieldSelector=metadata.name%3Dxxxxx&limit=500&resourceVersion=0: http2: no cached connection was available
  ```

- Eventually, the node is marked as `NotReady` and no new pods are scheduled:

  ```
  NAME    STATUS     ROLES    AGE    VERSION
  xxxxx   NotReady   master   6d4h   v1.12.1
  ```

- The cluster keeps entering and exiting master disruption mode (from the `kube-controller-manager` logs):

  ```
  I0219 09:29:46.875397 1 node_lifecycle_controller.go:1015] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
  I0219 09:30:16.877715 1 node_lifecycle_controller.go:1042] Controller detected that some Nodes are Ready. Exiting master disruption mode.
  ```
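To confirm that the kubelet has stopped updating node status, the heartbeat timestamps in the node conditions can be inspected directly — a minimal check, assuming the node name `xxxxx` from above:

```bash
# Print each node condition with its last heartbeat time; the
# timestamps stop advancing once the kubelet's list/watch calls fail.
kubectl get node xxxxx \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.lastHeartbeatTime}{"\n"}{end}'
```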
The real culprit appears to be the `http2: no cached connection was available` error. The only substantive references I've found are a couple of issues in the Go repository (like #16582), which appear to have been fixed a long time ago.
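One thing worth checking is whether the kubelet is opening new TCP connections to the API server at all, or whether it is stuck multiplexing everything over a few HTTP/2 connections. A rough way to count its connections — a sketch, assuming the API server address `172.13.13.13:6443` from the logs above:

```bash
#!/bin/bash
# Count established TCP connections from the kubelet to the apiserver.
APISERVER="172.13.13.13:6443"
KUBELET_PID=$(pgrep -o kubelet)

ss -tnp state established dst "$APISERVER" | grep --count "pid=$KUBELET_PID"
```

If this number stays small while the errors above are being logged, the limit being hit is presumably per-connection (HTTP/2 streams) rather than at the socket level.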
In most cases, deleting the completed jobs seems to restore the cluster's stability.
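For completeness, this is roughly the cleanup I run — a sketch that assumes every job of interest reports at least one succeeded pod (the per-job ConfigMaps from the repro below have to be deleted separately):

```bash
#!/bin/bash
# Delete all jobs whose status reports at least one succeeded pod.
kubectl get jobs \
  -o jsonpath='{range .items[?(@.status.succeeded==1)]}{.metadata.name}{"\n"}{end}' \
  | xargs --no-run-if-empty kubectl delete job
```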
Minimal repro (tbc)
I seem to be able to reproduce this problem by creating lots of jobs whose containers mount ConfigMaps:
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: job-%JOB_ID%
data:
  # Just some sample data
  game.properties: |
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-%JOB_ID%
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(20)"]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      volumes:
      - name: config-volume
        configMap:
          name: job-%JOB_ID%
      restartPolicy: Never
  backoffLimit: 4
```
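Note that each job gets its own ConfigMap, and the kubelet appears to maintain a separate list/watch per mounted ConfigMap (note the `fieldSelector=metadata.name%3Djob-…` in the errors above), so the number of watches grows with the number of completed jobs.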
Schedule lots of these jobs:
```bash
#!/bin/bash
# Create 300 jobs (plus their ConfigMaps) from the template above.
for i in $(seq 100 399); do
  sed "s/%JOB_ID%/$i/g" job.yaml | kubectl create -f -
  sleep 0.1
done
```
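While the jobs pile up, the failure mode can be watched from two terminals — a sketch, assuming the kubelet runs as a systemd unit named `kubelet` (which matches the `kubelet[877]:` prefix in the logs above):

```bash
# Terminal 1: follow the kubelet logs for the http2 error.
journalctl -u kubelet -f | grep --line-buffered 'no cached connection was available'

# Terminal 2: poll node readiness until it flips to NotReady.
while true; do kubectl get nodes --no-headers; sleep 10; done
```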
Questions
I'm very curious, though, as to what causes this problem, since ~300 completed jobs seems like a fairly low number.
Is this a configuration problem in my cluster? A possible bug in Kubernetes or Go? Is there anything else I can try?