Kubernetes cluster unstable with completed jobs; kubelet logs filled with "http2: no cached connection was available"

Summary

I have various single-node Kubernetes clusters which become unstable after having accumulated ~300 completed jobs.

In one cluster, for example, there are 303 completed jobs:

root@xxxx:/home/xxxx# kubectl get jobs | wc -l
303
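Note that `kubectl get jobs | wc -l` also counts the header line that kubectl prints, so the figure above may be off by one. A minimal sketch for counting only completed jobs, assuming every job needs a single successful pod (as in the repro further down); the function name `count_completed_jobs` and the `KUBECTL` override variable are hypothetical helpers, not part of kubectl:

```shell
#!/bin/bash
# KUBECTL may be overridden to point at a different binary or a test stub.
KUBECTL="${KUBECTL:-kubectl}"

# Print the number of jobs in a namespace that have exactly one succeeded pod.
count_completed_jobs() {
    local namespace="${1:-default}"
    # --no-headers drops the column header so only job rows are counted;
    # tr strips the padding some wc implementations add.
    "$KUBECTL" get jobs --namespace "$namespace" \
        --field-selector status.successful=1 --no-headers | wc -l | tr -d ' '
}
```

Usage: `count_completed_jobs default`.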

Observations

What I observe is that:

- The kubelet logs are filled with error messages like this:

    kubelet[877]: E0219 09:06:14.637045 877 reflector.go:134] object-"default"/"job-162273560": Failed to list *v1.ConfigMap: Get https://172.13.13.13:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Djob-162273560&limit=500&resourceVersion=0: http2: no cached connection was available

- The node status is not being updated, with a similar error message:

    kubelet[877]: E0219 09:32:57.379751 877 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: Get https://172.13.13.13:6443/api/v1/nodes?fieldSelector=metadata.name%3Dxxxxx&limit=500&resourceVersion=0: http2: no cached connection was available

- Eventually, the node is being marked as NotReady and no new pods are scheduled:

    NAME    STATUS     ROLES    AGE    VERSION
    xxxxx   NotReady   master   6d4h   v1.12.1

- The cluster is entering and exiting the master disruption mode (from the kube-controller-manager logs):

    I0219 09:29:46.875397 1 node_lifecycle_controller.go:1015] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
    I0219 09:30:16.877715 1 node_lifecycle_controller.go:1042] Controller detected that some Nodes are Ready. Exiting master disruption mode.
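To put a number on "filled with", the occurrences of the error can be counted from the kubelet log. This is only a sketch: the function name `count_http2_errors` is hypothetical, and the usage line assumes the kubelet runs as a systemd unit named `kubelet`, which the post does not state.

```shell
#!/bin/bash
# Count log lines on stdin that contain the http2 connection-cache error.
count_http2_errors() {
    # -F matches the message as a fixed string, -c prints the number of matching lines.
    grep -cF 'http2: no cached connection was available'
}
```

Usage (assuming a systemd-managed kubelet): `journalctl -u kubelet --since "1 hour ago" | count_http2_errors`.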

The real culprit appears to be the http2: no cached connection was available error message. The only real references I've found are a couple of issues in the Go repository (like #16582), which appear to have been fixed a long time ago.

In most cases, deleting the completed jobs seems to restore the system stability.
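The cleanup mentioned above can be sketched as follows, assuming each job needs exactly one successful pod so that the `status.successful=1` field selector matches every completed job. The function name `delete_completed_jobs` and the `KUBECTL` override variable are hypothetical helpers introduced for illustration.

```shell
#!/bin/bash
# KUBECTL may be overridden (for example KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

# Delete all jobs in a namespace that report exactly one succeeded pod.
delete_completed_jobs() {
    local namespace="${1:-default}"
    "$KUBECTL" delete jobs --namespace "$namespace" \
        --field-selector status.successful=1
}
```

Usage: `delete_completed_jobs default`. This removes only the Job objects; ConfigMaps created alongside them stay in place and would need to be deleted separately.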

Minimal repro (tbc)

I seem to be able to reproduce this problem by creating lots of jobs which use containers which mount ConfigMaps:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: job-%JOB_ID%
data:
# Just some sample data
  game.properties: |
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-%JOB_ID%
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(20)"]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      volumes:
        - name: config-volume
          configMap:
            name: job-%JOB_ID%
      restartPolicy: Never
  backoffLimit: 4

Schedule lots of these jobs:

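#!/bin/bash
# Create 300 ConfigMap/Job pairs (IDs 100..399) from the job.yaml template.
for i in $(seq 100 399); do
    # Substitute the job ID into the template and submit it.
    sed "s/%JOB_ID%/$i/g" job.yaml | kubectl create -f -
    sleep 0.1
done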

Questions

I'm very curious though as to what causes this problem, as 300 completed jobs seems to be a fairly low number.

Is this a configuration problem in my cluster? A possible bug in Kubernetes/Go? Anything else that I can try?
