Kubernetes cluster with completed jobs becomes unstable; kubelet logs filled with "http2: no cached connection was available"

Summary

I have various single-node Kubernetes clusters which become unstable after having accumulated ~300 completed jobs.

In one cluster, for example, there are 303 completed jobs:

root@xxxx:/home/xxxx# kubectl get jobs | wc -l
303

Observations

What I observe is that:

  • The kubelet logs are filled with error messages like this (see the diagnostic sketch after this list for a quick way to quantify them):

    kubelet[877]: E0219 09:06:14.637045 877 reflector.go:134] object-"default"/"job-162273560": Failed to list *v1.ConfigMap: Get https://172.13.13.13:6443/api/v1/namespaces/default/configmaps?fieldSelector=metadata.name%3Djob-162273560&limit=500&resourceVersion=0: http2: no cached connection was available

  • The node status is not being updated, with a similar error message:

    kubelet[877]: E0219 09:32:57.379751 877 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:451: Failed to list *v1.Node: Get https://172.13.13.13:6443/api/v1/nodes?fieldSelector=metadata.name%3Dxxxxx&limit=500&resourceVersion=0: http2: no cached connection was available

  • Eventually, the node is marked NotReady and no new pods are scheduled:

    NAME    STATUS     ROLES    AGE    VERSION
    xxxxx   NotReady   master   6d4h   v1.12.1

  • The cluster keeps entering and exiting master disruption mode (from the kube-controller-manager logs):

    I0219 09:29:46.875397 1 node_lifecycle_controller.go:1015] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
    I0219 09:30:16.877715 1 node_lifecycle_controller.go:1042] Controller detected that some Nodes are Ready. Exiting master disruption mode.
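A quick way to quantify these symptoms (a diagnostic sketch, assuming a systemd-managed kubelet, as the kubelet[877] journal prefix suggests, and using the API server address 172.13.13.13:6443 from the log lines above):

# Count how many times the kubelet has logged the http2 error.
journalctl -u kubelet | grep -c "no cached connection was available"

# Count the kubelet's established TCP connections to the API server.
ss -tnp | grep "172.13.13.13:6443" | grep -c kubelet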

The real culprit appears to be the http2: no cached connection was available error message. The only real references I've found are a couple of issues in the Go repository (like #16582), which appear to have been fixed a long time ago.

In most cases, deleting the completed jobs seems to restore the cluster's stability.
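For reference, one way to do that cleanup (a sketch only; it assumes everything lives in the default namespace and that, as in the repro below, each job's ConfigMap shares the job's name):

# Bulk-delete jobs that have completed successfully, plus their matching ConfigMaps.
for job in $(kubectl get jobs -o jsonpath='{.items[?(@.status.succeeded==1)].metadata.name}'); do
    kubectl delete job "$job"
    kubectl delete configmap "$job" --ignore-not-found
done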

Minimal repro (tbc)

I seem to be able to reproduce this problem by creating lots of Jobs whose containers mount ConfigMaps:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: job-%JOB_ID%
data:
  # Just some sample data
  game.properties: |
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-%JOB_ID%
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(20)"]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
      volumes:
        - name: config-volume
          configMap:
            name: job-%JOB_ID%
      restartPolicy: Never
  backoffLimit: 4

Schedule lots of these jobs:

#!/bin/bash
# Create 300 jobs (and ConfigMaps) from the template above, one per job ID.
for i in $(seq 100 399); do
    sed "s/%JOB_ID%/$i/g" job.yaml | kubectl create -f -
    sleep 0.1
done
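While that loop runs, I follow the kubelet journal until the error first shows up (a minimal sketch, again assuming a systemd-managed kubelet):

# Stream kubelet logs, surfacing only the http2 connection errors;
# in my clusters they start appearing once roughly 300 jobs have completed.
journalctl -fu kubelet | grep --line-buffered "no cached connection was available"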

Questions

I'm still very curious as to what causes this problem, though, since ~300 completed jobs seems like a fairly low number.

Is this a configuration problem in my cluster? A possible bug in Kubernetes or Go? Is there anything else I can try?

douzuanze0486: This is described in more detail in github.com/kubernetes/kubernetes/issues/74302 and github.com/kubernetes/kubernetes/issues/74412.
(over a year ago)

doulin4844: Oddly enough, I can reproduce this consistently in a Vagrant environment. I've uploaded the scripts I use to provision the environment to github.com/qmfrederik/k8s-job-repro. It uses a custom Ansible role to configure the VM, but it's a very standard Kubernetes setup. I've noticed that the problem does not reproduce on Kubernetes v1.11, but it does on v1.12 and v1.13.
(over a year ago)

douge3830: I followed your steps and don't get any errors in any scenario. I used minikube v1.10, Compute Engine v1.13, and CE v1.12.1, all as single-node clusters (the master is the node). The only problem I hit in minikube was "error from server: error when creating 'STDIN': etcdserver: request timed out", but I could easily get through 400-500 jobs. So something else is going on. Can you provide more details: what else is running in the cluster? How and where did you deploy it? Do you have any tolerations or requests/limits? Please also provide the output of kubectl describe node master_name.
(over a year ago)