weixin_39862985 2020-11-22 01:37

intermittent container networking errors when backed by containerd

E2E tests regularly run this basic container networking DNS test after building a cluster:


$ kubectl describe pod validate-dns-linux-4p75n -n default completed in 913.265905ms
 2020/02/21 16:22:35 
 Name:         validate-dns-linux-4p75n
 Namespace:    default
 Priority:     0
 Node:         k8s-agentpool1-13396981-vmss000000/10.240.0.34
 Start Time:   Fri, 21 Feb 2020 16:20:31 +0000
 Labels:       controller-uid=74e04685-aed4-4943-91d4-17eb49e6cd5d
               job-name=validate-dns-linux
 Annotations:  kubernetes.io/psp: privileged
 Status:       Running
 IP:           10.240.0.52
 IPs:
   IP:           10.240.0.52
 Controlled By:  Job/validate-dns-linux
 Containers:
   validate-bing-google:
     Container ID:  containerd://9ea0e6c78af111ff70224d4722d9ce6f0f8303e819bddffad3ebdfe3c73ac61d
     Image:         library/busybox
     Image ID:      docker.io/library/busybox:6915be4043561d64e0ab0f8f098dc2ac48e077fe23f488ac24b665166898115a
     Port:          <none>
     Host Port:     <none>
     Command:
       sh
       -c
       until nslookup www.bing.com || nslookup google.com; do echo waiting for DNS resolution; sleep 1; done;
     State:          Running
       Started:      Fri, 21 Feb 2020 16:20:35 +0000
     Ready:          True
     Restart Count:  0
     Environment:    <none>
     Mounts:
       /var/run/secrets/kubernetes.io/serviceaccount from default-token-rnh6k (ro)
 Conditions:
   Type              Status
   Initialized       True 
   Ready             True 
   ContainersReady   True 
   PodScheduled      True 
 Volumes:
   default-token-rnh6k:
     Type:        Secret (a volume populated by a Secret)
     SecretName:  default-token-rnh6k
     Optional:    false
 QoS Class:       BestEffort
 Node-Selectors:  beta.kubernetes.io/os=linux
 Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                  node.kubernetes.io/unreachable:NoExecute for 300s
 Events:
   Type    Reason     Age        From                                         Message
   ----    ------     ----       ----                                         -------
   Normal  Scheduled  <unknown>  default-scheduler                            Successfully assigned default/validate-dns-linux-4p75n to k8s-agentpool1-13396981-vmss000000
   Normal  Pulling    2m3s       kubelet, k8s-agentpool1-13396981-vmss000000  Pulling image "library/busybox"
   Normal  Pulled     2m         kubelet, k8s-agentpool1-13396981-vmss000000  Successfully pulled image "library/busybox"
   Normal  Created    2m         kubelet, k8s-agentpool1-13396981-vmss000000  Created container validate-bing-google
   Normal  Started    2m         kubelet, k8s-agentpool1-13396981-vmss000000  Started container validate-bing-google

On clusters running Azure-built containerd (node details below), the pod above intermittently fails to reach a terminal state with a zero exit code:


$ k get nodes -o json
 2020/02/21 16:14:45 NAME                                 STATUS   ROLES    AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
 k8s-agentpool1-13396981-vmss000000   Ready    <none>   46s   v1.18.0-alpha.5   10.240.0.34    <none>        Ubuntu 16.04.6 LTS   4.15.0-1069-azure   containerd://1.3.2+azure
 k8s-agentpool1-13396981-vmss000001   Ready    <none>   46s   v1.18.0-alpha.5   10.240.0.65    <none>        Ubuntu 16.04.6 LTS   4.15.0-1069-azure   containerd://1.3.2+azure
 k8s-master-13396981-0                Ready    <none>   46s   v1.18.0-alpha.5   10.255.255.5   <none>        Ubuntu 16.04.6 LTS   4.15.0-1069-azure   containerd://1.3.2+azure

The errors:


$ k logs validate-dns-linux-4p75n -c validate-bing-google -n default
;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

 ;; connection timed out; no servers could be reached

 waiting for DNS resolution
 ;; connection timed out; no servers could be reached

We wait up to 2 minutes before throwing an error in E2E.
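
The retry-then-give-up behavior described here can be sketched in plain shell. This is only an illustration, not the actual aks-engine E2E code: `wait_until` and `dns_probe` are hypothetical helper names, and the 120-second value simply mirrors the 2-minute limit mentioned above.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: retries COMMAND once per second until it succeeds
# or TIMEOUT_SECONDS have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout_s=$1
    shift
    deadline=$(( $(date +%s) + timeout_s ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timed out after ${timeout_s}s" >&2
            return 1
        fi
        echo "waiting for DNS resolution"
        sleep 1
    done
}

# Hypothetical probe mirroring the command the validate-dns-linux pod runs.
dns_probe() {
    nslookup www.bing.com || nslookup google.com
}

# Example (assumed 2-minute budget, as in the E2E description):
#   wait_until 120 dns_probe || echo "DNS validation failed"
```

Unlike the pod's own `until` loop, which retries forever, a bounded wait like this turns a DNS outage into an explicit failure after the deadline instead of leaving the job in `Running`.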

This question comes from the open-source project: Azure/aks-engine

10 answers

  • weixin_39862985 2020-11-22 01:37

    These errors are observed across Kubernetes versions and are not restricted to v1.18.0.
