weixin_39914732
2020-12-02 14:06

CPU Usage slowly increasing

Summary

In our deployment, the csi-driver's CPU usage slowly increases over multiple days.

hcloud-csi-cpu-leak

I killed the pod on 09/17 16:00, causing the usage to "reset".

Access Pattern

  • We have a total of 5 persistent volumes created in the cluster
  • 1 is unbound (no pvc)
  • 2 are associated with prometheus and rarely need to be reassigned
  • 2 are associated with cronjobs that run on a 5 minute schedule, requiring the volumes to be attached+mounted every 5 minutes (I suspect this might leak some watches/internal timers that cause the growing CPU usage)

➜  ~ kubectl get pvc --all-namespaces
NAMESPACE                         NAME                                                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS     AGE
cattle-prometheus                 prometheus-cluster-monitoring-db-prometheus-cluster-monitoring-0     Bound    pvc-5e4ed3ef-c26d-11e9-9e15-9600002e0c70   10Gi       RWO            hcloud-volumes   35d
platform-monitoring               prometheus-platform-monitoring-db-prometheus-platform-monitoring-0   Bound    pvc-c01b6132-d8b6-11e9-9ef9-9600002e0c6f   10Gi       RWO            hcloud-volumes   6d17h
REDACTED                          REDACTED                                                             Bound    pvc-ff65820a-c1da-11e9-981f-9600002e0ca4   10Gi       RWO            hcloud-volumes   35d
REDACTED                          REDACTED-2                                                           Bound    pvc-d47da832-bda7-11e9-912b-960000172088   10Gi       RWO            hcloud-volumes   41d
➜  ~ kubectl get pv --all-namespaces 
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                                                                    STORAGECLASS     REASON   AGE
pvc-49de8eef-adfe-11e9-912b-960000172088   10Gi       RWO            Retain           Released   cattle-prometheus/prometheus-cluster-monitoring-db-prometheus-cluster-monitoring-0       hcloud-volumes            61d
pvc-5e4ed3ef-c26d-11e9-9e15-9600002e0c70   10Gi       RWO            Delete           Bound      cattle-prometheus/prometheus-cluster-monitoring-db-prometheus-cluster-monitoring-0       hcloud-volumes            35d
pvc-c01b6132-d8b6-11e9-9ef9-9600002e0c6f   10Gi       RWO            Delete           Bound      platform-monitoring/prometheus-platform-monitoring-db-prometheus-platform-monitoring-0   hcloud-volumes            6d17h
pvc-d47da832-bda7-11e9-912b-960000172088   10Gi       RWO            Delete           Bound      REDACTED/REDACTED-2                                                                      hcloud-volumes            41d
pvc-ff65820a-c1da-11e9-981f-9600002e0ca4   10Gi       RWO            Delete           Bound      REDACTED/REDACTED                                                                        hcloud-volumes            35d

Details

  • CSI-Driver Version: 1.1.5
  • K8s Version: 1.14.5

➜  ~ kubectl version  
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:53:57Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.5", GitCommit:"0e9fcb426b100a2aea5ed5c25b3d8cfbb01a8acf", GitTreeState:"clean", BuildDate:"2019-08-05T09:13:08Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Logs from the hcloud-csi-driver container directly after a restart


level=debug ts=2019-09-23T12:34:59.042539352Z msg="getting instance id from metadata service"
level=debug ts=2019-09-23T12:34:59.048672227Z msg="fetching server"
level=info ts=2019-09-23T12:34:59.386696775Z msg="fetched server" server-name=compute4
level=debug ts=2019-09-23T12:34:59.477231815Z component=grpc-server msg="handling request" req=
level=debug ts=2019-09-23T12:34:59.477353825Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:34:59.480500769Z component=grpc-server msg="handling request" req=
level=debug ts=2019-09-23T12:34:59.480590227Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:34:59.484366757Z component=grpc-server msg="handling request" req=
level=debug ts=2019-09-23T12:34:59.484420039Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:34:59.488958315Z component=grpc-server msg="handling request" req=
level=debug ts=2019-09-23T12:34:59.489026328Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:35:00.301813104Z component=grpc-server msg="handling request" req=
level=debug ts=2019-09-23T12:35:00.301896393Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:35:00.31035102Z component=grpc-server msg="handling request" req=
level=debug ts=2019-09-23T12:35:00.310428105Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:35:00.314012584Z component=grpc-server msg="handling request" req=
level=debug ts=2019-09-23T12:35:00.314080245Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:35:00.318897099Z component=grpc-server msg="handling request" req=
level=debug ts=2019-09-23T12:35:00.318935292Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:35:08.986510143Z component=grpc-server msg="handling request" req="volume_id:\"3098265\" node_id:\"3142363\" volume_capability:<mount:> access_mode:<single_node_writer> > volume_context:<key: value:> "
level=info ts=2019-09-23T12:35:08.98664899Z component=api-volume-service msg="attaching volume" volume-id=3098265 server-id=3142363
level=debug ts=2019-09-23T12:35:12.282519058Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:35:35.058390576Z component=grpc-server msg="handling request" req="volume_id:\"3098265\" node_id:\"3142363\" "
level=info ts=2019-09-23T12:35:35.058546437Z component=api-volume-service msg="detaching volume from server" volume-id=3098265 server-id=3142363
level=debug ts=2019-09-23T12:35:37.349104517Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:40:01.540748703Z component=grpc-server msg="handling request" req="volume_id:\"3098265\" node_id:\"3142354\" volume_capability:<mount:> access_mode:<single_node_writer> > volume_context:<key: value:> "
level=info ts=2019-09-23T12:40:01.540829349Z component=api-volume-service msg="attaching volume" volume-id=3098265 server-id=3142354
level=debug ts=2019-09-23T12:40:07.529582765Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:40:07.547330301Z component=grpc-server msg="handling request" req="volume_id:\"3098265\" node_id:\"3142354\" volume_capability:<mount:> access_mode:<single_node_writer> > volume_context:<key: value:> "
level=info ts=2019-09-23T12:40:07.547552922Z component=api-volume-service msg="attaching volume" volume-id=3098265 server-id=3142354
level=info ts=2019-09-23T12:40:08.1849406Z component=api-volume-service msg="failed to attach volume" volume-id=3098265 server-id=3142354 err="volume is already attached to a server, please detach it first (service_error)"
level=debug ts=2019-09-23T12:40:08.225571172Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:40:27.167239085Z component=grpc-server msg="handling request" req="volume_id:\"3098265\" node_id:\"3142354\" "
level=info ts=2019-09-23T12:40:27.167581839Z component=api-volume-service msg="detaching volume from server" volume-id=3098265 server-id=3142354
level=debug ts=2019-09-23T12:40:29.483828719Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:45:03.489533212Z component=grpc-server msg="handling request" req="volume_id:\"3098265\" node_id:\"3142352\" volume_capability:<mount:> access_mode:<single_node_writer> > volume_context:<key: value:> "
level=info ts=2019-09-23T12:45:03.489663779Z component=api-volume-service msg="attaching volume" volume-id=3098265 server-id=3142352
level=debug ts=2019-09-23T12:45:07.818956291Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:45:34.322893995Z component=grpc-server msg="handling request" req="volume_id:\"3098265\" node_id:\"3142352\" "
level=info ts=2019-09-23T12:45:34.322972245Z component=api-volume-service msg="detaching volume from server" volume-id=3098265 server-id=3142352
level=debug ts=2019-09-23T12:45:36.618721636Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:50:05.781757496Z component=grpc-server msg="handling request" req="volume_id:\"3098265\" node_id:\"3142350\" volume_capability:<mount:> access_mode:<single_node_writer> > volume_context:<key: value:> "
level=info ts=2019-09-23T12:50:05.781999032Z component=api-volume-service msg="attaching volume" volume-id=3098265 server-id=3142350
level=debug ts=2019-09-23T12:50:08.793330099Z component=grpc-server msg="finished handling request"
level=debug ts=2019-09-23T12:50:32.28802285Z component=grpc-server msg="handling request" req="volume_id:\"3098265\" node_id:\"3142350\" "
level=info ts=2019-09-23T12:50:32.288189703Z component=api-volume-service msg="detaching volume from server" volume-id=3098265 server-id=3142350
level=debug ts=2019-09-23T12:50:35.147710534Z component=grpc-server msg="finished handling request"
CPU Usage over same timeframe

_Note_: CPU Usage is already slowly increasing ![grafana-csi-cpu-usage_000](https://user-images.githubusercontent.com/3393079/65427362-724df500-de12-11e9-94b9-87e730cafd0a.png)

Source: open-source project hetznercloud/csi-driver


12 replies

  • weixin_39914732 · 5 months ago

    We deploy our Kubernetes Cluster using Rancher with the docker-machine-driver-hetzner and ui-driver-hetzner.

    Rancher: v2.3.2 · Kubernetes: v1.16.3 · Docker: v18.9.9 · Server Type: cx21 · Server Location: nbg1-dc3 (today's test), fsn1-dc14 (prod)

    gRPC Responses are 50/50 OK/Unavailable (v1.2.2):

    hcloud-csi-cpu-response-code

  • weixin_39871562 · 5 months ago

    Okay, thank you. We will try to reproduce this. My first guess is that this is normal behavior.

    Attaching/detaching volumes is fairly CPU intensive for the controller. On our test cluster (with a CCX31) we saw the controller using ~0.5% CPU. A CCX31 has 8 (dedicated) cores.

    As a general recommendation, we suggest using CCX server types for k8s workloads.

  • weixin_39871562 · 5 months ago

    I have monitored this on some cx21 servers for a month now and cannot see any abnormal behavior. The controller needs some resources because it issues many attach and detach requests (and watches their results). Therefore I will close this issue.

  • weixin_39914732 · 5 months ago

    I re-ran the test with the new versions and the problem still persists for me :frowning_face:.

    hcloud-csi-cpu-usage

  • weixin_39871562 · 5 months ago

    Okay, could you give us some more information about your setup?

    Which k8s version? Which server types (and location)?

    gRPC related: which response codes are returned? (https://grafana.com/grafana/dashboards/9186 → graph "gRPC request code")

  • weixin_39871562 · 5 months ago

    Bildschirmfoto 2019-12-10 um 08 10 30

    I guess we have found the issue 🎉 It was a mix of returning the wrong error codes and the k8s retry mechanism running out of control...

    Legend:
    1. Before our change
    2. Removal of the "Aborted" error code
    3. Addition of a dedicated error code for "volume already attached"

    I will prepare the PR shortly. Thank you for your help!
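
    For readers following along, a minimal sketch of what such an error-code change could look like in Go, assuming gRPC's status and codes packages; the package, the helper name, and the exact code chosen (FailedPrecondition) are illustrative assumptions, not necessarily what the actual PR does:

    package driver

    import (
        "errors"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // errVolumeAlreadyAttached stands in for the Hetzner Cloud API error
    // "volume is already attached to a server, please detach it first".
    var errVolumeAlreadyAttached = errors.New("volume is already attached to a server")

    // attachError maps an attach failure to a gRPC status for
    // ControllerPublishVolume. Returning FAILED_PRECONDITION instead of
    // ABORTED tells the CO that the volume is published elsewhere and must
    // be detached first, rather than triggering an aggressive retry loop.
    func attachError(err error) error {
        if errors.Is(err, errVolumeAlreadyAttached) {
            return status.Error(codes.FailedPrecondition,
                "volume is already attached to another server, detach it first")
        }
        return status.Error(codes.Internal, err.Error())
    }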

    Could you try this specific container image: lkdevelopment/csi-driver:1.2.2? It contains the fix (the image will be removed after the release).

    cc

  • weixin_39914732 · 5 months ago

    I was able to quickly reproduce this issue in a fresh cluster using 5 cronjobs on a 1-minute schedule.

    grafana-csi-cpu-usage-2_000

    Manifest for the cronjobs (YAML):
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-vol-1
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      volumeMode: Filesystem
      resources:
        requests:
          storage: 10Gi
    
    ---
    
    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: test-job-1
      namespace: default
    spec:
      schedule: "*/1 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: test
                image: busybox
                command: ["echo", "Hello"]
                volumeMounts:
                  - mountPath: /test
                    name: test
              volumes:
                - name: test
                  persistentVolumeClaim:
                    claimName: test-vol-1
    
    ---
    
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-vol-2
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      volumeMode: Filesystem
      resources:
        requests:
          storage: 10Gi
    
    ---
    
    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: test-job-2
      namespace: default
    spec:
      schedule: "*/1 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: test
                image: busybox
                command: ["echo", "Hello"]
                volumeMounts:
                  - mountPath: /test
                    name: test
              volumes:
                - name: test
                  persistentVolumeClaim:
                    claimName: test-vol-2
    
    ---
    
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-vol-3
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      volumeMode: Filesystem
      resources:
        requests:
          storage: 10Gi
    
    ---
    
    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: test-job-3
      namespace: default
    spec:
      schedule: "*/1 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: test
                image: busybox
                command: ["echo", "Hello"]
                volumeMounts:
                  - mountPath: /test
                    name: test
              volumes:
                - name: test
                  persistentVolumeClaim:
                    claimName: test-vol-3
    
    
    ---
    
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-vol-4
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      volumeMode: Filesystem
      resources:
        requests:
          storage: 10Gi
    
    ---
    
    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: test-job-4
      namespace: default
    spec:
      schedule: "*/1 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: test
                image: busybox
                command: ["echo", "Hello"]
                volumeMounts:
                  - mountPath: /test
                    name: test
              volumes:
                - name: test
                  persistentVolumeClaim:
                    claimName: test-vol-4
    
    ---
    
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-vol-5
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      volumeMode: Filesystem
      resources:
        requests:
          storage: 10Gi
    
    ---
    
    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: test-job-5
      namespace: default
    spec:
      schedule: "*/1 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
              - name: test
                image: busybox
                command: ["echo", "Hello"]
                volumeMounts:
                  - mountPath: /test
                    name: test
              volumes:
                - name: test
                  persistentVolumeClaim:
                    claimName: test-vol-5
    
  • weixin_39778400 · 5 months ago

    Maybe it would make sense to export some Go runtime metrics via Prometheus to help debug this.
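
    For illustration, a minimal sketch with github.com/prometheus/client_golang; the standalone main and the listen address are assumptions for the example, not the driver's actual wiring:

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        // Register the Go runtime collector (goroutines, GC, memstats) and the
        // process collector (CPU seconds, open file descriptors).
        reg := prometheus.NewRegistry()
        reg.MustRegister(
            prometheus.NewGoCollector(),
            prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{}),
        )

        // Expose the metrics for Prometheus to scrape.
        http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
        log.Fatal(http.ListenAndServe(":9189", nil))
    }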

  • weixin_39871562 · 5 months ago

    Could you try it with version 1.2.0 again?

    We have had the same deployment running for a few days in one of our test clusters and cannot see CPU usage like in your screenshots.

  • weixin_39914732 · 5 months ago

    I will do that once I have some free time.

  • weixin_39914732 · 5 months ago

    I can still reproduce the issue (CSI Driver v1.2.1, K8s v1.16.3).

    I set up dashboards with data from #67 today, so perhaps I will have some results tomorrow.

  • weixin_39914732 · 5 months ago

    I could not detect any noticeable changes in the metrics that correlate with the increased CPU usage.

    For reference, I used the Go gRPC and Go Runtime dashboards, as well as some self-built charts, to visualize the metrics.

