weixin_39805180
2020-12-09 08:52

GPU plugin device can only be used by one pod

Hi, currently the GPU plugin exposes only one device instance, e.g. a single 'card0' with the device nodes /dev/dri/card0 and /dev/dri/renderD128, so only one pod can use it. But since a DRM device node can be accessed by any number of clients, limiting it to one pod is unnecessarily restrictive; the GPU could be utilized by many more pods.

I'm not sure what the best way to handle this is. One option is to pass the maximum number of pods allowed GPU access when the plugin starts; the plugin would then report that number of devices to kubelet, which could then serve that many pods. Ideas?
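
For illustration, a minimal sketch of that idea using the kubelet device plugin API types. The option name (-shared-dev-num) and the helper are hypothetical, not existing plugin code; the point is only that one physical card can be reported to kubelet as N schedulable device entries:

```go
// Sketch only: hypothetical option and helper, not the plugin's actual code.
// One physical GPU is advertised to kubelet as N schedulable device entries.
package main

import (
	"flag"
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// Hypothetical option: how many pods may share a single GPU.
var sharedDevNum = flag.Int("shared-dev-num", 1,
	"number of pods that can share one GPU device")

// makeDevices builds the device list advertised to kubelet for one card.
// Every entry is a copy of the same physical device: card0-0, card0-1, ...
func makeDevices(card string, n int) []*pluginapi.Device {
	devs := make([]*pluginapi.Device, 0, n)
	for i := 0; i < n; i++ {
		devs = append(devs, &pluginapi.Device{
			ID:     fmt.Sprintf("%s-%d", card, i),
			Health: pluginapi.Healthy,
		})
	}
	return devs
}

func main() {
	flag.Parse()
	devs := makeDevices("card0", *sharedDevNum)
	fmt.Printf("advertising %d device entries for card0\n", len(devs))
}
```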

This question comes from the open-source project: intel/intel-device-plugins-for-kubernetes


8 replies

  • weixin_39972567 5 months ago

    Sorry for the delay. I'm just back from holidays.

    Currently over-commitment of extended resources isn't supported by K8s. Once allocated, a device stops being available, and the scheduler simply can't schedule another pod requesting it on the same node (given there is one device on the node). Semantically it would make more sense if a user could request a fraction of a device, but that contradicts the spec for K8s extended resources.

    But we could advertise one physical device as many virtual GPUs, e.g. 10 of gpu.intel.com/i915-one-tenth. I think the exact number of such virtual devices per physical one could be configured with the plugin's command line options or node attributes.

  • weixin_39805180 5 months ago

    Well, a GPU device can serve many clients simultaneously, so restricting it to only one user doesn't match practical usage.

    I'm not sure what you mean by exposing a "virtual GPU": does the current k8s device plugin API have support for virtual device entries, or is that just a way to advertise the number of devices that can be used by pods?

  • weixin_39972567 5 months ago

    Well, a GPU device can serve many clients simultaneously, so restricting it to only one user doesn't match practical usage.

    Understandable. I'm not very knowledgeable about GPUs: is it possible to guarantee that a pod won't abuse a shared GPU and exhaust its resources (e.g. by allocating memory for a huge texture), so that other pods won't starve? And can a process access and read a GEM BO created by another process from inside a different pod? This is especially important in multi-tenant installations.

    I'm not sure what you mean by exposing a "virtual GPU": does the current k8s device plugin API have support for virtual device entries, or is that just a way to advertise the number of devices that can be used by pods?

    The latter. The same real device node (/dev/dri/card0) can be mounted into many containers in different pods as /dev/dri/card0, /dev/dri/card1, ..., /dev/dri/cardXX.

  • weixin_39805180 5 months ago

    Linux DRM has no memory control for the GPU right now; current Intel GPUs just use host memory, so allocation depends on the shmem filesystem. Usage from a pod is similar to usage by native GPU processes.

    BO sharing requires an authenticated client, which is normally handled between the display manager and the graphics library in applications, so there is no explicit way for one pod to access another pod's BOs.

    I think when you expose devices to kubelet, you need to provide a device list named e.g. card0, card1, card2, ..., but the device access paths, for the current single Intel GPU case, should all be /dev/dri/card0 and /dev/dri/renderD128.

    So I think as a first step we can add an option to the GPU plugin to set the maximum number of GPU instances for pods, which would report that number of device entries to kubelet.

  • weixin_39972567 5 months ago

    So I think as a first step we can add an option to the GPU plugin to set the maximum number of GPU instances for pods, which would report that number of device entries to kubelet.

    Agree. This might be useful for users running their own private clusters.

    Please disregard my last sentence in the previous comment. kubelet doesn't really care about the names of device nodes. It just needs different device IDs like card0-0, card0-1, ..., card0-X corresponding to the same content of DeviceSpec.HostPath and DeviceSpec.ContainerPath, so containers sharing the same GPU would all see it as /dev/dri/card0 and /dev/dri/renderD128.
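
    For illustration, a sketch of how the Allocate side could look with that scheme (the structure is illustrative, not the plugin's actual code): whichever virtual IDs kubelet picks, the response always points at the same host device nodes.

```go
// Sketch only, not the plugin's actual implementation: no matter which
// virtual device IDs (card0-0, card0-1, ...) kubelet allocates, every
// container is given the same real device nodes.
package plugin

import (
	"context"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type gpuPlugin struct{}

func (p *gpuPlugin) Allocate(ctx context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for range req.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Devices: []*pluginapi.DeviceSpec{
				// Same HostPath/ContainerPath for every virtual device ID.
				{HostPath: "/dev/dri/card0", ContainerPath: "/dev/dri/card0", Permissions: "rw"},
				{HostPath: "/dev/dri/renderD128", ContainerPath: "/dev/dri/renderD128", Permissions: "rw"},
			},
		})
	}
	return resp, nil
}
```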

  • weixin_39805180 5 months ago

    Just wondering: if a user assigns multiple intel.com/gpu resources in the pod YAML (see the sketch below), would that cause a problem? My guess is not, since the device path is the same, so the container runtime should still be able to apply it, right?

    And I'd like to know whether you'd like to implement the new option for this yourself, or would prefer to see a PR.
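
    For reference, requesting more than one unit of the advertised resource could look like the sketch below. The resource name and image are placeholders (intel.com/gpu is used only because it is the name mentioned above; the plugin may register a different name):

```yaml
# Illustrative only: one container requesting two units of the shared GPU resource.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-consumer
spec:
  containers:
  - name: workload
    image: example.com/gpu-workload:latest   # placeholder image
    resources:
      limits:
        intel.com/gpu: 2
```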

  • weixin_39972567 5 months ago

    Just wondering: if a user assigns multiple intel.com/gpu resources in the pod YAML, would that cause a problem? My guess is not, since the device path is the same, so the container runtime should still be able to apply it, right?

    Right, that's my expectation too.

    And I'd like to know whether you'd like to implement the new option for this yourself, or would prefer to see a PR.

    I'll implement it.

  • weixin_39805180 5 months ago

    Thanks. Looks good to me!

