DRA: attach devices to nodes #5007

pohly · 2024-12-19T07:00:53Z

Enhancement Description

One-line enhancement description: Some network- or fabric-attached devices need to be attached to a node before a pod using them can be scheduled.
Kubernetes Enhancement Proposal:
Discussion Link: DRA: structured parameters: network-attached resource kubernetes#124042 (comment)
Primary contact (assignee): @pohly, @KobayashiD27
Responsible SIGs:
/sig scheduling
/wg device-management
Enhancement target (which target corresponds to which milestone):
- Alpha release target: 1.33
- Beta release target: TBD
- Stable release target: TBD
Alpha
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.


pohly · 2024-12-19T07:02:26Z

/assign @KobayashiD27

As discussed in kubernetes/kubernetes#124042 (comment).

/sig scheduling
/wg device-management

k8s-ci-robot · 2024-12-19T07:02:28Z

@pohly: GitHub didn't allow me to assign the following users: KobayashiD27.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

KobayashiD27 · 2024-12-19T07:09:39Z

Thank you for creating the issue. I will post a draft KEP as soon as possible.

KobayashiD27 · 2024-12-19T09:53:19Z

@pohly

To facilitate the discussion on the KEP, we would like to share the design of the composable controller we are considering as a component utilizing the fabric-oriented scheduler function. By sharing this, we believe we can deepen the discussion on the optimal implementation of the scheduler function. Additionally, we would like to verify whether the controller design matches the DRA design.

Background

Our controller's philosophy is to utilize fabric devices efficiently. Therefore, we prefer to allocate devices that are directly connected to the node over fabric devices. The priority order is: node-local devices > attached fabric devices > pre-attached fabric devices.
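The preference order described here can be sketched as a simple sort key. This is only an illustration: the `Locality` enum, the `Candidate` type, and the assignment of each device to a class are hypothetical and not part of DRA or of the controller.

```python
from dataclasses import dataclass
from enum import IntEnum


class Locality(IntEnum):
    """Hypothetical preference classes; a lower value is preferred."""
    NODE_LOCAL = 0
    ATTACHED_FABRIC = 1
    PRE_ATTACHED_FABRIC = 2


@dataclass(frozen=True)
class Candidate:
    name: str
    locality: Locality


def rank(candidates):
    """Return candidates sorted from most to least preferred (stable sort)."""
    return sorted(candidates, key=lambda c: c.locality)


# Assumed classification, for illustration only.
candidates = [
    Candidate("device1", Locality.PRE_ATTACHED_FABRIC),
    Candidate("device3", Locality.NODE_LOCAL),
    Candidate("device2", Locality.ATTACHED_FABRIC),
]
ranked = [c.name for c in rank(candidates)]
print(ranked)  # ['device3', 'device2', 'device1']
```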

Design Overview

This design aims to efficiently utilize fabric devices, prioritizing node-local devices to improve performance. The composable controller manages fabric devices that can be attached and detached. Therefore, it publishes a list of fabric devices as ResourceSlices.

The structure we are considering is as follows:

# The composable controller publishes this pool
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeSelector: fabric1
devices:
  - name: device1
  ...
  - name: device2
  ...

The vendor's DRA kubelet plugin will also publish the devices managed by the vendor as ResourceSlices.

# The vendor's DRA kubelet plugin publishes this pool
kind: ResourceSlice
pool: Node1
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device3
  ...

Here, when the scheduler selects the fabric device device1, it waits during PreBind for the device to be attached. The composable controller checks a flag on the ResourceClaim and performs the attachment when that flag requests it. After a successful attachment, the composable controller updates the flag on the ResourceClaim.
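As a rough illustration of this handshake, the following in-memory simulation models the flag exchange. It assumes that the scheduler sets the flag to request attachment; the flag values, the `Claim` type, and the `ComposableController` class are hypothetical, and the controller is called synchronously here, whereas in a cluster it would run asynchronously.

```python
from dataclasses import dataclass

# Hypothetical flag values; the proposal only refers to "the flag of the ResourceClaim".
ATTACH_REQUESTED = "AttachRequested"
ATTACHED = "Attached"


@dataclass
class Claim:
    device: str
    node: str
    flag: str = ""


class ComposableController:
    def __init__(self):
        self.attached = {}  # device name -> node name

    def reconcile(self, claim):
        """Attach the device if the claim requests it, then update the flag."""
        if claim.flag == ATTACH_REQUESTED:
            self.attached[claim.device] = claim.node  # stands in for the fabric call
            claim.flag = ATTACHED


def pre_bind(claim, controller, max_polls=3):
    """Request attachment and poll until the controller reports success."""
    claim.flag = ATTACH_REQUESTED
    for _ in range(max_polls):
        if claim.flag == ATTACHED:
            return True
        controller.reconcile(claim)  # simulated; asynchronous in a real cluster
    return claim.flag == ATTACHED


controller = ComposableController()
claim = Claim(device="device1", node="Node1")
bound = pre_bind(claim, controller)
print(bound, claim.flag, controller.attached)
```

Bounding the number of polls stands in for a PreBind timeout: if the flag never changes, `pre_bind` returns `False` and the pod would not be bound.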

We are considering the following two methods for handling ResourceSlices once the attachment is complete. We would like to hear your opinion on these two composable controller proposals and on their feasibility.

Proposal 1: The composable controller publishes ResourceSlices with NodeName set within the pool

Multiple ResourceSlices are published with the same pool name. One indicates the devices included in the fabric, and the other indicates the devices attached to the node.

# The composable controller publishes this pool
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeSelector: fabric1
devices:
  - name: device2
  ...
---
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device1
  ...

If the vendor's plugin responds to hotplug, device1 will appear in the ResourceSlice published by the vendor.

# The vendor's DRA kubelet plugin publishes this pool
kind: ResourceSlice
pool: Node1
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device3
  ...
  - name: device1
  ...

This may cause device duplication issues between ResourceSlices. To prevent multiple ResourceSlices from publishing duplicate devices, we plan to define a deny list and standardize it with DRA.
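A minimal sketch of the duplication check and of one possible deny-list filter follows. The thread does not say which component would apply the deny list; this example assumes the vendor plugin filters its own slice. The dictionary layout and both helper functions are hypothetical simplifications of a ResourceSlice.

```python
def find_duplicates(slices):
    """Return device names that are published in more than one pool."""
    seen = {}
    for s in slices:
        for dev in s["devices"]:
            seen.setdefault(dev, set()).add(s["pool"])
    return {dev: sorted(pools) for dev, pools in seen.items() if len(pools) > 1}


def apply_deny_list(slice_, deny):
    """Return a copy of the slice without the denied devices."""
    return {**slice_, "devices": [d for d in slice_["devices"] if d not in deny]}


# Simplified stand-ins for the two slices that both list device1 on Node1.
composable = {"pool": "composable-device", "nodeName": "Node1", "devices": ["device1"]}
vendor = {"pool": "Node1", "nodeName": "Node1", "devices": ["device3", "device1"]}

duplicates = find_duplicates([composable, vendor])
print(duplicates)  # {'device1': ['Node1', 'composable-device']}

filtered_vendor = apply_deny_list(vendor, deny=set(duplicates))
print(filtered_vendor["devices"])  # ['device3']
```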

Advantages

- Neither the scheduler nor the composable controller needs to change the AllocationResult.
- Attached fabric devices can be distinguished, so prioritization is maintained.

Disadvantages

- ResourceSlices created by the composable controller may not be understood by the vendor's kubelet plugin. (NVIDIA drivers use internal information, so cooperation is needed.)
- Attached and unattached fabric devices are mixed in one pool. (DRA: structured parameters: network-attached resource kubernetes#124042 (comment))
- A mechanism to prevent device duplication is needed (e.g., a deny list).

Proposal 2: Attached devices are published by the vendor's plugin

In this case, devices are removed from the composable-device pool.

# The composable controller publishes this pool
kind: ResourceSlice
pool: composable-device
driver: gpu.nvidia.com
nodeSelector: fabric1
devices:
  - name: device2
  ...

If the vendor's plugin responds to hotplug, device1 will appear in the ResourceSlice published by the vendor.

# The vendor's DRA kubelet plugin publishes this pool
kind: ResourceSlice
pool: Node1
driver: gpu.nvidia.com
nodeName: Node1
devices:
  - name: device3
  ...
  - name: device1
  ...

This breaks the linkage between ResourceClaim and ResourceSlice. Therefore, it is necessary to modify the AllocationResult of the ResourceClaim.
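The kind of rewrite this would require can be sketched as follows. The `AllocatedDevice` type is a hypothetical simplification of an allocation result entry that identifies a device by driver, pool, and device name, and the sketch assumes the device keeps the same name when it moves to the vendor's pool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AllocatedDevice:
    """Simplified, hypothetical stand-in for one allocation result entry."""
    driver: str
    pool: str
    device: str


def relink(result, old_pool, new_pool):
    """Point entries from old_pool at the same device name in new_pool."""
    return [replace(d, pool=new_pool) if d.pool == old_pool else d for d in result]


# device1 was allocated from the fabric pool, then attached to Node1.
allocation = [AllocatedDevice("gpu.nvidia.com", "composable-device", "device1")]
relinked = relink(allocation, old_pool="composable-device", new_pool="Node1")
print(relinked[0])
```

Until such a rewrite happens, the claim still refers to `composable-device/device1`, while the device is visible as `Node1/device1`.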

Advantages

- Simplifies device management.
- Centralizes management, as the vendor's plugin publishes devices directly.
- No mechanism to prevent device duplication (e.g., a deny list) is needed.

Disadvantages

- Attached fabric devices cannot be distinguished, which makes prioritization difficult.
- The linkage between ResourceClaim and ResourceSlice must be modified. We expect the scheduler or the DRA controller to do this; which is more appropriate?
- Until the linkage is fixed, the device in use may be published in a ResourceSlice and reserved by other Pods.

We would appreciate your feedback and insights on these proposals to ensure the optimal implementation of the scheduler function and alignment with the DRA design.

pohly · 2024-12-19T11:13:16Z

Let's keep the discussion in this issue shorter. You can now put all of this, including the alternatives, into the KEP document.

k8s-ci-robot added the needs-sig label Dec 19, 2024

k8s-ci-robot added the sig/scheduling and wg/device-management labels and removed the needs-sig label Dec 19, 2024

github-project-automation bot added this to SIG Node: Dynamic Resource Allocation and SIG Scheduling Dec 19, 2024

github-project-automation bot moved this to 🆕 New in SIG Node: Dynamic Resource Allocation Dec 19, 2024

github-project-automation bot moved this to Needs Triage in SIG Scheduling Dec 19, 2024

pohly mentioned this issue Dec 19, 2024

DRA: structured parameters: network-attached resource kubernetes/kubernetes#124042

Open

KobayashiD27 mentioned this issue Dec 20, 2024

KEP-5007: DRA Device Attach Before Pod Scheduled #5012

Open

pohly moved this from 🆕 New to 🏗 In progress in SIG Node: Dynamic Resource Allocation Dec 20, 2024

pohly moved this from 🏗 In progress to 🆕 New in SIG Node: Dynamic Resource Allocation Dec 20, 2024
