Topological alignment between GPUs and NICs in DRA (exposing pci device topology as device attribute?) #213
Comments
I'm open to suggestions on what these attributes would look like and how they would be used, but as I mentioned in my comment here #214 (comment), I've struggled to come up with something that would actually be useful.
Thanks. How about this? If this driver provided such a knob, users would be able to publish their own extra attributes for their needs.
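(Editorial illustration, not part of the original comment: the NVIDIA/k8s-dra-driver does not currently offer such a knob, and the file layout, field names, and attribute keys below are all hypothetical. A minimal sketch of what a user-facing "extra attributes" config could look like:)

```yaml
# Hypothetical vendor-extension config, NOT an existing NVIDIA/k8s-dra-driver feature.
# The idea: the driver merges these key/value pairs into the attributes it already
# publishes for each matching device in its ResourceSlices.
extraDeviceAttributes:
- deviceSelector: "gpu-*"            # hypothetical glob over device names
  attributes:
    example.com/pcieRoot: "pci0000:16"   # illustrative PCIe root complex ID
    example.com/rack: "rack-42"           # illustrative user-defined attribute
```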
For this use case, I found the presentation exactly matched this case. So, if both NVIDIA/k8s-dra-driver and kubernetes-sigs/cni-dra-driver exposed a common PCIe-topology attribute (e.g. `k8s.io/pcieRoot`), a user could request a topologically aligned GPU and NIC like this:

```yaml
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: big-gpu-with-aligned-nic
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          expression: "device.capacity['memory'].compareTo(quantity('80Gi')) >= 0"
    - name: nic
      deviceClassName: rdma.nvidia.com
      selectors:
      - cel:
          expression: "device.attributes['sriovType'] == 'vf'"
    constraints:
    - requestNames: ["gpu", "nic"]
      matchAttribute: k8s.io/pcieRoot
```

Thus, I would like to know if NVIDIA/k8s-dra-driver plans to expose such a PCIe-topology attribute. Thanks in advance.
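(Editorial sketch, not part of the original comment: for context on where such an attribute would live, the driver would publish it per device in its ResourceSlices. The manifest below uses the `resource.k8s.io/v1beta1` ResourceSlice field names as I understand them; the `pcieRoot` attribute and its value are assumptions, not something the driver publishes today.)

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-a-gpu.nvidia.com-0
spec:
  nodeName: node-a
  driver: gpu.nvidia.com
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        pcieRoot:                  # assumed attribute; not published today
          string: "pci0000:16"     # illustrative PCIe root complex ID
```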
Because #214 clearly describes this case, I rephrased this issue title for isolated discussion.
Unfortunately, we can't include this until we start to standardize the set of attributes we put under the `k8s.io` prefix.
Thanks for the quick reply. OK, then, let me keep this open for now.
I understand DRA will finally promote to Beta in v1.32 🎉 Thank you very much, contributors, for your hard work standardizing flexible device scheduling and implementing NVIDIA's dra-driver.

Do you have a plan to expose intra-node topology as device attributes? Especially distances between GPU<->GPU and GPU<->NIC or HCA (I imagine `nvidia-smi topo -m` equivalent information)? Or would you have a plan to provide some extension point to add user-defined device attributes in this dra-driver?

I imagine the use cases below for optimizing training performance:

- Single node, multi GPU: a user wants to have 1 pod with 2 GPUs which are connected to each other via NVLink (NV# in `nvidia-smi topo -m`) → discussed in NVLINK Aware Scheduling #214
- Single node, GPU + NIC: a user wants to have 1 pod with a GPU and a NIC which are closely connected over PCIe (PIX in `nvidia-smi topo -m`) in a specific zone (achieved by node selector) → see the pod-side sketch after this comment

Thanks in advance.
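(Editorial sketch, not part of the original issue: a minimal pod-side wiring for the second use case, assuming the `big-gpu-with-aligned-nic` ResourceClaim from the comment above exists and nodes carry the standard `topology.kubernetes.io/zone` label. The zone value and image name are placeholders.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  nodeSelector:
    topology.kubernetes.io/zone: zone-a         # pin the pod to a specific zone
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      claims:
      - name: gpu-and-nic                       # refers to the entry below
  resourceClaims:
  - name: gpu-and-nic
    resourceClaimName: big-gpu-with-aligned-nic # the ResourceClaim from the comment above
```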