feat(gpu): introduce GPU request spec#1156
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
8b95e09 to
3d17c4e
Compare
3d17c4e to
930c581
Compare
|
🌿 Preview your docs: https://nvidia-preview-pr-1156.docs.buildwithfern.com/openshell |
07b171d to
dd18d21
Compare
9ff168a to
cfceac9
Compare
|
Label |
ff9e99d to
76af3a3
Compare
7f09b83 to
3c6a501
Compare
| message GpuRequestSpec { | ||
| // Optional driver-native device identifiers. Empty means the driver chooses | ||
| // its default GPU assignment behavior. | ||
| repeated string device_id = 1; |
There was a problem hiding this comment.
From the PR description
Replace legacy GPU-specific request fields in the public and driver protos with ResourceRequirements carrying a GPUSpec. This also adds GPU count requests while preserving the existing default GPU request shape used by --gpu and image-name auto-detection.
Where do we support counts in this PR?
Another thought, with the device_config convention we're trying to establish, does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.
There was a problem hiding this comment.
Where do we support counts in this PR?
Looking locally, I think I had tested a version of this branch on a machine with GPUs and then pushed these changes without first pulling the commit including counts or forgot to push the local changes).
does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.
Using driver_config for GPU ids is an interesting idea and may make sense since the actual IDs available depend on the driver implementation (e.g. index vs BDF vs CDI device name).
This probably means that we're looking at the following shape:
driver_config:
docker:
device_ids:
- nvidia.com/gpu=0
podman:
device_ids:
- nvidia.com/gpu=0
vm:
device_ids:
- 0
This would select the specified devices when --gpu-count=1 (assuming that we extend the docker and podman drivers to support this). We would also remove / deprecate the --gpu-device command line flag in this case and have --gpu imply --gpu-count=1. Note that the device_ids would ONLY be processed if a count is actually specified so that we don't allow driver-config to override sandbox-specific resources.
Let's me use this as a concrete driver for an implementaion of the proposal in #1589.
There was a problem hiding this comment.
See #1716 as a draft POC for what this would look like. I still need to make another pass or two though.
7bdaa73 to
23bdbb1
Compare
|
Compatibility note: this revision intentionally keeps the proto compact while the API is still pre-alpha. It reuses the old direct GPU field position for |
|
@drew I readded the counts (from local history), and then went a step further and added the This is currently a separate commit, so I can roll it back if we're more comfortable with the initial change. |
| gpu, | ||
| gpu_device.as_deref(), | ||
| gpu_count, |
There was a problem hiding this comment.
As a follow-up: Does it make sense to introduce a type for gpu and other resources going forward?
| gpu: bool, | ||
| gpu_device: Option<&str>, | ||
| gpu_count: Option<u32>, | ||
| cpu: Option<&str>, | ||
| memory: Option<&str>, |
There was a problem hiding this comment.
For follow-up: Let's add a struct that handles these resources together instead of adding new arguments.
| pub fn cdi_gpu_device_ids(gpu: Option<&DriverGpuResourceRequirement>) -> Option<Vec<String>> { | ||
| match gpu { | ||
| Some(gpu) if gpu.device_ids.is_empty() => Some(vec![CDI_GPU_DEVICE_ALL.to_string()]), | ||
| Some(gpu) => Some(gpu.device_ids.clone()), | ||
| None => None, | ||
| } |
There was a problem hiding this comment.
One of the core changes of #1675 is to not default to nvidia.com/gpu=all when an empty device_id is specified, but instead select the "next" GPU id. We're currently keeping the behaviour unchanged here as part of this PR, but it may be useful to call this out in a comment.
There was a problem hiding this comment.
Another note is that #1675 builds on this PR to also add support for count-based GPU requests.
There was a problem hiding this comment.
Done. cdi_gpu_device_ids now documents that an empty explicit device list preserves the current all-GPU CDI default for this PR.
| fn driver_gpu_requirement( | ||
| spec: &openshell_core::proto::compute::v1::DriverSandboxSpec, | ||
| ) -> Option<&DriverGpuResourceRequirement> { | ||
| spec.resource_requirements | ||
| .as_ref() | ||
| .and_then(|requirements| requirements.gpu.as_ref()) | ||
| } |
There was a problem hiding this comment.
As a question: Is this a common Rust pattern to extract a reference to a specific field (or sub-field) form a type?
There was a problem hiding this comment.
Another question: This specific function does not seem docker-specific. Should it be moved to somewhere like crates/openshell-core/src/gpu.rs if it is also present in podman or vm implementations?
There was a problem hiding this comment.
Done. The shared extraction helper now lives in openshell_core::gpu, and Docker/Podman/Kubernetes/VM use the common driver_gpu_requirement path.
| if let Some(gpu) = sandbox.spec.as_ref().and_then(driver_gpu_requirement) | ||
| && gpu_has_explicit_device_ids(Some(gpu)) | ||
| { | ||
| return Err(KubernetesDriverError::Precondition( | ||
| "kubernetes compute driver does not support explicit GPU device IDs".to_string(), | ||
| )); | ||
| } |
There was a problem hiding this comment.
Why are we seemingly duplicating the sandbox validation here?
| sandbox_template_to_k8s( | ||
| &SandboxTemplate::default(), | ||
| spec.is_some_and(|s| s.gpu), | ||
| spec.and_then(driver_gpu_requirement), |
There was a problem hiding this comment.
What is the difference between spec.and_then(driver_gpu_requirement) and just calling driver_gpu_requirement(spec). If they're equivalent, can we settle on one of these for consistency?
| static ENV_LOCK: std::sync::LazyLock<std::sync::Mutex<()>> = | ||
| std::sync::LazyLock::new(|| std::sync::Mutex::new(())); | ||
|
|
||
| fn gpu_request(count: Option<u32>) -> DriverGpuResourceRequirement { |
There was a problem hiding this comment.
Why do we have multiple mechanisms to construct a DriverGpuResourceRequirement in the kubernetes driver? This particular method seems to only be used in testing. If this is the case, is there a way to mark it as such. Furthermore, if we could "normalize" the construction, so that we always construct a DriverGpuResourceRequirement internally with the default count (i.e. the default request), then we don't need to remember to handle the special case when Count is not specified.
There was a problem hiding this comment.
Done. The construction helpers are now local to the Kubernetes test module, with default_gpu_request() for the default request and gpu_request(count) only for count-specific cases.
| fn gpu_has_explicit_device_ids_only_when_ids_are_present() { | ||
| assert!(!gpu_has_explicit_device_ids(None)); | ||
| assert!(!gpu_has_explicit_device_ids(Some( | ||
| &DriverGpuResourceRequirement { | ||
| device_ids: vec![], | ||
| count: None, | ||
| } | ||
| ))); | ||
| assert!(gpu_has_explicit_device_ids(Some( | ||
| &DriverGpuResourceRequirement { | ||
| device_ids: vec!["nvidia.com/gpu=0".to_string()], | ||
| count: None, | ||
| } | ||
| ))); | ||
| } |
There was a problem hiding this comment.
Why are we testing this in the kubernetes driver? Explicit device IDs are not supported here.
There was a problem hiding this comment.
Done. The Kubernetes test now verifies that explicit GPU device IDs are rejected, since Kubernetes does not support device-specific requests in this PR.
4c18100 to
cec0e21
Compare
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
cec0e21 to
acfb3d4
Compare
Summary
Adds compact GPU resource requirements to the public and compute-driver protos, moving the API toward the resource-requirements shape proposed in #1360.
Related Issue
Related to #1444 and #1360
Closes #1338
Changes
SandboxSpec.gpuwithSandboxSpec.resource_requirements.gpu.device_ids.Testing
mise run pre-commitChecklist