feat(gpu): introduce GPU request spec by elezar · Pull Request #1156 · NVIDIA/OpenShell

elezar · 2026-05-04T20:36:12Z

Summary

Adds compact GPU resource requirements to the public and compute-driver protos, moving the API toward the resource-requirements shape proposed in #1360.

Related Issue

Related to #1444 and #1360

Closes #1338

Changes

Replaces SandboxSpec.gpu with SandboxSpec.resource_requirements.gpu.
Supports default GPU requests, GPU count requests, and interim driver-specific device_ids.
Preserves driver behavior: Kubernetes honors counts, VM accepts one GPU, Docker/Podman reject count requests while using CDI device IDs.
Updates CLI mapping, gateway validation/translation, tests, and docs.

Testing

mise run pre-commit

Checklist

Follows Conventional Commits
Docs updated

copy-pr-bot · 2026-05-04T20:36:16Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions · 2026-05-13T16:45:22Z

🌿 Preview your docs: https://nvidia-preview-pr-1156.docs.buildwithfern.com/openshell

github-actions · 2026-05-18T13:45:05Z

Label test:e2e-gpu applied for cfceac9. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

drew · 2026-06-03T06:39:59Z

+message GpuRequestSpec {
+  // Optional driver-native device identifiers. Empty means the driver chooses
+  // its default GPU assignment behavior.
+  repeated string device_id = 1;


From the PR description

Replace legacy GPU-specific request fields in the public and driver protos with ResourceRequirements carrying a GPUSpec. This also adds GPU count requests while preserving the existing default GPU request shape used by --gpu and image-name auto-detection.

Where do we support counts in this PR?

Another thought, with the device_config convention we're trying to establish, does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.

Where do we support counts in this PR?

Looking locally, I think I had tested a version of this branch on a machine with GPUs and then pushed these changes without first pulling the commit including counts or forgot to push the local changes).

does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.

Using driver_config for GPU ids is an interesting idea and may make sense since the actual IDs available depend on the driver implementation (e.g. index vs BDF vs CDI device name).

This probably means that we're looking at the following shape:

driver_config: docker: device_ids: - nvidia.com/gpu=0 podman: device_ids: - nvidia.com/gpu=0 vm: device_ids: - 0

This would select the specified devices when --gpu-count=1 (assuming that we extend the docker and podman drivers to support this). We would also remove / deprecate the --gpu-device command line flag in this case and have --gpu imply --gpu-count=1. Note that the device_ids would ONLY be processed if a count is actually specified so that we don't allow driver-config to override sandbox-specific resources.

Let's me use this as a concrete driver for an implementaion of the proposal in #1589.

See #1716 as a draft POC for what this would look like. I still need to make another pass or two though.

elezar · 2026-06-03T13:08:42Z

Compatibility note: this revision intentionally keeps the proto compact while the API is still pre-alpha. It reuses the old direct GPU field position for resource_requirements and does not reserve the removed GPU-related tags/names. If reviewers prefer compatibility protection here, we can reserve the old fields in a follow-up revision.

elezar · 2026-06-03T14:08:25Z

@drew I readded the counts (from local history), and then went a step further and added the ResourceRequirements envelope instead of doing that as a follow up.

This is currently a separate commit, so I can roll it back if we're more comfortable with the initial change.

elezar · 2026-06-03T19:14:02Z

                        gpu,
                        gpu_device.as_deref(),
+                        gpu_count,


As a follow-up: Does it make sense to introduce a type for gpu and other resources going forward?

elezar · 2026-06-03T19:15:30Z

    gpu: bool,
    gpu_device: Option<&str>,
+    gpu_count: Option<u32>,
    cpu: Option<&str>,
    memory: Option<&str>,


For follow-up: Let's add a struct that handles these resources together instead of adding new arguments.

elezar · 2026-06-03T19:18:09Z

+pub fn cdi_gpu_device_ids(gpu: Option<&DriverGpuResourceRequirement>) -> Option<Vec<String>> {
+    match gpu {
+        Some(gpu) if gpu.device_ids.is_empty() => Some(vec![CDI_GPU_DEVICE_ALL.to_string()]),
+        Some(gpu) => Some(gpu.device_ids.clone()),
+        None => None,
+    }


One of the core changes of #1675 is to not default to nvidia.com/gpu=all when an empty device_id is specified, but instead select the "next" GPU id. We're currently keeping the behaviour unchanged here as part of this PR, but it may be useful to call this out in a comment.

Another note is that #1675 builds on this PR to also add support for count-based GPU requests.

Done. cdi_gpu_device_ids now documents that an empty explicit device list preserves the current all-GPU CDI default for this PR.

elezar · 2026-06-03T19:21:53Z

+fn driver_gpu_requirement(
+    spec: &openshell_core::proto::compute::v1::DriverSandboxSpec,
+) -> Option<&DriverGpuResourceRequirement> {
+    spec.resource_requirements
+        .as_ref()
+        .and_then(|requirements| requirements.gpu.as_ref())
+}


As a question: Is this a common Rust pattern to extract a reference to a specific field (or sub-field) form a type?

Another question: This specific function does not seem docker-specific. Should it be moved to somewhere like crates/openshell-core/src/gpu.rs if it is also present in podman or vm implementations?

Done. The shared extraction helper now lives in openshell_core::gpu, and Docker/Podman/Kubernetes/VM use the common driver_gpu_requirement path.

elezar · 2026-06-03T19:27:59Z

+        if let Some(gpu) = sandbox.spec.as_ref().and_then(driver_gpu_requirement)
+            && gpu_has_explicit_device_ids(Some(gpu))
+        {
+            return Err(KubernetesDriverError::Precondition(
+                "kubernetes compute driver does not support explicit GPU device IDs".to_string(),
+            ));
+        }


Why are we seemingly duplicating the sandbox validation here?

elezar · 2026-06-03T19:29:08Z

            sandbox_template_to_k8s(
                &SandboxTemplate::default(),
-                spec.is_some_and(|s| s.gpu),
+                spec.and_then(driver_gpu_requirement),


What is the difference between spec.and_then(driver_gpu_requirement) and just calling driver_gpu_requirement(spec). If they're equivalent, can we settle on one of these for consistency?

elezar · 2026-06-03T19:35:48Z

    static ENV_LOCK: std::sync::LazyLock<std::sync::Mutex<()>> =
        std::sync::LazyLock::new(|| std::sync::Mutex::new(()));

+    fn gpu_request(count: Option<u32>) -> DriverGpuResourceRequirement {


Why do we have multiple mechanisms to construct a DriverGpuResourceRequirement in the kubernetes driver? This particular method seems to only be used in testing. If this is the case, is there a way to mark it as such. Furthermore, if we could "normalize" the construction, so that we always construct a DriverGpuResourceRequirement internally with the default count (i.e. the default request), then we don't need to remember to handle the special case when Count is not specified.

Done. The construction helpers are now local to the Kubernetes test module, with default_gpu_request() for the default request and gpu_request(count) only for count-specific cases.

elezar · 2026-06-03T19:36:55Z

+    fn gpu_has_explicit_device_ids_only_when_ids_are_present() {
+        assert!(!gpu_has_explicit_device_ids(None));
+        assert!(!gpu_has_explicit_device_ids(Some(
+            &DriverGpuResourceRequirement {
+                device_ids: vec![],
+                count: None,
+            }
+        )));
+        assert!(gpu_has_explicit_device_ids(Some(
+            &DriverGpuResourceRequirement {
+                device_ids: vec!["nvidia.com/gpu=0".to_string()],
+                count: None,
+            }
+        )));
+    }


Why are we testing this in the kubernetes driver? Explicit device IDs are not supported here.

Done. The Kubernetes test now verifies that explicit GPU device IDs are rejected, since Kubernetes does not support device-specific requests in this PR.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 8b95e09 to 3d17c4e Compare May 13, 2026 16:00

This was referenced May 13, 2026

feat: add GPU count support for Kubernetes sandboxes #1338

Open

feat(sandbox): add typed resource spec #1340

Closed

elezar force-pushed the feat/gpu-request-spec branch from 3d17c4e to 930c581 Compare May 13, 2026 16:44

elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 07b171d to dd18d21 Compare May 15, 2026 15:12

elezar marked this pull request as ready for review May 15, 2026 15:12

elezar requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 15, 2026 15:12

elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 9ff168a to cfceac9 Compare May 18, 2026 13:28

elezar added the test:e2e-gpu Requires GPU end-to-end coverage label May 18, 2026

elezar force-pushed the feat/gpu-request-spec branch 3 times, most recently from ff9e99d to 76af3a3 Compare May 20, 2026 11:57

elezar mentioned this pull request May 20, 2026

GPU-accelerated Sandbox Support #1444

Open

9 tasks

elezar force-pushed the feat/gpu-request-spec branch 4 times, most recently from 7f09b83 to 3c6a501 Compare June 1, 2026 21:15

drew reviewed Jun 3, 2026

View reviewed changes

elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 7bdaa73 to 23bdbb1 Compare June 3, 2026 11:29

elezar mentioned this pull request Jun 3, 2026

fix(gpu): select single CDI GPU defaults #1675

Open

2 tasks

elezar mentioned this pull request Jun 3, 2026

feat(gpu): route device selection through driver config #1716

Draft

8 tasks