feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703
feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703st-gr wants to merge 4 commits into
Conversation
|
Thank you for your submission! We ask that you all sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text: I have read the DCO document and I hereby sign the DCO. 1 out of 2 committers have signed the DCO. |
|
I have read the DCO document and I hereby sign the DCO. |
|
recheck |
|
/ok to test 8689967 |
8689967 to
464018b
Compare
|
Pushed a
|
NVIDIA/OpenShell's Cargo.toml has
[workspace.lints.clippy]
pedantic = "warn"
nursery = "warn"
all = "warn"
so `cargo clippy --workspace -- -D warnings` escalates pedantic lints
(including `doc_markdown`) to errors. Their CI uses this as part of
`mise run pre-commit`. The fork's existing `release-gateway.yml` only
runs `docker build` (which calls `cargo build --release` — no clippy),
so the same lints weren't enforced on the fork.
PR NVIDIA#1703 caught us with two `doc_markdown` violations
in `crates/openshell-core/src/config.rs:56` (`compute_driver.proto` and
`OPENSHELL_COMPUTE_DRIVER_SOCKET` missing backticks in the External
variant's doc comment). The fork's CI didn't catch it; NVIDIA's did.
This workflow runs on push to main + `feat/**` branches and on
workflow_dispatch. It pins to Rust 1.95.0 (matches upstream mise.toml)
and installs the same native deps `Dockerfile.gateway` needs
(libclang-dev + libz3-dev for transitive bindgen + z3-sys).
Workflow is light: ubuntu-latest runner, ~10-15 min cold, ~3-5 min
with the Swatinem/rust-cache hot. Path-filtered so doc-only changes
don't trigger it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
464018b to
1c6663d
Compare
|
Force-pushed once more — caught a second pedantic clippy lint locally that the previous run didn't get to ( New HEAD: 1c6663d. Verified clean with the full-workspace check NVIDIA's CI runs: Could I get a re-vet with Apologies for the round-trip — I've added a |
elezar
left a comment
There was a problem hiding this comment.
Thinking ahead to changes such as #1589 and support for multiple drivers in future, does it make sense to formulate this as a "named driver endpoint" where a user would supply a driver name and a socket path? Especially given that the in-tree vm driver is already an example of such a driver and we would only be extending the existing functionality.
In practice, the gateway would select a driver by name (instead of a kind enum) and one could use GetCapabilities.driver_name to validate the user-provided name if required.
|
@elezar A. Minimal add. Keep B. Restructured Replacing I'd lean (A) because it doesn't pre-commit the data model to a shape #1589 might want to rewrite, and it costs ~30 LOC. Either is fine, though. Let me know which you'd prefer to see here. |
|
As a quick follow-up: What would it look like if we first migrate to implementing the Also note, that although #1589 is mentioned here as a motivator for restructuring how drivers are managed from a gateway perspective, moving to a name-keyed registry for managing drivers is not specifically related to the driver-specific config. |
|
Right, separating from #1589 makes sense, the named-driver registry is its own concern. Confirmed against the #1051 OpenShell Drivers roadmap: Its stated intent of "the core OpenShell team maintains a small deployment footprint and third parties may develop their own drivers" is exactly the direction your Here's what Phase A could look like: A pure refactor that pulls today's Maps cleanly to the existing wiring at
Then Phase B (this PR) becomes additive:
Ordering: I'd implement Phase A first as a precursor PR (pure refactor, no behavior change since Before I start coding, anything about the
Once we converge on the shape, Phase A is straightforward. |
|
Thanks, this is the right direction, but I'd separate the pieces slightly differently. For Phase A, I don't think For the VM refactor in Phase A, the shape I'd like to see is:
Conceptually: For Phase B, an external driver should enter at the same acquired-endpoint boundary: That keeps driver construction separate from driver consumption. The VM lifecycle guard can own process/socket cleanup; an external driver can use a no-op lifecycle guard. I'd also prefer to avoid a special Reserved names like In that model, bare For the initial parser, I'd keep the rules narrow: parse the current comma-delimited One follow-up to call out is the gateway TOML mapping. Today [openshell.gateway]
compute_drivers = ["kyma@/run/openshell/kyma.sock"]or a more structured form under the driver table: [openshell.gateway]
compute_drivers = ["kyma"]
[openshell.drivers.kyma]
endpoint = "/run/openshell/kyma.sock"I don't think that has to block Phase A, but it should be discussed and clarified before Phase B lands so CLI/env/TOML do not grow incompatible shapes. Name matching via |
|
Thanks @elezar — agree the named-driver-endpoint framing is the right shape, and lines up with #1589's name-keyed
One scope question before I push: should I generalise the dispatch beyond |
+1 to avoiding the External enum variant. I like the We will eventually want to support multiple drivers, and possibly different versions of the same driver. That said, can we add the |
+1 to your read. Let's leave the four in-tree for now. |
|
@elezar @drew — reading the convergence: This PR:
Deferred to follow-up(s): Branch will be rewritten as: (1) refactor extracting the boundary, (2) wire |
Carries the path to an out-of-tree compute driver's Unix domain socket. When set, the gateway will dispatch sandbox lifecycle to that driver instead of one of the in-tree backends, taking precedence over `compute_drivers` and auto-detection. The field lives on `Config` rather than as a `ComputeDriverKind` variant so the four in-tree variants (`Kubernetes`, `Vm`, `Docker`, `Podman`) keep their flat shape and the enum stays `Copy`. Plumbing only — the dispatch wiring lands in a follow-up commit. Signed-off-by: st-gr <124497920+st-gr@users.noreply.github.com>
Adds an optional `--compute-driver-socket=<path>` flag (env `OPENSHELL_COMPUTE_DRIVER_SOCKET`) on `RunArgs`, plumbed through to `Config.external_compute_driver_socket`. The flag activates dispatch to an out-of-tree compute driver listening on a Unix domain socket and takes precedence over both the `--drivers` list and the auto-detection probe; `effective_single_driver` returns `None` when set so callers keyed off `ComputeDriverKind` skip the in-tree match. Tests cover flag parsing, env-var fallback, and the override of `--drivers`. The driver name advertised in `GetCapabilities` is logged for diagnostics in a follow-up commit that wires the channel. Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Splits ComputeRuntime construction so that the gateway can dispatch to an out-of-tree compute driver process listening on a Unix domain socket the operator already owns, without adding a fifth `ComputeDriverKind` variant. * `ComputeRuntime::from_driver` now takes `Option<ComputeDriverKind>`; the four in-tree constructors wrap their kind in `Some(...)`. Out-of- tree drivers pass `None` so callers keyed off the enum skip the in-tree match. * `connect_external_compute_driver` produces a tonic `Channel` over a pre-existing UDS path, mirroring the vm driver's connector. A `#[cfg(not(unix))]` stub returns the same error the vm path uses. * `ComputeRuntime::new_remote_external` consumes the channel via the existing `RemoteComputeDriver` proxy and skips both shutdown cleanup and managed-process supervision (the operator owns the lifecycle). * `from_driver` logs the `driver_name` advertised by `GetCapabilities` whenever `driver_kind` is `None`, so operators can confirm the gateway connected to the driver they expect. * `build_compute_runtime` short-circuits to the external path when `Config.external_compute_driver_socket` is set, before consulting `--drivers` / auto-detect. The four in-tree backends (Kubernetes, Vm, Docker, Podman) keep their existing dispatch arms unchanged. Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Adds the External row to the runtime summary and supervisor delivery tables: activated by --compute-driver-socket, GetCapabilities driver_name logged for diagnostics, operator owns process and socket lifecycle, trust boundary is the socket's filesystem permissions. Also picks up rustfmt's normalization of the new imports and helper signatures introduced in the previous two commits. Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
1c6663d to
be90905
Compare
|
@elezar @drew — pushed the rewrite. PR body has been refreshed end-to-end to match the new shape. Summary of what changed since your last comments:
Branch is force-pushed; commits are squash-friendly (4 small, named per conventional-commits). Happy to revisit any cut point if you'd prefer a different shape. |
Summary
Adds an opt-in
--compute-driver-socket=<path>flag (envOPENSHELL_COMPUTE_DRIVER_SOCKET) to the gateway. When set, the gateway dispatches sandbox lifecycle to an out-of-tree compute driver speaking the existingcompute_driver.protocontract over a Unix domain socket, instead of one of the four built-in drivers (Kubernetes / Podman / Docker / VM).The implementation deliberately avoids adding a fifth
ComputeDriverKindvariant — the gateway treats out-of-tree drivers as an acquired endpoint rather than a parallel kind in the in-tree enum. This was rewritten in response to review feedback from @elezar and @drew (see "Review history" below).Related Issue
Supersedes the closed draft #1604 (same author, same patch shape — re-opened as a non-draft after rebasing onto current
main, with all commits DCO-signed and st-gr-attributed). Same direction ascheese-head's vouch in #1345 ("extend or provide new out-of-tree compute drivers to OpenShell").No upstream tracking issue filed — happy to file one if reviewers prefer.
Changes
crates/openshell-core/src/config.rs: addsConfig.external_compute_driver_socket: Option<PathBuf>pluswith_external_compute_driver_socket(...). No newComputeDriverKindvariant. The four in-tree variants remainCopyand bare.crates/openshell-server/src/cli.rs: adds--compute-driver-socket=<path>flag (envOPENSHELL_COMPUTE_DRIVER_SOCKET) onRunArgs. When set,effective_single_driverreturnsNone, short-circuiting both--driversand the auto-detect probe so callers keyed off the in-tree enum skip the match.crates/openshell-server/src/compute/mod.rs:ComputeRuntime::from_drivernow takesOption<ComputeDriverKind>; the four in-tree constructors wrap their kind inSome(...). Out-of-tree drivers passNone.connect_external_compute_driver(&Path)produces a tonicChannelover a pre-existing UDS, mirroring the connector pattern the VM driver already uses (hyper-util::TokioIo+tokio::net::UnixStream).ComputeRuntime::new_remote_externalconsumes the channel via the existingRemoteComputeDriverproxy and skips shutdown cleanup / managed-process supervision (the operator owns the lifecycle).from_driverlogs thedriver_nameadvertised byGetCapabilitieswheneverdriver_kindisNone, so operators can confirm the gateway connected to the driver they expect (logged for diagnostics — not validated against operator-supplied input in this PR).crates/openshell-server/src/lib.rs:build_compute_runtimeshort-circuits toconnect_external_compute_driver+new_remote_externalwhenConfig.external_compute_driver_socketis set, before consulting--drivers/ auto-detect. The four in-tree backends keep their existing dispatch arms unchanged.architecture/compute-runtimes.md: adds an "External" row to the Runtime Summary table (activation flag,GetCapabilities.driver_namelogging, operator-owned process+socket lifecycle, trust boundary = filesystem permissions on the UDS) and a matching row in Supervisor Delivery.No
compute_driver.protochanges. No new workspace dependencies (hyper-utilandtowerare already in scope;tokioalready exposesUnixStreamundercfg(unix)). No security-model changes.Testing
mise run pre-commitpasses — not run end-to-end (misenot on author's dev environment), but the equivalent rust pieces verified independently in arust:1.95.0-slim-bookwormdev container with the workspace's apt deps:cargo fmt --all -- --checkclean;cargo clippy --no-deps --workspace --all-targets -- -D warningsclean;cargo test --workspace --lib --binspasses (incl. all 769openshell-serverlib tests).crates/openshell-server/src/cli.rs::tests:compute_driver_socket_flag_populates_run_args,compute_driver_socket_overrides_drivers_flag,compute_driver_socket_reads_from_env_var. The four existing in-tree compute-driver tests (configured_compute_driver_*) continue to pass unchanged.Externaldriver in CI would require shipping a stub driver binary; deferring until at least one in-tree consumer wants to test against it. The patch is exercised in production at SAP via thest-gr/openshell-driver-kymareference driver running against a Gardener-managed Kyma cluster since 2026-05-28 (sandbox CR creation,openshell sandbox exec,openshell sandbox upload/downloadall work).Checklist
feat(core):,feat(server):,docs(compute):)architecture/compute-runtimes.md::Runtime Summaryand::Supervisor Delivery. Per-runtime implementation-notes section is intentionally NOT extended for external drivers because they ship their own documentation; the reference implementation lives atst-gr/openshell-driver-kyma.Review history
The original revision of this PR added
ComputeDriverKind::External(PathBuf)as a fifth enum variant. @elezar and @drew both flagged this as the wrong shape — out-of-tree drivers are an acquired endpoint rather than a parallel kind. The current revision implements that direction:Externalenum variant.ComputeDriverKindis unchanged; out-of-tree dispatch lives onConfig.external_compute_driver_socketand goes through a separate construction path that produces aComputeRuntimewithdriver_kind = None.(name, socket)tuple syntax is deferred. The flag remains a single path.GetCapabilities.driver_nameis logged for diagnostics, not validated against operator input. This matches @drew's suggestion to defer name-keying /<name>@<socket>to a follow-up that aligns with the broader name-keyed driver registry direction (docs(rfc): add driver config passthrough proposal #1589).RemoteComputeDriverfor the actual proto consumption — the only divergence is at construction.Happy to revisit any of those choices if reviewers prefer a different cut point.
Notes for reviewers
Out of scope (intentional):
--compute-driver-socketat its socket path.compute_driver.protochanges. The existing protocol is reused verbatim — the same one already used by the in-tree VM driver to talk to its tonic peer.--drivers vmposture.Operator context:
The forked gateway image at
ghcr.io/st-gr/openshell-gateway(carrying these patches plus an in-treeDockerfile.gateway) has been driving production-shaped Kubernetes clusters with thest-gr/openshell-driver-kymacompute driver since 2026-05-28. The flag has needed zero changes since it landed — the API shape is stable. Merging this PR lets the driver consume the unmodified upstream gateway image, retiring the fork.