Skip to content

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703

Open
st-gr wants to merge 4 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket
Open

feat(server): support out-of-tree compute drivers via --compute-driver-socket#1703
st-gr wants to merge 4 commits into
NVIDIA:mainfrom
st-gr:feat/external-compute-driver-socket

Conversation

@st-gr
Copy link
Copy Markdown

@st-gr st-gr commented Jun 3, 2026

Summary

Adds an opt-in --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) to the gateway. When set, the gateway dispatches sandbox lifecycle to an out-of-tree compute driver speaking the existing compute_driver.proto contract over a Unix domain socket, instead of one of the four built-in drivers (Kubernetes / Podman / Docker / VM).

The implementation deliberately avoids adding a fifth ComputeDriverKind variant — the gateway treats out-of-tree drivers as an acquired endpoint rather than a parallel kind in the in-tree enum. This was rewritten in response to review feedback from @elezar and @drew (see "Review history" below).

Related Issue

Supersedes the closed draft #1604 (same author, same patch shape — re-opened as a non-draft after rebasing onto current main, with all commits DCO-signed and st-gr-attributed). Same direction as cheese-head's vouch in #1345 ("extend or provide new out-of-tree compute drivers to OpenShell").

No upstream tracking issue filed — happy to file one if reviewers prefer.

Changes

  • crates/openshell-core/src/config.rs: adds Config.external_compute_driver_socket: Option<PathBuf> plus with_external_compute_driver_socket(...). No new ComputeDriverKind variant. The four in-tree variants remain Copy and bare.
  • crates/openshell-server/src/cli.rs: adds --compute-driver-socket=<path> flag (env OPENSHELL_COMPUTE_DRIVER_SOCKET) on RunArgs. When set, effective_single_driver returns None, short-circuiting both --drivers and the auto-detect probe so callers keyed off the in-tree enum skip the match.
  • crates/openshell-server/src/compute/mod.rs:
    • ComputeRuntime::from_driver now takes Option<ComputeDriverKind>; the four in-tree constructors wrap their kind in Some(...). Out-of-tree drivers pass None.
    • connect_external_compute_driver(&Path) produces a tonic Channel over a pre-existing UDS, mirroring the connector pattern the VM driver already uses (hyper-util::TokioIo + tokio::net::UnixStream).
    • ComputeRuntime::new_remote_external consumes the channel via the existing RemoteComputeDriver proxy and skips shutdown cleanup / managed-process supervision (the operator owns the lifecycle).
    • from_driver logs the driver_name advertised by GetCapabilities whenever driver_kind is None, so operators can confirm the gateway connected to the driver they expect (logged for diagnostics — not validated against operator-supplied input in this PR).
  • crates/openshell-server/src/lib.rs: build_compute_runtime short-circuits to connect_external_compute_driver + new_remote_external when Config.external_compute_driver_socket is set, before consulting --drivers / auto-detect. The four in-tree backends keep their existing dispatch arms unchanged.
  • architecture/compute-runtimes.md: adds an "External" row to the Runtime Summary table (activation flag, GetCapabilities.driver_name logging, operator-owned process+socket lifecycle, trust boundary = filesystem permissions on the UDS) and a matching row in Supervisor Delivery.

No compute_driver.proto changes. No new workspace dependencies (hyper-util and tower are already in scope; tokio already exposes UnixStream under cfg(unix)). No security-model changes.

Testing

  • mise run pre-commit passes — not run end-to-end (mise not on author's dev environment), but the equivalent rust pieces verified independently in a rust:1.95.0-slim-bookworm dev container with the workspace's apt deps: cargo fmt --all -- --check clean; cargo clippy --no-deps --workspace --all-targets -- -D warnings clean; cargo test --workspace --lib --bins passes (incl. all 769 openshell-server lib tests).
  • Unit tests added/updated — 3 new CLI-arg tests in crates/openshell-server/src/cli.rs::tests: compute_driver_socket_flag_populates_run_args, compute_driver_socket_overrides_drivers_flag, compute_driver_socket_reads_from_env_var. The four existing in-tree compute-driver tests (configured_compute_driver_*) continue to pass unchanged.
  • E2E tests added/updated — none — running an External driver in CI would require shipping a stub driver binary; deferring until at least one in-tree consumer wants to test against it. The patch is exercised in production at SAP via the st-gr/openshell-driver-kyma reference driver running against a Gardener-managed Kyma cluster since 2026-05-28 (sandbox CR creation, openshell sandbox exec, openshell sandbox upload/download all work).

Checklist

  • Follows Conventional Commits (feat(core):, feat(server):, docs(compute):)
  • Commits are signed off (DCO)
  • Architecture docs updated — added "External" row to architecture/compute-runtimes.md::Runtime Summary and ::Supervisor Delivery. Per-runtime implementation-notes section is intentionally NOT extended for external drivers because they ship their own documentation; the reference implementation lives at st-gr/openshell-driver-kyma.

Review history

The original revision of this PR added ComputeDriverKind::External(PathBuf) as a fifth enum variant. @elezar and @drew both flagged this as the wrong shape — out-of-tree drivers are an acquired endpoint rather than a parallel kind. The current revision implements that direction:

  • No External enum variant. ComputeDriverKind is unchanged; out-of-tree dispatch lives on Config.external_compute_driver_socket and goes through a separate construction path that produces a ComputeRuntime with driver_kind = None.
  • (name, socket) tuple syntax is deferred. The flag remains a single path. GetCapabilities.driver_name is logged for diagnostics, not validated against operator input. This matches @drew's suggestion to defer name-keying / <name>@<socket> to a follow-up that aligns with the broader name-keyed driver registry direction (docs(rfc): add driver config passthrough proposal #1589).
  • Construction / consumption split. The VM driver still spawns its own subprocess; external drivers consume an operator-owned channel. Both paths share RemoteComputeDriver for the actual proto consumption — the only divergence is at construction.

Happy to revisit any of those choices if reviewers prefer a different cut point.

Notes for reviewers

Out of scope (intentional):

  • No new in-tree compute driver implementations. This PR only adds the activation flag + connector + runtime split. Operators bring their own driver process and point --compute-driver-socket at its socket path.
  • No compute_driver.proto changes. The existing protocol is reused verbatim — the same one already used by the in-tree VM driver to talk to its tonic peer.
  • No security model changes. UDS endpoint security is the operator's responsibility (filesystem permissions, sidecar isolation), matching the --drivers vm posture.

Operator context:

The forked gateway image at ghcr.io/st-gr/openshell-gateway (carrying these patches plus an in-tree Dockerfile.gateway) has been driving production-shaped Kubernetes clusters with the st-gr/openshell-driver-kyma compute driver since 2026-05-28. The flag has needed zero changes since it landed — the API shape is stable. Merging this PR lets the driver consume the unmodified upstream gateway image, retiring the fork.

@st-gr st-gr requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 3, 2026 05:27
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

Thank you for your submission! We ask that you all sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text:


I have read the DCO document and I hereby sign the DCO.


1 out of 2 committers have signed the DCO.
✅ (st-gr)[https://github.com/st-gr]
@Z-I-Z-I
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the DCO Assistant Lite bot.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

I have read the DCO document and I hereby sign the DCO.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

recheck

@drew drew self-assigned this Jun 3, 2026
@drew
Copy link
Copy Markdown
Collaborator

drew commented Jun 3, 2026

/ok to test 8689967

@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 8689967 to 464018b Compare June 3, 2026 08:18
@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

Pushed a doc_markdown fix (added backticks around compute_driver.proto and OPENSHELL_COMPUTE_DRIVER_SOCKET in the External variant's doc comment, folded into the originating feat(core): commit via autosquash).

/ok to test 464018b7 — superseded by a second force-push (redundant_pub_crate fix); see the follow-up comment below for the current SHA.

st-gr added a commit to st-gr/OpenShell that referenced this pull request Jun 3, 2026
NVIDIA/OpenShell's Cargo.toml has

    [workspace.lints.clippy]
    pedantic = "warn"
    nursery = "warn"
    all = "warn"

so `cargo clippy --workspace -- -D warnings` escalates pedantic lints
(including `doc_markdown`) to errors. Their CI uses this as part of
`mise run pre-commit`. The fork's existing `release-gateway.yml` only
runs `docker build` (which calls `cargo build --release` — no clippy),
so the same lints weren't enforced on the fork.

PR NVIDIA#1703 caught us with two `doc_markdown` violations
in `crates/openshell-core/src/config.rs:56` (`compute_driver.proto` and
`OPENSHELL_COMPUTE_DRIVER_SOCKET` missing backticks in the External
variant's doc comment). The fork's CI didn't catch it; NVIDIA's did.

This workflow runs on push to main + `feat/**` branches and on
workflow_dispatch. It pins to Rust 1.95.0 (matches upstream mise.toml)
and installs the same native deps `Dockerfile.gateway` needs
(libclang-dev + libz3-dev for transitive bindgen + z3-sys).

Workflow is light: ubuntu-latest runner, ~10-15 min cold, ~3-5 min
with the Swatinem/rust-cache hot. Path-filtered so doc-only changes
don't trigger it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 464018b to 1c6663d Compare June 3, 2026 09:01
@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

Force-pushed once more — caught a second pedantic clippy lint locally that the previous run didn't get to (redundant_pub_crate on connect_external_compute_driver). Folded into the originating feat(server): wire ... commit via autosquash so the PR stays at 5 commits.

New HEAD: 1c6663d. Verified clean with the full-workspace check NVIDIA's CI runs:

      cargo fmt --all -- --check
      cargo clippy --workspace --all-targets -- -D warnings

Could I get a re-vet with /ok to test 1c6663d3?

Apologies for the round-trip — I've added a rust-lint workflow to my fork to mirror your mise run pre-commit so future pushes catch these locally before they reach your runners.

Copy link
Copy Markdown
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking ahead to changes such as #1589 and support for multiple drivers in future, does it make sense to formulate this as a "named driver endpoint" where a user would supply a driver name and a socket path? Especially given that the in-tree vm driver is already an example of such a driver and we would only be extending the existing functionality.

In practice, the gateway would select a driver by name (instead of a kind enum) and one could use GetCapabilities.driver_name to validate the user-provided name if required.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

@elezar
I read through #1589 / RFC 0005. The named-endpoint shape and GetCapabilities.driver_name validation both fit cleanly. Two ways to implement your suggestion in this PR:

A. Minimal add. Keep External(PathBuf) and add a paired --compute-driver-name=<name> flag. After the UDS connect, the gateway calls GetCapabilities, compares response.driver_name to the supplied name, and errors loudly on mismatch. The enum surface stays flat. Could serve as a stepping stone; when #1589's registry lands, it can replace the enum entirely without re-migrating this code.

B. Restructured External. Extend the variant to External { name: String, socket: PathBuf }. More visible signal of the multi-driver direction at the type level, but touches FromStr / Display / their tests.

Replacing ComputeDriverKind itself with a name-keyed registry (your second paragraph's "select a driver by name instead of a kind enum") reads to me as the actual #1589 implementation. I can work on that as a follow-up once the RFC has a concrete shape, but that's a much bigger refactor than this PR.

I'd lean (A) because it doesn't pre-commit the data model to a shape #1589 might want to rewrite, and it costs ~30 LOC. Either is fine, though. Let me know which you'd prefer to see here.

@elezar
Copy link
Copy Markdown
Member

elezar commented Jun 3, 2026

As a quick follow-up: What would it look like if we first migrate to implementing the vm driver as a specific-instance of a NamedDriver{ name: String, socket: PathBuf } where name == "vm" and then generalize from here?

Also note, that although #1589 is mentioned here as a motivator for restructuring how drivers are managed from a gateway perspective, moving to a name-keyed registry for managing drivers is not specifically related to the driver-specific config.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 3, 2026

Right, separating from #1589 makes sense, the named-driver registry is its own concern.

Confirmed against the #1051 OpenShell Drivers roadmap: Its stated intent of "the core OpenShell team maintains a small deployment footprint and third parties may develop their own drivers" is exactly the direction your NamedDriver proposal serves. So aligning this PR there feels right; I won't propose alternatives that pull Kyma (or any other operator-specific driver) in-tree.

Here's what Phase A could look like: A pure refactor that pulls today's Vm arm into a generalized shape, no behavior change for existing --drivers vm operators:

    pub struct NamedDriver {
        pub name: String,
        pub socket: PathBuf,
        /// Set when the gateway owns the driver's lifecycle (current
        /// `vm` behavior - `compute::vm::spawn` returning a
        /// `ManagedDriverProcess`). When `None`, the operator runs the
        /// driver out-of-process and supplies the socket path.
        pub spawn: Option<DriverSpawnConfig>,
    }

    pub enum ComputeDriverKind {
        Kubernetes,
        Docker,
        Podman,
        Named(NamedDriver),
    }

Maps cleanly to the existing wiring at crates/openshell-server/src/compute/mod.rs:

  • --drivers vm translates to Named(NamedDriver { name: "vm", socket: <generated under XDG_STATE>, spawn: Some(default_vm_spawn()) }). The CLI front-end keeps the vm alias for backward compat.
  • ManagedDriverProcess (lines 100-128) stays as-is - it's already the lifecycle holder for spawned drivers; it just gets reached via the Named arm instead of being conditional on ComputeDriverKind::Vm.
  • new_remote_vm (line 388) becomes new_remote_named - same signature, same Option<Arc<ManagedDriverProcess>> field.
  • GetCapabilities.driver_name validation lands here: gateway compares Named.name against the proto response and errors at startup on mismatch.

Then Phase B (this PR) becomes additive:

  • --compute-driver=<name>@<path> (or two flags) → Named(NamedDriver { name, socket, spawn: None }).
  • All the dispatch, lifecycle-watcher, and capability-check logic is shared with the VM path.

Ordering: I'd implement Phase A first as a precursor PR (pure refactor, no behavior change since --drivers vm still works through the alias), then rebase this PR on top. Each piece reviews more cleanly on its own. Happy to bundle into one PR if you'd rather see the full motivation in
a single diff.

Before I start coding, anything about the NamedDriver shape you'd want different? Specifically:

  • The spawn: Option<DriverSpawnConfig> distinction (gateway-managed vs operator-managed) - does that match how you'd want the boundary drawn?
  • name: String vs name: &'static str for the in-tree vm case - string keeps the field uniform with operator-supplied names; static would mean a separate type.
  • Whether Kubernetes / Docker / Podman should follow into Named(...) in this same precursor PR or stay as enum arms (they don't speak compute_driver.proto over UDS - they're in-tree adapters - so probably out of scope).

Once we converge on the shape, Phase A is straightforward.

@elezar
Copy link
Copy Markdown
Member

elezar commented Jun 4, 2026

Thanks, this is the right direction, but I'd separate the pieces slightly differently.

For Phase A, I don't think spawn should be part of the named driver definition. A named endpoint should describe how the gateway consumes a remote driver: { name, socket }. Whether that socket came from a gateway-spawned VM driver or from an operator-managed external process is a construction/lifecycle concern.

For the VM refactor in Phase A, the shape I'd like to see is:

  1. Resolve --drivers vm.
  2. Run the VM-specific construction path: read VM config, prepare the socket path, spawn openshell-driver-vm, and wait for readiness.
  3. Treat the result as an acquired named endpoint plus a lifecycle guard.
  4. From that point on, use the generic endpoint path: connect to the socket, wrap the channel in RemoteComputeDriver, and pass it into ComputeRuntime as a normal SharedComputeDriver.

Conceptually:

VM config -> spawn VM driver -> acquired named endpoint
acquired named endpoint -> RemoteComputeDriver -> ComputeRuntime

For Phase B, an external driver should enter at the same acquired-endpoint boundary:

external config -> existing socket -> acquired named endpoint
acquired named endpoint -> RemoteComputeDriver -> ComputeRuntime

That keeps driver construction separate from driver consumption. The VM lifecycle guard can own process/socket cleanup; an external driver can use a no-op lifecycle guard. ComputeRuntime should not need to care which acquisition path produced the driver.

I'd also prefer to avoid a special External enum variant if possible. The <name>@<socket> shape you proposed seems like a good fit here, especially if we expect to support multiple configured drivers later. With that said, I'd like @drew or @johntmyers to weigh in here in terms of the UX of this. We may not want to overload the existing flag, but then we need to consider how we want to handle multiple drivers and their mapping to endpoints.

Reserved names like vm, docker, podman, and kubernetes can keep their current meaning when supplied bare. Endpoint-based drivers can use the explicit endpoint form:

--drivers vm
--drivers vm@/path/to/compute-driver.sock
--drivers kyma@/path/to/compute-driver.sock

In that model, bare vm keeps the managed VM default path. vm@<socket> uses the VM construction path but overrides the managed socket path (if we wanted to allow this). A non-reserved name such as kyma@<socket> is an unmanaged out-of-tree endpoint-based driver. A non-reserved bare name should error until there is some other way to resolve its endpoint.

For the initial parser, I'd keep the rules narrow: parse the current comma-delimited --drivers entries independently, split endpoint entries on the first @, and reject empty names or empty socket paths. docker@..., podman@..., and kubernetes@... can remain reserved/errors for now unless we explicitly add endpoint-backed versions of those drivers.

One follow-up to call out is the gateway TOML mapping. Today [openshell.gateway].compute_drivers is the structured driver selection list, while [openshell.drivers.<name>] is intentionally driver-owned/raw TOML. If we accept <name>@<socket> on the CLI, we should decide whether the TOML uses the same string form:

[openshell.gateway]
compute_drivers = ["kyma@/run/openshell/kyma.sock"]

or a more structured form under the driver table:

[openshell.gateway]
compute_drivers = ["kyma"]

[openshell.drivers.kyma]
endpoint = "/run/openshell/kyma.sock"

I don't think that has to block Phase A, but it should be discussed and clarified before Phase B lands so CLI/env/TOML do not grow incompatible shapes.

Name matching via GetCapabilities.driver_name can be deferred as a later hardening step. The initial refactor only needs to establish the construction/consumption split and the common endpoint-based consumption path.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 5, 2026

Thanks @elezar — agree the named-driver-endpoint framing is the right shape, and lines up with #1589's name-keyed driver_config. Concrete proposal for this PR:

  • ComputeDriverKind::External(PathBuf)ComputeDriverKind::External { name: String, socket: PathBuf }. Other variants stay bare.
  • CLI: add --compute-driver-name=<name> (env OPENSHELL_COMPUTE_DRIVER_NAME) paired with the existing --compute-driver-socket. Both required when either is set; the pair pins External { name, socket } and skips --drivers and auto-detect.
  • On startup, after the tonic channel connects, the gateway calls GetCapabilities and fails fast if response.driver_name != name.

One scope question before I push: should I generalise the dispatch beyond External in this PR — e.g. fold Vm into the same name-keyed path so the registry is uniform — or keep that for the #1589 follow-up and leave the four in-tree variants as-is for now? My read is the latter (this PR stays narrow; #1589 / a successor RFC handles the registry), but happy to expand if you'd prefer it land in one go.

@drew
Copy link
Copy Markdown
Collaborator

drew commented Jun 5, 2026

I'd also prefer to avoid a special External enum variant if possible. The <name>@<socket> shape you proposed seems like a good fit here, especially if we expect to support multiple configured drivers later.

+1 to avoiding the External enum variant. I like the <name>@<socket> pattern, though I do worry a bit about overloading the string.

We will eventually want to support multiple drivers, and possibly different versions of the same driver.

That said, can we add the <name>@<socket> convention in a followup PR without breaking contracts? Would be good to see how external drivers gets used before including including additional overloads to the path. In the meantime we can rely on GetCapabilities.driver_name and assume it will be unique.

@drew
Copy link
Copy Markdown
Collaborator

drew commented Jun 5, 2026

@st-gr

One scope question before I push: should I generalise the dispatch beyond External in this PR — e.g. fold Vm into the same name-keyed path so the registry is uniform — or keep that for the #1589 follow-up and leave the four in-tree variants as-is for now? My read is the latter (this PR stays narrow; #1589 / a successor RFC handles the registry), but happy to expand if you'd prefer it land in one go.

+1 to your read. Let's leave the four in-tree for now.

@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 5, 2026

@elezar @drew — reading the convergence:

This PR:

  • Extract the construction/consumption split as @elezar described: VM and external both produce (channel, advertised_name) via dedicated acquisition paths; consumption (RemoteComputeDriverComputeRuntime::from_driver) is shared. VM construction keeps its ManagedDriverProcess lifecycle guard on its side of the boundary — endpoint type stays slim, no spawn field. UX for --drivers vm unchanged.
  • Drop ComputeDriverKind::External(PathBuf). Restore upstream Copy + const fn as_str. External driver is signalled by a separate Config.compute_driver_socket: Option<PathBuf> that takes precedence over compute_drivers in build_compute_runtime.
  • Keep --compute-driver-socket=<path> (env OPENSHELL_COMPUTE_DRIVER_SOCKET) as the single external flag — no <name>@<socket>, no user-supplied name. Gateway calls GetCapabilities once after connect and logs the advertised driver_name (no validation, per @drew).
  • Bare --drivers vm/docker/podman/kubernetes unchanged.

Deferred to follow-up(s): <name>@<socket> syntax + reserved-name discipline, multi-driver / multi-version, TOML shape choice, GetCapabilities.driver_name matching against a supplied name.

Branch will be rewritten as: (1) refactor extracting the boundary, (2) wire --compute-driver-socket through it, (3) drop External + revert the enum changes, (4) refresh the arch doc. Will start coding and force-push a new revision; flag if anything's misread.

Z-I-Z-I and others added 3 commits June 5, 2026 03:19
Carries the path to an out-of-tree compute driver's Unix domain
socket. When set, the gateway will dispatch sandbox lifecycle to
that driver instead of one of the in-tree backends, taking
precedence over `compute_drivers` and auto-detection.

The field lives on `Config` rather than as a `ComputeDriverKind`
variant so the four in-tree variants (`Kubernetes`, `Vm`, `Docker`,
`Podman`) keep their flat shape and the enum stays `Copy`.

Plumbing only — the dispatch wiring lands in a follow-up commit.

Signed-off-by: st-gr <124497920+st-gr@users.noreply.github.com>
Adds an optional `--compute-driver-socket=<path>` flag (env
`OPENSHELL_COMPUTE_DRIVER_SOCKET`) on `RunArgs`, plumbed through to
`Config.external_compute_driver_socket`. The flag activates dispatch
to an out-of-tree compute driver listening on a Unix domain socket and
takes precedence over both the `--drivers` list and the auto-detection
probe; `effective_single_driver` returns `None` when set so callers
keyed off `ComputeDriverKind` skip the in-tree match.

Tests cover flag parsing, env-var fallback, and the override of
`--drivers`. The driver name advertised in `GetCapabilities` is logged
for diagnostics in a follow-up commit that wires the channel.

Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Splits ComputeRuntime construction so that the gateway can dispatch to
an out-of-tree compute driver process listening on a Unix domain socket
the operator already owns, without adding a fifth `ComputeDriverKind`
variant.

* `ComputeRuntime::from_driver` now takes `Option<ComputeDriverKind>`;
  the four in-tree constructors wrap their kind in `Some(...)`. Out-of-
  tree drivers pass `None` so callers keyed off the enum skip the
  in-tree match.
* `connect_external_compute_driver` produces a tonic `Channel` over a
  pre-existing UDS path, mirroring the vm driver's connector. A
  `#[cfg(not(unix))]` stub returns the same error the vm path uses.
* `ComputeRuntime::new_remote_external` consumes the channel via the
  existing `RemoteComputeDriver` proxy and skips both shutdown cleanup
  and managed-process supervision (the operator owns the lifecycle).
* `from_driver` logs the `driver_name` advertised by `GetCapabilities`
  whenever `driver_kind` is `None`, so operators can confirm the
  gateway connected to the driver they expect.
* `build_compute_runtime` short-circuits to the external path when
  `Config.external_compute_driver_socket` is set, before consulting
  `--drivers` / auto-detect.

The four in-tree backends (Kubernetes, Vm, Docker, Podman) keep their
existing dispatch arms unchanged.

Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
Adds the External row to the runtime summary and supervisor delivery
tables: activated by --compute-driver-socket, GetCapabilities driver_name
logged for diagnostics, operator owns process and socket lifecycle, trust
boundary is the socket's filesystem permissions.

Also picks up rustfmt's normalization of the new imports and helper
signatures introduced in the previous two commits.

Signed-off-by: st-gr <38470677+st-gr@users.noreply.github.com>
@st-gr st-gr force-pushed the feat/external-compute-driver-socket branch from 1c6663d to be90905 Compare June 5, 2026 11:12
@st-gr
Copy link
Copy Markdown
Author

st-gr commented Jun 5, 2026

@elezar @drew — pushed the rewrite. PR body has been refreshed end-to-end to match the new shape. Summary of what changed since your last comments:

  • No External enum variant. ComputeDriverKind is unchanged. Out-of-tree dispatch now lives on Config.external_compute_driver_socket: Option<PathBuf> and goes through a separate construction path that produces a ComputeRuntime with driver_kind: None.
  • Construction / consumption split. ComputeRuntime::from_driver now takes Option<ComputeDriverKind>; the four in-tree constructors wrap their kind in Some(...), the new new_remote_external passes None. Both paths share the existing RemoteComputeDriver proxy for proto consumption.
  • (name, socket) tuple deferred per @drew's guidance — flag stays a single path, GetCapabilities.driver_name is logged for diagnostics only (no validation against operator input). Name-keying / <name>@<socket> is a follow-up that aligns with docs(rfc): add driver config passthrough proposal #1589.
  • Workspace verification: cargo fmt --check, cargo clippy --workspace --all-targets -D warnings, and cargo test --workspace --lib --bins all clean (incl. all 769 openshell-server lib tests).

Branch is force-pushed; commits are squash-friendly (4 small, named per conventional-commits). Happy to revisit any cut point if you'd prefer a different shape.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants