refactor(openshell-sandbox): Split `sandbox` into `process` and `network` subcrates. by rrhubenov · Pull Request #1650 · NVIDIA/OpenShell

rrhubenov · 2026-06-01T13:03:36Z

Summary

This PR is an implementation of #1305.
For full context on why this is required, please read #981 and #1305 in that order.

In summary, this split has been proposed for 2 main reasons:

As a precursor to a possible split-pod topology of the supervisor (previously sandbox)
In general a better decomposition of the 2 main "categories" of functionality ("process" and "network")

Although this change has been mostly evaluated from the perspective of the "split pod" topology, this splitting allows for a more agile topology regardless of environment. If for any reason the "process" and "network" functionality need to work as standalone processes, this split is required.

Review suggestions:

This refactor is hard to review due to a big collection of changes that are mostly movements of files and/or code blocks. For this reason, I suggest that reviewers take the approach of reviewing the changes commit by commit. I have taken care that each commit is as small as possible, moving only one functionality (module, function, etc.) in one commit.

Bear in mind that during the implementation, one very important architectural decision is that the 2 new crates (supervisor-network and supervisor-process) do not have a dependency on one another.
Only the "orcherstrator" crate, that being the openshell-supervisor should have a dependency on both of them.
One other important point was that no behavioural changes should occur as a side-effect from this refactoring.

NOTE:
I haven't renamed the openshell-sandbox crate to openshell-supervisor since if done fully (e.g. renaming the binary as well), would result in a breaking change which I see as out of scope for this PR.

I have also tried to minimise the amount of architectural design decisions in this PR, since it is already big enough.

Related Issue

Closes #1305

Changes

Testing

mise run pre-commit passes
Unit tests added/updated (no update was required due to this change being only a refactor) N/A
E2E tests added/updated (if applicable) N/A

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable) N/A - review determined that docs do not go into implementation details

Lifts TLS state generation, network namespace setup, proxy startup, bypass monitor spawn, and SSH-side proxy URL / netns FD computation out of run_sandbox into a sibling async fn `run_networking` that returns a Networking struct. The identity cache moves with it (only consumed by the proxy). Entrypoint PID allocation moves just above the call site so it can be passed in. No behavior changes — same OCSF emits, same async order, same RAII lifetimes for the proxy and bypass-monitor handles, now held by the returned Networking value in run_sandbox's frame. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Lifts the post-networking tail of `run_sandbox` (zombie reaper, SSH server, supervisor session, process spawn, OPA probe, policy poll loop, denial aggregator, wait/exit) into a sibling async fn `run_process`. Also moves network namespace creation out of `run_networking` into a new `create_netns_for_proxy` helper invoked from `run_sandbox`, so `run_networking` is purely the proxy component (OPA evaluation, TLS interception, credential injection, inference routing, gRPC control API). The netns is then borrowed into both `run_networking` and `run_process`. No behavior change. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…ell-supervisor-process crates Add empty placeholder crates that subsequent commits will populate as the sandbox decomposition proceeds. Both crates compile clean as part of the workspace and are picked up automatically by the existing `members = ["crates/*"]` glob. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

The DenialEvent struct is emitted by both the proxy/L7 layer (networking-side) and the bypass monitor (process-side), and crosses the run_networking -> run_process API boundary. Move it to openshell-core so the eventual supervisor-networking and supervisor-process crates can both reference it without depending on each other. DenialAggregator and the channel/flush helpers stay in openshell-sandbox for now. A thin `pub use openshell_core::DenialEvent;` re-export from denial_aggregator.rs keeps every existing `crate::denial_aggregator::DenialEvent` call site resolving without further edits. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move the lexical path-normalization helper from openshell-policy to openshell-core::paths so it can be reached from crates that sit below openshell-policy in the dependency graph. openshell-policy keeps its existing public API via a `pub use` re-export, so all current call sites (e.g. openshell-sandbox/src/policy.rs) continue to resolve unchanged. This is a prerequisite for lifting openshell-sandbox/src/policy.rs into openshell-core: that file's `From<ProtoFilesystemPolicy>` impl calls normalize_path, and lifting it as-is would cycle through openshell-policy. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move openshell-sandbox/src/policy.rs (SandboxPolicy, NetworkPolicy, ProxyPolicy, FilesystemPolicy, LandlockPolicy, ProcessPolicy, NetworkMode, LandlockCompatibility, plus their Proto* TryFrom/From impls) to openshell-core/src/policy.rs. Both prospective supervisor leaves (networking and process) dispatch on SandboxPolicy. Hosting it in openshell-core lets either leaf reach for it without depending on the other (or on the future orchestrator). The From<ProtoFilesystemPolicy> impl now calls the in-crate openshell_core::paths::normalize_path lifted in the previous commit, which is what made this move cycle-free. Update all crate::policy::* call sites in openshell-sandbox to openshell_core::policy::*. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

child_env (proxy_env_vars, tls_env_vars) is process-side only — consumed by process.rs and ssh.rs. With the orchestrator staying in openshell-sandbox (Shape A), openshell-sandbox depends on the new leaf crates, so process-only modules can land in openshell-supervisor-process directly. Add openshell-supervisor-process as a path dependency of openshell-sandbox. Update process.rs and ssh.rs to import from openshell_supervisor_process::child_env. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move the static skills installer (and its embedded resource directory) out of openshell-sandbox into openshell-supervisor-process. The module is process-side only — invoked once during sandbox start to drop agent skill files into the workspace — and has no cross-leaf consumers. Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-process. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…ll-sandbox Move the mechanistic mapper (HTTP method/path → operation classifier that derives policy proposals from connection summaries) out of openshell-sandbox into openshell-supervisor-networking. Single internal caller (run_policy_poll_loop in lib.rs) and only depends on openshell-core + tracing — no cross-leaf entanglement. First population of the openshell-supervisor-networking crate; adds openshell-core and tracing as dependencies. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move procfs (PID lookups, ancestor walking, /proc/net/tcp socket-owner resolution, file SHA256 hashing) from openshell-sandbox into openshell-core. The module is consumed cross-leaf — by bypass_monitor on the process side and by identity / proxy on the networking side — so it has to sit below both leaves. Adds tracing, sha2, and hex as dependencies on openshell-core. Updates the three call sites in openshell-sandbox to import from openshell_core::procfs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move BinaryIdentityCache (path → SHA256 cache used to identify the process behind an outbound connection) from openshell-sandbox into openshell-supervisor-networking. The cache is consumed only by the networking-side proxy and the orchestrator; with procfs already in openshell-core there are no remaining cross-leaf dependencies. Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-networking. Adds a Default impl for BinaryIdentityCache to satisfy clippy::new_without_default now that the type is publicly exposed across crates. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…l-sandbox Move AGENT_PROPOSALS_ENABLED, agent_proposals_enabled(), and the test-only ProposalsFlagGuard out of openshell-sandbox into openshell-supervisor-process::proposals. The flag is read only by the process-side policy_local route handler and the orchestrator; lifting it to openshell-core would have made core carry sandbox-owned runtime state without buying anything. The test-only ProposalsFlagGuard is still consumed from networking-side l7/rest tests today (until the wider Q2 OCSF/gRPC injection work lands). Expose it via a new optional `test-helpers` feature on openshell-supervisor-process so test crates opt in explicitly without pulling tokio sync primitives into production builds. openshell-sandbox keeps its existing crate-private path (`crate::AGENT_PROPOSALS_ENABLED`, `crate::test_helpers`) via re-exports so call sites and tests are unchanged. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move crates/openshell-sandbox/src/secrets.rs to crates/openshell-core/src/secrets.rs so both supervisor leaves can reach SecretResolver and the placeholder helpers without depending on openshell-sandbox. Add base64 to openshell-core deps (only stdlib + base64 are used). Promote previously pub(crate) constructors and methods on SecretResolver to pub since cross-crate callers (provider_credentials, proxy/L7 tests) now name them across the crate boundary. Update import paths in proxy.rs, l7/{rest,relay,websocket}.rs, and provider_credentials.rs from crate::secrets to openshell_core::secrets. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move crates/openshell-sandbox/src/provider_credentials.rs to crates/openshell-core/src/provider_credentials.rs. Both supervisor leaves now name ProviderCredentialState in their function signatures (run_networking takes &ProviderCredentialState, run_process takes ProviderCredentialState by value), and under Shape A leaves can't depend on openshell-sandbox, so the type must live in openshell-core. The orchestrator (run_sandbox in openshell-sandbox) remains the only writer: it constructs ProviderCredentialState::from_environment and the policy poll loop calls install_environment on credential rotation. Both leaves stay pure readers via snapshot()/resolver(). Update import paths in proxy.rs, ssh.rs, and lib.rs from crate::provider_credentials to openshell_core::provider_credentials. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move the process-wide OCSF SandboxContext OnceLock + LazyLock fallback + getter from openshell-sandbox/src/lib.rs into a new openshell-ocsf::ctx module. The type already lives in openshell-ocsf, so its singleton lives next to it. Add openshell_ocsf::ctx::set_ctx() and openshell_ocsf::ctx::ctx(). The orchestrator (run_sandbox) now calls set_ctx during startup. Sandbox keeps a pub(crate) use openshell_ocsf::ctx::ctx as ocsf_ctx; re-export so the 138 existing crate::ocsf_ctx() call sites resolve unchanged. When the sandbox modules themselves migrate into the leaf crates, they'll import openshell_ocsf::ctx directly and the re-export goes away. Under Shape A neither leaf can depend on openshell-sandbox; both already depend on openshell-ocsf to construct events, so this adds no new dep edge. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Both prospective leaves (supervisor-networking and supervisor-process) need CachedOpenShellClient, AuthedChannel, and the connect/fetch helpers. Under Shape A the leaves cannot depend on openshell-sandbox, so the type has to live below them. openshell-core already pulls in tonic and miette; this enables tonic's channel/tls features and adds tokio as a direct dep. Updates all crate::grpc_client::* call sites in openshell-sandbox to openshell_core::grpc_client::*. No re-export shim — the call-site count was small enough to update directly. See architecture/plans/sandbox-split-design-choices.md for the full rationale and trade-offs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…l-sandbox DenialAggregator and FlushableDenialSummary belong with the proxy and L7 layer that emit denials. Moves the file into openshell-supervisor-networking; adds tokio as a regular dep there since DenialAggregator uses tokio::sync::mpsc. Drops the pub use openshell_core::DenialEvent re-export inside the moved file (no longer needed cross-crate). Updates bypass_monitor.rs, proxy.rs, and lib.rs to import openshell_core::DenialEvent directly. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

LogPushLayer is a process-side tracing layer that streams sandbox logs to the gateway via gRPC. Moves into openshell-supervisor-process; adds openshell-core, openshell-ocsf, tokio-stream, tracing, and tracing-subscriber as direct deps there. Updates the only external call site (openshell-sandbox/src/main.rs) to import from openshell_supervisor_process::log_push. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

bypass_monitor reads /dev/kmsg for nftables drop log lines and emits denial events. Pure process-side concern, called only from run_networking which spawns it on the netns. Moves into openshell-supervisor-process; all deps (openshell-core, openshell-ocsf, tokio, tracing) were already declared there. Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx() — first leaf-side caller to import the OCSF context singleton directly instead of going through openshell-sandbox's re-export. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

debug_rpc is the CLI subcommand handler that exercises authenticated gRPC calls (issue-token, refresh-token, get-config, etc.). Pure process-side concern, called only from openshell-sandbox/main.rs. Adds base64, hex, serde_json, sha2, and tonic (with channel/tls features) as direct deps on openshell-supervisor-process. Updates the single call site in main.rs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…sandbox supervisor_session opens a bidirectional gRPC stream that lets the gateway initiate shells inside the sandbox. Pure process-side concern, called only from run_process. Adds uuid as a direct dep on openshell-supervisor-process. Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx() — same pattern as bypass_monitor. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…shell-sandbox The MANAGED_CHILDREN set tracks PIDs of supervisor-spawned children (entrypoint + SSH sessions) so the orchestrator's SIGCHLD reaper can distinguish them from incidental zombies. Pure process-side concern, moves to openshell_supervisor_process::managed_children with three public fns: register, unregister, is_managed. Updates lib.rs reaper, process.rs, and ssh.rs to call through the new module path. Drops the now-unused HashSet import from lib.rs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…andbox Lift the process-only hardening pieces (landlock, seccomp, PreparedSandbox, prepare/enforce, log_sandbox_readiness, top-level apply, and apply_supervisor_startup_hardening) from crates/openshell-sandbox/src/sandbox/ to crates/openshell-supervisor-process/src/sandbox/. Leave netns.rs and nft_ruleset.rs in openshell-sandbox for now, since both eventual leaf crates (supervisor-networking and supervisor-process) read from NetworkNamespace and its final home is decided when run_networking and run_process are extracted. Replace crate::ocsf_ctx() shims in landlock.rs and the new linux/mod.rs with direct openshell_ocsf::ctx::ctx() calls. Update call sites in lib.rs, process.rs, and ssh.rs to import sandbox from openshell_supervisor_process while keeping the netns import unchanged. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move proposals.rs (AGENT_PROPOSALS_ENABLED OnceLock + agent_proposals_enabled reader + test_helpers::ProposalsFlagGuard) from openshell-supervisor-process to openshell-core so both eventual leaf crates can read it without depending on each other. The flag is process-wide singleton state initialised once during sandbox startup and read by both the policy.local route (networking-side) and the skills installer (process-side) — same shape as openshell_ocsf::ctx. Move the test-helpers Cargo feature alongside it: openshell-core gains the feature, openshell-supervisor-process loses it, and openshell-sandbox's dev-dependency now enables openshell-core/test-helpers. Update the sandbox re-export shim to point at openshell_core::proposals. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move NetworkNamespace and the nft_ruleset bypass-rule generator from crates/openshell-sandbox/src/sandbox/linux/ to crates/openshell-core/src/netns/. Both eventual leaf crates (supervisor-networking and supervisor-process) read from NetworkNamespace, so it must live somewhere both can depend on without violating the Shape A no-leaf-to-leaf rule. Replace crate::ocsf_ctx() shims in netns with direct openshell_ocsf::ctx::ctx() calls, matching the pattern used in already-migrated process modules. Update super::nft_ruleset references inside netns to nft_ruleset since the module is now a sibling sub-module of netns/mod.rs. Add openshell-ocsf and uuid as linux-only dependencies of openshell-core, and gate pub mod netns on target_os = "linux" since the implementation uses netlink, ip(8), and namespace fds. Delete the now-empty sandbox/{mod.rs, linux/mod.rs} stubs and update NetworkNamespace import paths in lib.rs and process.rs to point at openshell_core::netns. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…ll-sandbox Lift the entrypoint process spawn module and the embedded SSH server module into openshell-supervisor-process. openshell-sandbox now re-exports ProcessHandle/ProcessStatus and calls openshell_supervisor_process::ssh::run_ssh_server directly. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…om openshell-sandbox Lift the egress proxy, L7 enforcement modules, OPA engine, and policy.local advisor API into openshell-supervisor-networking. Move accompanying data files (sandbox-policy.rego), test fixtures (testdata/), and integration tests (system_inference, websocket_upgrade). Sandbox lib.rs now references these via openshell_supervisor_networking::* and ProxyHandle::start_with_bind_addr is exposed as pub for the orchestrator call site. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…orchestrator Move the symlink-resolver, policy poll loop, and denial-aggregator flush spawns out of run_process and into run_sandbox so run_process no longer needs OpaEngine, retained_proto, the local policy context, the sandbox name, the gateway endpoint for telemetry, the OCSF flag, or the denial receiver. These long-running orchestrator-owned tasks now live alongside the other sandbox-startup wiring, matching the design log decision in architecture/plans/sandbox-split-design-choices.md (Q5). Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Lift the workload supervision entry point (zombie reaper, SSH server spawn, supervisor session, entrypoint child spawn, exit-with-timeout) into its own module in openshell-supervisor-process. The orchestrator in openshell-sandbox now calls openshell_supervisor_process::run::run_process directly. With this move run_process names only types from openshell-core, openshell-ocsf, openshell-supervisor-process itself, std, and tokio — no openshell-supervisor-networking dependency. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

The supervisor seccomp prelude is part of "set up the workload-side process tree", not part of orchestration. Move the call site from run_sandbox into the top of run_process and drop the now-unused re-export from openshell-sandbox::lib. Timing is preserved: by the time the orchestrator calls run_process, run_networking has already returned, so netns + nftables setup is complete. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…rocess The PID-limit precondition is process-side: it gates whether the workload child can be spawned at all. Move the call from run_sandbox into the top of run_process, alongside the seccomp prelude. Same shape as the seccomp move — function already lives process-side, only the call site relocates. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…rate The sandbox-user check is a precondition for privilege-dropping the workload child; it has no relevance to networking. Move the function next to drop_privileges in openshell-supervisor-process::process and call it from the top of run_process. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Creating and chowning read_write directories is workload-side preparation, not orchestration. Move prepare_filesystem and its prepare_read_write_path helper (plus tests) into openshell-supervisor-process::process and call from run_process, alongside validate_sandbox_user. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…cess The eager initial-settings fetch + agent skill install is process-side: the install materializes files the workload's filesystem sees. The orchestrator still owns the AGENT_PROPOSALS_ENABLED OnceLock init because the policy poll loop also reads it; only the early fetch and install hop into run_process. Behavior unchanged. Best-effort: any RPC or install failure is logged but does not fail startup. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move the PolicyLocalContext construction from run_sandbox into run_networking. The orchestrator was building it solely to thread it into the networking leaf and to share it with the policy poll loop; now run_networking builds it from inputs it already takes (retained_proto, openshell_endpoint, sandbox_name|sandbox_id) and exposes it on the returned Networking struct. The orchestrator's poll loop now grabs the Arc clone from networking.policy_local_ctx, so the orchestrator no longer imports openshell_supervisor_network::policy_local. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Add a --mode flag (default "network,process") that selects which supervisor leaves run in the current process. Two new shapes are unlocked without splitting the binary: --mode=network # network-only sidecar --mode=process # process-only supervisor --mode=network,process # combined (default; current behavior) In network-only mode the orchestrator skips run_process and waits on SIGINT/SIGTERM before tearing down the proxy. The entrypoint PID stays at 0 for the lifetime of the process, which silently degrades the proxy's binary-identity TOFU and the bypass monitor's PID enrichment; this is correct in a split-pod topology where the workload's /proc lives in another pod. In process-only mode run_networking is skipped entirely. SSH sessions get no proxy URL, no netns FD, and no CA paths, matching what a split-pod consumer would expect when network enforcement is delegated to a sidecar. The policy poll loop continues to run unconditionally; its OPA-reload and policy.local hooks already gate on the resources only present when network is enabled, and the env-refresh / proposals-toggle hooks remain active in process mode. Closes a step toward the RFC-0001 supervisor topology proposed in issue NVIDIA#1305 by drew. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

copy-pr-bot · 2026-06-01T13:03:40Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-01T13:03:50Z

All contributors have signed the DCO ✍️ ✅
_{Posted by the DCO Assistant Lite bot.}

rrhubenov · 2026-06-01T13:04:57Z

I have read the DCO document and I hereby sign the DCO.

rrhubenov · 2026-06-01T13:05:15Z

recheck

DenialEvent is only emitted and consumed inside openshell-supervisor-network (proxy, bypass monitor, denial aggregator). It never crossed the leaf boundary, so the earlier lift to openshell-core was speculative. Move it back into the network crate where its only callers live. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

procfs was lifted to openshell-core under the assumption it would be shared cross-leaf, but on the current branch all three callers (bypass_monitor, identity, proxy) live in openshell-supervisor-network. No file in openshell-supervisor-process imports it. Move the module to the network crate and drop sha2/hex from openshell-core, which were pulled in only for procfs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

derekwaynecarr · 2026-06-01T14:35:13Z

/ok to test 1e22061

copy-pr-bot · 2026-06-01T14:35:18Z

/ok to test 1e22061

@derekwaynecarr, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

derekwaynecarr · 2026-06-01T14:37:12Z

/ok to test 1e22061

copy-pr-bot · 2026-06-01T14:37:17Z

/ok to test 1e22061

@derekwaynecarr, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

TaylorMutch · 2026-06-01T15:46:15Z

/ok to test 33e00fd

github-actions · 2026-06-01T18:09:06Z

Label test:e2e applied for 33e00fd. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

drew

This is looking good, thanks for taking this on.

I haven't renamed the openshell-sandbox crate to openshell-supervisor since if done fully (e.g. renaming the binary as well), would result in a breaking change which I see as out of scope for this PR.

Good call, we can figure this out separately.

One area I focused the review on were packages moved to openshell-core. Generally these looked good. Moving grpc client helpers, policy infra, provider components to core all make sense. Left a comment around the netns/nft_ruleset moves.

drew · 2026-06-02T06:11:59Z

+        let resolve_proto = proto.clone();
+        let resolve_pid = entrypoint_pid.clone();
+        tokio::spawn(async move {
+            let pid = resolve_pid.load(Ordering::Acquire);


(from codex review)

run_networking starts before run_process stores the child PID, so loading entrypoint_pid once here usually captures 0. The retry loop then keeps probing /proc/0/root and never reloads OPA with the real workload root, which regresses symlinked binary policies such as /usr/bin/python3.

Suggested fix: wait for a non-zero PID separately, then start the existing /proc//root probe loop. That keeps the old 5s retry window focused on “is proc root ready?” instead of spending it before the process has even spawned.

let mut pid = 0; for attempt in 1..=20 { pid = resolve_pid.load(Ordering::Acquire); if pid != 0 { break; } debug!(attempt, "Entrypoint PID not available yet, retrying symlink resolution"); tokio::time::sleep(Duration::from_millis(250)).await; } if pid == 0 { warn!("Entrypoint PID not available; binary symlink resolution skipped"); return; } let probe_path = format!("/proc/{pid}/root/"); for attempt in 1..=10 { tokio::time::sleep(Duration::from_millis(500)).await; if std::fs::metadata(&probe_path).is_ok() { // existing reload_from_proto_with_pid(&resolve_proto, pid) block return; } }

drew · 2026-06-02T06:14:02Z

 name = "openshell-sandbox"
 path = "src/main.rs"

 [dependencies]


Now that openshell-sandbox is acting as the orchestrator over the process/network leaf crates, this manifest still appears to carry a lot of leaf-only dependencies from the old monolith (regorus, russh, rcgen, tokio-rustls, ipnet, apollo-parser, openshell-router, etc.).

Can we prune the direct deps here to only what the orchestrator actually uses?

drew · 2026-06-02T06:14:49Z

drew · 2026-06-02T06:21:06Z

 pub mod metadata;
 pub mod net;
+#[cfg(target_os = "linux")]
+pub mod netns;


netns/nft_ruleset probably don't belong in openshell-core. These modules are privileged Linux supervisor runtime implementation: they create namespaces, manage veths, invoke ip/nsenter/nft, install bypass rules, and emit sandbox OCSF events. That makes core own sandbox enforcement machinery, not just shared types/config.

Can we keep this out of core? The process leaf appears to only need the namespace fd for setns, so one option is for the network/orchestrator side to own the NetworkNamespace RAII handle and pass an Option<i32> fd into run_process. If more sharing is needed, a small supervisor-runtime/netns crate would preserve the no process <-> network dependency rule without expanding openshell-core into privileged runtime code.

drew · 2026-06-02T06:27:58Z

/ok to test 8526c54

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

The procfs/bypass_monitor/proxy test modules use libc::{fork, exec, fcntl, kill, waitpid} but the dep wasn't declared in this crate's Cargo.toml. It was previously satisfied transitively when these modules lived in openshell-core; the move left the test target unable to resolve libc. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

rrhubenov added 30 commits June 1, 2026 11:30

style: rustfmt import ordering

b9dc830

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

rrhubenov added 8 commits June 1, 2026 14:40

style(supervisor-process): rustfmt long debug! line

76a89cf

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

rrhubenov requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 1, 2026 13:03

rrhubenov added 2 commits June 1, 2026 16:42

drew added the test:e2e Requires end-to-end coverage label Jun 1, 2026

drew reviewed Jun 2, 2026

View reviewed changes

rrhubenov added 2 commits June 2, 2026 10:14

style(supervisor-network): run cargo fmt

bbefa45

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

rrhubenov force-pushed the refactor/split-sandbox branch from 8526c54 to 7a3fdd7 Compare June 2, 2026 13:00

Conversation

rrhubenov commented Jun 1, 2026

Summary

Review suggestions:

Related Issue

Changes

Testing

Checklist

Uh oh!

copy-pr-bot Bot commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rrhubenov commented Jun 1, 2026

Uh oh!

rrhubenov commented Jun 1, 2026

Uh oh!

derekwaynecarr commented Jun 1, 2026

Uh oh!

copy-pr-bot Bot commented Jun 1, 2026

Uh oh!

derekwaynecarr commented Jun 1, 2026

Uh oh!

copy-pr-bot Bot commented Jun 1, 2026

Uh oh!

TaylorMutch commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

drew left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

drew Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

drew Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

drew Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

drew Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

drew commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented Jun 1, 2026 •

edited

Loading

drew left a comment •

edited

Loading