Skip to content

refactor(openshell-sandbox): Split sandbox into process and network subcrates.#1650

Open
rrhubenov wants to merge 49 commits into
NVIDIA:mainfrom
rrhubenov:refactor/split-sandbox
Open

refactor(openshell-sandbox): Split sandbox into process and network subcrates.#1650
rrhubenov wants to merge 49 commits into
NVIDIA:mainfrom
rrhubenov:refactor/split-sandbox

Conversation

@rrhubenov
Copy link
Copy Markdown

Summary

This PR is an implementation of #1305.
For full context on why this is required, please read #981 and #1305 in that order.

In summary, this split has been proposed for 2 main reasons:

  • As a precursor to a possible split-pod topology of the supervisor (previously sandbox)
  • In general a better decomposition of the 2 main "categories" of functionality ("process" and "network")

Although this change has been mostly evaluated from the perspective of the "split pod" topology, this splitting allows for a more agile topology regardless of environment. If for any reason the "process" and "network" functionality need to work as standalone processes, this split is required.

Review suggestions:

This refactor is hard to review due to a big collection of changes that are mostly movements of files and/or code blocks. For this reason, I suggest that reviewers take the approach of reviewing the changes commit by commit. I have taken care that each commit is as small as possible, moving only one functionality (module, function, etc.) in one commit.

Bear in mind that during the implementation, one very important architectural decision is that the 2 new crates (supervisor-network and supervisor-process) do not have a dependency on one another.
Only the "orcherstrator" crate, that being the openshell-supervisor should have a dependency on both of them.
One other important point was that no behavioural changes should occur as a side-effect from this refactoring.

NOTE:
I haven't renamed the openshell-sandbox crate to openshell-supervisor since if done fully (e.g. renaming the binary as well), would result in a breaking change which I see as out of scope for this PR.

I have also tried to minimise the amount of architectural design decisions in this PR, since it is already big enough.

Related Issue

Closes #1305

Changes

Testing

  • mise run pre-commit passes
  • Unit tests added/updated (no update was required due to this change being only a refactor) N/A
  • E2E tests added/updated (if applicable) N/A

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable) N/A - review determined that docs do not go into implementation details

rrhubenov added 30 commits June 1, 2026 11:30
Lifts TLS state generation, network namespace setup, proxy startup,
bypass monitor spawn, and SSH-side proxy URL / netns FD computation
out of run_sandbox into a sibling async fn `run_networking` that
returns a Networking struct. The identity cache moves with it (only
consumed by the proxy). Entrypoint PID allocation moves just above
the call site so it can be passed in.

No behavior changes — same OCSF emits, same async order, same RAII
lifetimes for the proxy and bypass-monitor handles, now held by the
returned Networking value in run_sandbox's frame.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Lifts the post-networking tail of `run_sandbox` (zombie reaper, SSH
server, supervisor session, process spawn, OPA probe, policy poll loop,
denial aggregator, wait/exit) into a sibling async fn `run_process`.

Also moves network namespace creation out of `run_networking` into a new
`create_netns_for_proxy` helper invoked from `run_sandbox`, so
`run_networking` is purely the proxy component (OPA evaluation, TLS
interception, credential injection, inference routing, gRPC control
API). The netns is then borrowed into both `run_networking` and
`run_process`.

No behavior change.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…ell-supervisor-process crates

Add empty placeholder crates that subsequent commits will populate as the
sandbox decomposition proceeds. Both crates compile clean as part of the
workspace and are picked up automatically by the existing
`members = ["crates/*"]` glob.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
The DenialEvent struct is emitted by both the proxy/L7 layer (networking-side)
and the bypass monitor (process-side), and crosses the run_networking ->
run_process API boundary. Move it to openshell-core so the eventual
supervisor-networking and supervisor-process crates can both reference it
without depending on each other. DenialAggregator and the channel/flush
helpers stay in openshell-sandbox for now.

A thin `pub use openshell_core::DenialEvent;` re-export from
denial_aggregator.rs keeps every existing `crate::denial_aggregator::DenialEvent`
call site resolving without further edits.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move the lexical path-normalization helper from openshell-policy to
openshell-core::paths so it can be reached from crates that sit below
openshell-policy in the dependency graph. openshell-policy keeps its
existing public API via a `pub use` re-export, so all current call sites
(e.g. openshell-sandbox/src/policy.rs) continue to resolve unchanged.

This is a prerequisite for lifting openshell-sandbox/src/policy.rs into
openshell-core: that file's `From<ProtoFilesystemPolicy>` impl calls
normalize_path, and lifting it as-is would cycle through openshell-policy.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move openshell-sandbox/src/policy.rs (SandboxPolicy, NetworkPolicy,
ProxyPolicy, FilesystemPolicy, LandlockPolicy, ProcessPolicy, NetworkMode,
LandlockCompatibility, plus their Proto* TryFrom/From impls) to
openshell-core/src/policy.rs.

Both prospective supervisor leaves (networking and process) dispatch on
SandboxPolicy. Hosting it in openshell-core lets either leaf reach for it
without depending on the other (or on the future orchestrator).

The From<ProtoFilesystemPolicy> impl now calls the in-crate
openshell_core::paths::normalize_path lifted in the previous commit, which
is what made this move cycle-free.

Update all crate::policy::* call sites in openshell-sandbox to
openshell_core::policy::*.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
child_env (proxy_env_vars, tls_env_vars) is process-side only — consumed
by process.rs and ssh.rs. With the orchestrator staying in
openshell-sandbox (Shape A), openshell-sandbox depends on the new leaf
crates, so process-only modules can land in
openshell-supervisor-process directly.

Add openshell-supervisor-process as a path dependency of
openshell-sandbox. Update process.rs and ssh.rs to import from
openshell_supervisor_process::child_env.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move the static skills installer (and its embedded resource directory)
out of openshell-sandbox into openshell-supervisor-process. The module
is process-side only — invoked once during sandbox start to drop
agent skill files into the workspace — and has no cross-leaf consumers.

Adds miette as a dependency and tempfile as a dev-dependency on
openshell-supervisor-process.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…ll-sandbox

Move the mechanistic mapper (HTTP method/path → operation classifier
that derives policy proposals from connection summaries) out of
openshell-sandbox into openshell-supervisor-networking. Single internal
caller (run_policy_poll_loop in lib.rs) and only depends on
openshell-core + tracing — no cross-leaf entanglement.

First population of the openshell-supervisor-networking crate; adds
openshell-core and tracing as dependencies.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move procfs (PID lookups, ancestor walking, /proc/net/tcp socket-owner
resolution, file SHA256 hashing) from openshell-sandbox into
openshell-core. The module is consumed cross-leaf — by bypass_monitor
on the process side and by identity / proxy on the networking side —
so it has to sit below both leaves.

Adds tracing, sha2, and hex as dependencies on openshell-core.
Updates the three call sites in openshell-sandbox to import from
openshell_core::procfs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move BinaryIdentityCache (path → SHA256 cache used to identify the
process behind an outbound connection) from openshell-sandbox into
openshell-supervisor-networking. The cache is consumed only by the
networking-side proxy and the orchestrator; with procfs already in
openshell-core there are no remaining cross-leaf dependencies.

Adds miette as a dependency and tempfile as a dev-dependency on
openshell-supervisor-networking. Adds a Default impl for
BinaryIdentityCache to satisfy clippy::new_without_default now that
the type is publicly exposed across crates.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…l-sandbox

Move AGENT_PROPOSALS_ENABLED, agent_proposals_enabled(), and the
test-only ProposalsFlagGuard out of openshell-sandbox into
openshell-supervisor-process::proposals. The flag is read only by the
process-side policy_local route handler and the orchestrator; lifting
it to openshell-core would have made core carry sandbox-owned runtime
state without buying anything.

The test-only ProposalsFlagGuard is still consumed from networking-side
l7/rest tests today (until the wider Q2 OCSF/gRPC injection work lands).
Expose it via a new optional `test-helpers` feature on
openshell-supervisor-process so test crates opt in explicitly without
pulling tokio sync primitives into production builds.

openshell-sandbox keeps its existing crate-private path
(`crate::AGENT_PROPOSALS_ENABLED`, `crate::test_helpers`) via re-exports
so call sites and tests are unchanged.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move crates/openshell-sandbox/src/secrets.rs to crates/openshell-core/src/secrets.rs so both supervisor leaves can reach SecretResolver and the placeholder helpers without depending on openshell-sandbox.

Add base64 to openshell-core deps (only stdlib + base64 are used). Promote previously pub(crate) constructors and methods on SecretResolver to pub since cross-crate callers (provider_credentials, proxy/L7 tests) now name them across the crate boundary. Update import paths in proxy.rs, l7/{rest,relay,websocket}.rs, and provider_credentials.rs from crate::secrets to openshell_core::secrets.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move crates/openshell-sandbox/src/provider_credentials.rs to crates/openshell-core/src/provider_credentials.rs. Both supervisor leaves now name ProviderCredentialState in their function signatures (run_networking takes &ProviderCredentialState, run_process takes ProviderCredentialState by value), and under Shape A leaves can't depend on openshell-sandbox, so the type must live in openshell-core.

The orchestrator (run_sandbox in openshell-sandbox) remains the only writer: it constructs ProviderCredentialState::from_environment and the policy poll loop calls install_environment on credential rotation. Both leaves stay pure readers via snapshot()/resolver().

Update import paths in proxy.rs, ssh.rs, and lib.rs from crate::provider_credentials to openshell_core::provider_credentials.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move the process-wide OCSF SandboxContext OnceLock + LazyLock fallback + getter from openshell-sandbox/src/lib.rs into a new openshell-ocsf::ctx module. The type already lives in openshell-ocsf, so its singleton lives next to it.

Add openshell_ocsf::ctx::set_ctx() and openshell_ocsf::ctx::ctx(). The orchestrator (run_sandbox) now calls set_ctx during startup. Sandbox keeps a pub(crate) use openshell_ocsf::ctx::ctx as ocsf_ctx; re-export so the 138 existing crate::ocsf_ctx() call sites resolve unchanged.

When the sandbox modules themselves migrate into the leaf crates, they'll import openshell_ocsf::ctx directly and the re-export goes away.

Under Shape A neither leaf can depend on openshell-sandbox; both already depend on openshell-ocsf to construct events, so this adds no new dep edge.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Both prospective leaves (supervisor-networking and supervisor-process)
need CachedOpenShellClient, AuthedChannel, and the connect/fetch
helpers. Under Shape A the leaves cannot depend on openshell-sandbox,
so the type has to live below them. openshell-core already pulls in
tonic and miette; this enables tonic's channel/tls features and adds
tokio as a direct dep.

Updates all crate::grpc_client::* call sites in openshell-sandbox to
openshell_core::grpc_client::*. No re-export shim — the call-site
count was small enough to update directly.

See architecture/plans/sandbox-split-design-choices.md for the full
rationale and trade-offs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…l-sandbox

DenialAggregator and FlushableDenialSummary belong with the proxy and L7
layer that emit denials. Moves the file into openshell-supervisor-networking;
adds tokio as a regular dep there since DenialAggregator uses
tokio::sync::mpsc.

Drops the pub use openshell_core::DenialEvent re-export inside the moved
file (no longer needed cross-crate). Updates bypass_monitor.rs, proxy.rs,
and lib.rs to import openshell_core::DenialEvent directly.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
LogPushLayer is a process-side tracing layer that streams sandbox logs
to the gateway via gRPC. Moves into openshell-supervisor-process; adds
openshell-core, openshell-ocsf, tokio-stream, tracing, and
tracing-subscriber as direct deps there.

Updates the only external call site (openshell-sandbox/src/main.rs) to
import from openshell_supervisor_process::log_push.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
bypass_monitor reads /dev/kmsg for nftables drop log lines and emits
denial events. Pure process-side concern, called only from
run_networking which spawns it on the netns. Moves into
openshell-supervisor-process; all deps (openshell-core, openshell-ocsf,
tokio, tracing) were already declared there.

Replaces crate::ocsf_ctx() shim calls inside the moved file with
openshell_ocsf::ctx::ctx() — first leaf-side caller to import the OCSF
context singleton directly instead of going through openshell-sandbox's
re-export.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
debug_rpc is the CLI subcommand handler that exercises authenticated
gRPC calls (issue-token, refresh-token, get-config, etc.). Pure
process-side concern, called only from openshell-sandbox/main.rs.

Adds base64, hex, serde_json, sha2, and tonic (with channel/tls
features) as direct deps on openshell-supervisor-process. Updates the
single call site in main.rs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…sandbox

supervisor_session opens a bidirectional gRPC stream that lets the
gateway initiate shells inside the sandbox. Pure process-side concern,
called only from run_process. Adds uuid as a direct dep on
openshell-supervisor-process.

Replaces crate::ocsf_ctx() shim calls inside the moved file with
openshell_ocsf::ctx::ctx() — same pattern as bypass_monitor.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…shell-sandbox

The MANAGED_CHILDREN set tracks PIDs of supervisor-spawned children
(entrypoint + SSH sessions) so the orchestrator's SIGCHLD reaper can
distinguish them from incidental zombies. Pure process-side concern,
moves to openshell_supervisor_process::managed_children with three
public fns: register, unregister, is_managed.

Updates lib.rs reaper, process.rs, and ssh.rs to call through the new
module path. Drops the now-unused HashSet import from lib.rs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…andbox

Lift the process-only hardening pieces (landlock, seccomp, PreparedSandbox,
prepare/enforce, log_sandbox_readiness, top-level apply, and
apply_supervisor_startup_hardening) from crates/openshell-sandbox/src/sandbox/
to crates/openshell-supervisor-process/src/sandbox/.

Leave netns.rs and nft_ruleset.rs in openshell-sandbox for now, since both
eventual leaf crates (supervisor-networking and supervisor-process) read from
NetworkNamespace and its final home is decided when run_networking and
run_process are extracted.

Replace crate::ocsf_ctx() shims in landlock.rs and the new linux/mod.rs with
direct openshell_ocsf::ctx::ctx() calls. Update call sites in lib.rs,
process.rs, and ssh.rs to import sandbox from openshell_supervisor_process
while keeping the netns import unchanged.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move proposals.rs (AGENT_PROPOSALS_ENABLED OnceLock + agent_proposals_enabled
reader + test_helpers::ProposalsFlagGuard) from openshell-supervisor-process
to openshell-core so both eventual leaf crates can read it without depending
on each other.

The flag is process-wide singleton state initialised once during sandbox
startup and read by both the policy.local route (networking-side) and the
skills installer (process-side) — same shape as openshell_ocsf::ctx.

Move the test-helpers Cargo feature alongside it: openshell-core gains the
feature, openshell-supervisor-process loses it, and openshell-sandbox's
dev-dependency now enables openshell-core/test-helpers. Update the sandbox
re-export shim to point at openshell_core::proposals.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move NetworkNamespace and the nft_ruleset bypass-rule generator from
crates/openshell-sandbox/src/sandbox/linux/ to crates/openshell-core/src/netns/.
Both eventual leaf crates (supervisor-networking and supervisor-process) read
from NetworkNamespace, so it must live somewhere both can depend on without
violating the Shape A no-leaf-to-leaf rule.

Replace crate::ocsf_ctx() shims in netns with direct openshell_ocsf::ctx::ctx()
calls, matching the pattern used in already-migrated process modules. Update
super::nft_ruleset references inside netns to nft_ruleset since the module
is now a sibling sub-module of netns/mod.rs.

Add openshell-ocsf and uuid as linux-only dependencies of openshell-core, and
gate pub mod netns on target_os = "linux" since the implementation uses
netlink, ip(8), and namespace fds. Delete the now-empty sandbox/{mod.rs,
linux/mod.rs} stubs and update NetworkNamespace import paths in lib.rs and
process.rs to point at openshell_core::netns.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…ll-sandbox

Lift the entrypoint process spawn module and the embedded SSH server
module into openshell-supervisor-process. openshell-sandbox now
re-exports ProcessHandle/ProcessStatus and calls
openshell_supervisor_process::ssh::run_ssh_server directly.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…om openshell-sandbox

Lift the egress proxy, L7 enforcement modules, OPA engine, and policy.local
advisor API into openshell-supervisor-networking. Move accompanying data
files (sandbox-policy.rego), test fixtures (testdata/), and integration
tests (system_inference, websocket_upgrade). Sandbox lib.rs now references
these via openshell_supervisor_networking::* and ProxyHandle::start_with_bind_addr
is exposed as pub for the orchestrator call site.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…orchestrator

Move the symlink-resolver, policy poll loop, and denial-aggregator flush
spawns out of run_process and into run_sandbox so run_process no longer
needs OpaEngine, retained_proto, the local policy context, the sandbox
name, the gateway endpoint for telemetry, the OCSF flag, or the denial
receiver. These long-running orchestrator-owned tasks now live alongside
the other sandbox-startup wiring, matching the design log decision in
architecture/plans/sandbox-split-design-choices.md (Q5).

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Lift the workload supervision entry point (zombie reaper, SSH server
spawn, supervisor session, entrypoint child spawn, exit-with-timeout)
into its own module in openshell-supervisor-process. The orchestrator
in openshell-sandbox now calls openshell_supervisor_process::run::run_process
directly. With this move run_process names only types from openshell-core,
openshell-ocsf, openshell-supervisor-process itself, std, and tokio —
no openshell-supervisor-networking dependency.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
rrhubenov added 8 commits June 1, 2026 14:40
The supervisor seccomp prelude is part of "set up the workload-side
process tree", not part of orchestration. Move the call site from
run_sandbox into the top of run_process and drop the now-unused
re-export from openshell-sandbox::lib.

Timing is preserved: by the time the orchestrator calls run_process,
run_networking has already returned, so netns + nftables setup is
complete.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…rocess

The PID-limit precondition is process-side: it gates whether the workload
child can be spawned at all. Move the call from run_sandbox into the top
of run_process, alongside the seccomp prelude. Same shape as the seccomp
move — function already lives process-side, only the call site relocates.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…rate

The sandbox-user check is a precondition for privilege-dropping the
workload child; it has no relevance to networking. Move the function
next to drop_privileges in openshell-supervisor-process::process and
call it from the top of run_process.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Creating and chowning read_write directories is workload-side
preparation, not orchestration. Move prepare_filesystem and its
prepare_read_write_path helper (plus tests) into
openshell-supervisor-process::process and call from run_process,
alongside validate_sandbox_user.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
…cess

The eager initial-settings fetch + agent skill install is process-side:
the install materializes files the workload's filesystem sees. The
orchestrator still owns the AGENT_PROPOSALS_ENABLED OnceLock init
because the policy poll loop also reads it; only the early fetch and
install hop into run_process.

Behavior unchanged. Best-effort: any RPC or install failure is logged
but does not fail startup.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Move the PolicyLocalContext construction from run_sandbox into
run_networking. The orchestrator was building it solely to thread it into
the networking leaf and to share it with the policy poll loop; now
run_networking builds it from inputs it already takes (retained_proto,
openshell_endpoint, sandbox_name|sandbox_id) and exposes it on the
returned Networking struct.

The orchestrator's poll loop now grabs the Arc clone from
networking.policy_local_ctx, so the orchestrator no longer imports
openshell_supervisor_network::policy_local.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Add a --mode flag (default "network,process") that selects which
supervisor leaves run in the current process. Two new shapes are
unlocked without splitting the binary:

  --mode=network             # network-only sidecar
  --mode=process             # process-only supervisor
  --mode=network,process     # combined (default; current behavior)

In network-only mode the orchestrator skips run_process and waits on
SIGINT/SIGTERM before tearing down the proxy. The entrypoint PID stays
at 0 for the lifetime of the process, which silently degrades the
proxy's binary-identity TOFU and the bypass monitor's PID enrichment;
this is correct in a split-pod topology where the workload's /proc
lives in another pod.

In process-only mode run_networking is skipped entirely. SSH sessions
get no proxy URL, no netns FD, and no CA paths, matching what a
split-pod consumer would expect when network enforcement is delegated
to a sidecar.

The policy poll loop continues to run unconditionally; its OPA-reload
and policy.local hooks already gate on the resources only present when
network is enabled, and the env-refresh / proposals-toggle hooks
remain active in process mode.

Closes a step toward the RFC-0001 supervisor topology proposed in
issue NVIDIA#1305 by drew.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@rrhubenov
Copy link
Copy Markdown
Author

I have read the DCO document and I hereby sign the DCO.

@rrhubenov
Copy link
Copy Markdown
Author

recheck

rrhubenov added 2 commits June 1, 2026 16:42
DenialEvent is only emitted and consumed inside openshell-supervisor-network
(proxy, bypass monitor, denial aggregator). It never crossed the leaf
boundary, so the earlier lift to openshell-core was speculative. Move it
back into the network crate where its only callers live.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
procfs was lifted to openshell-core under the assumption it would be
shared cross-leaf, but on the current branch all three callers
(bypass_monitor, identity, proxy) live in openshell-supervisor-network.
No file in openshell-supervisor-process imports it. Move the module to
the network crate and drop sha2/hex from openshell-core, which were
pulled in only for procfs.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
@derekwaynecarr
Copy link
Copy Markdown
Collaborator

/ok to test 1e22061

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 1, 2026

/ok to test 1e22061

@derekwaynecarr, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@derekwaynecarr
Copy link
Copy Markdown
Collaborator

/ok to test 1e22061

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 1, 2026

/ok to test 1e22061

@derekwaynecarr, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@TaylorMutch
Copy link
Copy Markdown
Collaborator

/ok to test 33e00fd

@drew drew added the test:e2e Requires end-to-end coverage label Jun 1, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Label test:e2e applied for 33e00fd. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

Copy link
Copy Markdown
Collaborator

@drew drew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good, thanks for taking this on.

I haven't renamed the openshell-sandbox crate to openshell-supervisor since if done fully (e.g. renaming the binary as well), would result in a breaking change which I see as out of scope for this PR.

Good call, we can figure this out separately.

One area I focused the review on were packages moved to openshell-core. Generally these looked good. Moving grpc client helpers, policy infra, provider components to core all make sense. Left a comment around the netns/nft_ruleset moves.

let resolve_proto = proto.clone();
let resolve_pid = entrypoint_pid.clone();
tokio::spawn(async move {
let pid = resolve_pid.load(Ordering::Acquire);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(from codex review)

run_networking starts before run_process stores the child PID, so loading entrypoint_pid once here usually captures 0. The retry loop then keeps probing /proc/0/root and never reloads OPA with the real workload root, which regresses symlinked binary policies such as /usr/bin/python3.

Suggested fix: wait for a non-zero PID separately, then start the existing /proc//root probe loop. That keeps the old 5s retry window focused on “is proc root ready?” instead of spending it before the process has even spawned.

  let mut pid = 0;
  for attempt in 1..=20 {
      pid = resolve_pid.load(Ordering::Acquire);
      if pid != 0 {
          break;
      }
      debug!(attempt, "Entrypoint PID not available yet, retrying symlink resolution");
      tokio::time::sleep(Duration::from_millis(250)).await;
  }

  if pid == 0 {
      warn!("Entrypoint PID not available; binary symlink resolution skipped");
      return;
  }

  let probe_path = format!("/proc/{pid}/root/");
  for attempt in 1..=10 {
      tokio::time::sleep(Duration::from_millis(500)).await;
      if std::fs::metadata(&probe_path).is_ok() {
          // existing reload_from_proto_with_pid(&resolve_proto, pid) block
          return;
      }
  }

name = "openshell-sandbox"
path = "src/main.rs"

[dependencies]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that openshell-sandbox is acting as the orchestrator over the process/network leaf crates, this manifest still appears to carry a lot of leaf-only dependencies from the old monolith (regorus, russh, rcgen, tokio-rustls, ipnet, apollo-parser, openshell-router, etc.).

Can we prune the direct deps here to only what the orchestrator actually uses?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

pub mod metadata;
pub mod net;
#[cfg(target_os = "linux")]
pub mod netns;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

netns/nft_ruleset probably don't belong in openshell-core. These modules are privileged Linux supervisor runtime implementation: they create namespaces, manage veths, invoke ip/nsenter/nft, install bypass rules, and emit sandbox OCSF events. That makes core own sandbox enforcement machinery, not just shared types/config.

Can we keep this out of core? The process leaf appears to only need the namespace fd for setns, so one option is for the network/orchestrator side to own the NetworkNamespace RAII handle and pass an Option<i32> fd into run_process. If more sharing is needed, a small supervisor-runtime/netns crate would preserve the no process <-> network dependency rule without expanding openshell-core into privileged runtime code.

@drew
Copy link
Copy Markdown
Collaborator

drew commented Jun 2, 2026

/ok to test 8526c54

rrhubenov added 2 commits June 2, 2026 10:14
Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
The procfs/bypass_monitor/proxy test modules use libc::{fork, exec,
fcntl, kill, waitpid} but the dep wasn't declared in this crate's
Cargo.toml. It was previously satisfied transitively when these
modules lived in openshell-core; the move left the test target
unable to resolve libc.

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>
@rrhubenov rrhubenov force-pushed the refactor/split-sandbox branch from 8526c54 to 7a3fdd7 Compare June 2, 2026 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: decompose openshell-sandbox into independently deployable proxy and runtime crates

4 participants