fix(rlm): harden executor surface, agent search envelope, and OpenCode hook#870
Merged
Conversation
Three pre-existing failures on main blocking the pre-commit hook; this branch needs them resolved before adding RLM hardening commits. 1. Five orchestrator integration tests construct OrchestratorConfig literally and were not updated when GiteaSkillRepoConfig was added in 2d41798. Add gitea_skill_repo: None to each fixture. 2. main.rs:4240 was added in 2d41798 (#1486) calling api.get_thesaurus(&role_name) where role_name is &RoleName but the method expects &str. Also moved role_name twice into SearchQuery. Smallest fix: clone() at the SearchQuery moves and pass role_name.as_str() to get_thesaurus. The whole block is replaced wholesale in the upcoming concepts_matched helper extraction, so this is a temporary unblocker. 3. terraphim_service llm::router_config tests share process env: test_env_overrides sets LLM_PROXY_URL without serialising against test_merged_config_from_role, so parallel execution caused the latter to read the override and fail. Mark both #[serial] and add a defensive remove_var at the start of test_merged_config_from_role. cargo check --workspace --all-targets and cargo test -p terraphim_service --lib llm::router_config are clean after these changes.
…pshots LocalExecutor now respects the caller's ExecutionContext.timeout_ms instead of a hardcoded 30s, and spawns children with kill_on_drop(true) so timed-out processes are reaped instead of leaking. Snapshot operations now return RlmError::NotSupported instead of fabricating a SnapshotId for a no-op, giving callers an honest signal that VM-style state versioning is unavailable on this backend. Adds ExecutionEnvironment::end_session(&SessionId) trait method with a default Ok(()) impl, so backends with per-session resources (Docker) can override without disturbing existing impls. Removes the unused output_dir field and the unnecessary SessionId clone. Tests: 7 unit tests in local.rs (timeout-honoured, child-killed, snapshots-not-supported, end-session-default), 1 in error.rs (NotSupported error code, display, retryability). Refs review of e4d896d..4442671
Closes the TOCTOU race in ensure_container by replacing the parking_lot::RwLock<HashMap<...>> with a DashMap<SessionId, Arc<tokio::sync::Mutex<Option<String>>>>. Concurrent execute_* calls for the same session_id now serialise on the per-session Mutex; only one container is created. Adds DockerExecutor::release_session_container(&SessionId) inherent method, mirroring FirecrackerExecutor::release_session_vm. Also wires ExecutionEnvironment::end_session to call it, so the supervisor can release containers via the trait when needed. Adds default_host_config() applying the permissive profile per 2026-05-15 design decision: 512 MiB memory cap, 256 PIDs cap, all caps dropped, network=bridge (LLM bridge / pip use), readonly_rootfs=false (Python /tmp writes). Replaces sleep 3600 keepalive with sleep infinity so the container outlives any reasonable session and is torn down only by release_session_container, cleanup, or Drop. Replaces snapshot method bodies with RlmError::NotSupported. Replaces `as i32` cast on bollard's i64 exit code with i32::try_from. Removes the #[allow(dead_code)] attribute on the struct (project rule: no dead code) and the now-redundant config field. Tests: 8 in docker.rs (6 daemon-free, 2 gated on --ignored): host config profile, snapshots-not-supported, release-unknown-returns-none, release-removes-and-recreates, concurrent-ensure-no-leak. Refs review of e4d896d..4442671
Three fixes to select_executor: 1. The E2B arm previously logged "Selected E2B backend" then fell through without continuing, misleading operators reading logs. Now logs at debug! and explicitly continues to the next backend. 2. The Docker arm previously short-circuited the entire fallback chain if DockerExecutor::new returned Err (e.g. CLI present, daemon unreachable). Now logs at warn!, pushes to `tried`, and continues to the next backend so Local remains selectable. 3. The Local arm previously logged "Selected Local backend (no isolation)" at info! - falling back to no-isolation is a security-posture downgrade and should be visible in production logs. Now logs at warn! including the `tried` context so operators can see why the secure backends were unavailable. Also caches `is_docker_available()` once per call (avoiding repeated ~50-100ms shell-outs to `docker --version`) and tightens the helper log to "CLI not present" since it does not actually probe the daemon. Tests: 2 new unit tests in mod.rs (Local-preference returns Local, E2B-unimplemented falls through to Local). Refs review of e4d896d..4442671
Adds two public helpers to terraphim_automata::matcher:
- compute_concepts_matched(query, &Thesaurus) -> Vec<String>:
the small ergonomic wrapper around find_matches that the agent's
robot-search envelope needs in two places.
- thesaurus_from_terms(&RoleName, impl IntoIterator<Item=&str>) ->
Thesaurus: builds a minimal Thesaurus from a slice of plain term
strings, assigning each a unique id via NormalizedTerm::with_auto_id.
Replaces inline blocks at main.rs:2177-2182 (offline) and 4247-4268
(server) with the new helpers. The server-mode call site previously
assigned 1u64 to every reconstructed NormalizedTerm; the new
with_auto_id reconstruction is semantically honest. Both sites log
thesaurus failures at debug! instead of silently swallowing them.
The /thesaurus/{role} HTTP API stays untouched.
Tests: 5 in automata::compute_concepts_matched_tests including a
parity test guarding offline/server thesaurus shapes producing the
same concepts_matched output.
Refs review of e4d896d..4442671
terraphim-rlm-hook.sh: - Replace string-interpolated JSON with jq -n --arg/--argjson so prompts containing double quotes, backslashes, or other shell-meta chars no longer corrupt the JSON-RPC request body. - Replace `timeout 30` (GNU coreutils, absent on macOS by default) with a portable POSIX subshell timeout wrapper. The hook now works on macOS without `brew install coreutils`. - Drop the `2>/dev/null` redirect on the MCP invocation so operators retain stderr for debugging failed calls. install.sh now warns (does not block) if `jq` is missing. README.md documents the `jq` dependency, macOS portability, the known-limitation of the JS plugin's per-call MCP-server spawn, and the new tests/test_hook.sh smoke-test entry point. tests/test_hook.sh exercises the hook with a quoted prompt (asserts the captured payload is valid JSON), a passthrough non-RLM command, and the portable timeout wrapper on a stripped PATH that excludes GNU timeout / gtimeout. Tests stub $TERRAPHIM_MCP so no live MCP server is needed. Refs review of e4d896d..4442671
…e_dir default Implements Debug manually so the optional `token` field appears as "***REDACTED***" instead of leaking via panic backtraces or trace dumps. The struct no longer derives Debug. Provides a sensible default for `cache_dir` instead of an empty PathBuf: prefer $XDG_CACHE_HOME, then $HOME/.cache, then $TMPDIR, always under terraphim/skills/. Tests: 2 unit tests in terraphim_orchestrator::config::tests covering the redacted Debug output and the cache_dir default. Refs review of e4d896d..4442671
Addresses every finding from the structural self-review of the
hardening branch.
P1 fixes:
1. Wire end_session trait method through TerraphimRlm::destroy_session
(rlm.rs:194). FirecrackerExecutor overrides end_session to call the
inherent release_session_vm, so destroying a session now releases
per-backend resources via a single trait call. Adds a regression
test (test_destroy_session_calls_executor_end_session) that uses an
instrumented MockExecutor with an AtomicUsize counter.
2. Fix run_with_timeout PID-reuse race in terraphim-rlm-hook.sh by
running the child in its own process group (setsid where available)
and targeting kill at the negative pgid instead of a recyclable pid.
Falls back to plain pid kill on systems without setsid.
P2 fixes:
3. Drop #[ignore] from the two real-Docker integration tests
(test_docker_concurrent_ensure_no_leak and
test_docker_release_session_container_removes). Add a
skip_unless_image_ready helper that gates on both daemon presence
AND the default image being cached locally; tests skip cleanly with
a hint to `docker pull` if the image is missing.
4. Hook smoke test now also asserts prompt content fidelity
(prompt_content_roundtrip) by extracting params.arguments.prompt
from the captured payload and comparing it byte-for-byte to the
original input. Replaces the awk-based RLM args extractor with
bash parameter expansion that preserves whitespace and literal
quotes.
5. Force run_with_timeout test 3 to actually exercise the portable
wrapper on Linux by staging a private $PATH (only sleep, kill,
bash, sh, dash, setsid, grep, cut, head; no timeout / gtimeout) and
sourcing the function definitions directly from the hook script.
Test no longer self-skips on systems shipping GNU coreutils.
5a. Add DockerExecutor::with_host_config(HostConfig) builder so callers
can override the permissive default profile (network=bridge,
readonly_rootfs=false). Test confirms override propagates.
5c. Replace the 200ms fixed sleep in test_local_kills_on_timeout with a
bounded pgrep poll (50ms backoff up to 2s). Marker now includes a
Ulid so concurrent test runs don't collide.
5d clippy: convert five `let mut x = T::default(); x.field = ...`
patterns to struct-update syntax, silencing the
field_reassign_with_default warnings.
Test count: 128 in terraphim_rlm lib (was 126), 4 in hook smoke tests
(was 3). Zero clippy warnings on touched crates. cargo fmt clean.
Audit trail for the disciplined-development workflow that produced this branch. Captures: - Vital-few constraints (no new deps, no trait churn, no API change). - Eliminated alternatives (capability-matcher rewrite, /thesaurus/ API enrichment, container pooling) with rationale per item. - Five risks with assigned mitigations and three open questions resolved during implementation. - File-level change list, function signatures, test strategy, and step sequencing matching the seven commits on this branch.
…ndow
Add last_cron_fire HashMap to track per-agent last fire time, preventing
agents from firing multiple times within the same cron schedule window.
Fixes: agents firing 130+ times instead of ~11 times per schedule cycle
Test: 11 synthetic time tests covering AC1-AC5, compound schedules,
and boundary conditions
Refs #IDX
Contributor
Documentation PreviewYour documentation changes have been deployed to: This preview will be available until the PR is closed. |
Contributor
Documentation PreviewYour documentation changes have been deployed to: This preview will be available until the PR is closed. |
Contributor
Documentation PreviewYour documentation changes have been deployed to: This preview will be available until the PR is closed. |
added 2 commits
May 16, 2026 10:27
…rraphim-ai into task/rlm-executor-hardening
Contributor
Documentation PreviewYour documentation changes have been deployed to: This preview will be available until the PR is closed. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses every finding from the structural review of commits
e4d896d3d..4442671e9(LocalExecutor + DockerExecutor + OpenCode plugin + agentconcepts_matchedfix). 9 commits, no new dependencies, no public HTTP API change.fix(rlm-local)— honourctx.timeout_ms,kill_on_drop(true), snapshots returnRlmError::NotSupported. Addsend_session(&SessionId)to theExecutionEnvironmenttrait with defaultOk(())impl.fix(rlm-docker)—DashMap<SessionId, Arc<tokio::sync::Mutex<Option<String>>>>closes the TOCTOU race inensure_container. Addsrelease_session_containerinherent method.host_configwith permissive limits (memory 512 MiB, pids 256, drop ALL caps, network=bridge, readonly_rootfs=false).sleep infinitykeepalive. Removes#[allow(dead_code)].fix(rlm-selector)— E2B arm logsdebug!and continues (no more "Selected E2B" lie); Docker init failure degrades to next backend; Local fallback logswarn!withtriedcontext;is_docker_available()cached once per call.fix(agent)— Extractscompute_concepts_matchedandthesaurus_from_termsto publicterraphim_automata::matcher; both call sites use them; server path useswith_auto_id(no moreid=1u64collision). Parity test guards against future drift.fix(rlm-hook)—jq -n --arg/--argjsonfor safe JSON; portable POSIXrun_with_timeout(process-group kill viasetsid); stderr propagated. 4 smoke tests.fix(orchestrator-config)— ManualDebugredactsGiteaSkillRepoConfig.token;cache_dirdefaults to$XDG_CACHE_HOME/$HOME/.cache/$TMPDIR(underterraphim/skills).fix(rlm-hardening): self-review follow-ups— Wiresend_sessionthroughTerraphimRlm::destroy_session;FirecrackerExecutoroverrides to callrelease_session_vm. Process-group kill in hook watchdog. Removes#[ignore]from Docker integration tests; gates them on image presence (skip_unless_image_ready). Hook smoke test now also asserts prompt content fidelity (prompt_content_roundtrip); awk extractor replaced by bash parameter expansion. AddsDockerExecutor::with_host_config(HostConfig)builder. Replaces 200ms sleep heuristic intest_local_kills_on_timeoutwith bounded pgrep poll. Forcesrun_with_timeouttest to actually exercise wrapper on Linux via private-PATH staging. Five clippyfield_reassign_with_defaultwarnings cleared via struct-update syntax.docs(rlm-hardening)— Phase 1 research and Phase 2 design under.docs/as the disciplined-development audit trail.chore(workspace): unblock pre-commit hook— Pre-existing main-branch fixups:gitea_skill_repo: Noneadded to 5 orchestrator integration test fixtures;.as_str()+clone()workaround for the broken agent code that Step 4 then replaces;#[serial]+ defensiveenv::remove_varon the env-pollutingllm::router_configtests.Test plan
cargo fmt --all -- --checkcleancargo clippy -p terraphim_rlm -p terraphim_automata -p terraphim_agent -p terraphim_orchestrator --all-targetsproduces 0 warnings on touched codecargo test -p terraphim_rlm --features full --lib— 128 / 128 (was 124; integration tests now run when image is cached, otherwise skip cleanly)cargo test -p terraphim_automata --lib— 76 / 76 (5 newcompute_concepts_matched_testsincl. parity)cargo test -p terraphim_orchestrator --lib config::— 68 / 68 (2 newgitea_skill_repotests)cargo test --workspace --lib— every package greenbash examples/opencode-plugin-rlm/tests/test_hook.sh— 4 / 4docker pull python:3.11-slimthen re-run RLM tests to exercise the live integration tests (test_docker_concurrent_ensure_no_leak,test_docker_release_session_container_removes).docs/research-rlm-executor-hardening.mdand.docs/implementation-plan-rlm-executor-hardening.mdDiff size
24 files changed, 2137 insertions(+), 130 deletions(-)(~940 lines are the audit-trail docs)Notes
Refs review of e4d896d3d..4442671e9for traceability.9cf6ec2d3before pushing.Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com