
opencode_rlm_env #1023

Merged
snimu merged 23 commits into main from sebastian/ocrlm-2026-03-13
Mar 19, 2026

Conversation

@snimu (Contributor) commented Mar 16, 2026

Description

opencode_rlm_env

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass when running `uv run pytest` locally.
  • New tests have been added to cover the changes

Checklist

  • My code follows the style guidelines of this project as outlined in AGENTS.md
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Note

Medium Risk
Adds a new sandboxed agent environment that handles concurrent sub-LLM requests and changes interception/config plumbing; concurrency and request-routing changes could affect rollout stability and token accounting.

Overview
Introduces OpenCodeRLMEnv, extending OpenCodeEnv to support Recursive Language Model sub-agent calls via the snimu/oc plugin, including sandbox bootstrapping (bun + plugin install), header-based sub-LLM detection, and concurrent handling of sub-requests with semaphore-limited parallelism and optional trajectory logging.

Refactors CliAgentEnv request polling into _poll_next_request() to support the new concurrent routing, adjusts first-turn prompt capture to ignore sub-LLM steps, and updates OpenCodeEnv’s generated OpenCode config to use a fixed intercepted/model provider mapping.

Enhances the interception layer to record incoming HTTP headers per request (used for sub-LLM role detection), adds a monitor rubric for main vs sub-LLM token/turn metrics, adds comprehensive tests for config rendering and metrics, and updates environment docs to list OpenCodeEnv/OpenCodeRLMEnv.
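The header-based sub-LLM detection and semaphore-limited concurrency described above can be sketched in a few lines of asyncio. This is an illustrative sketch, not the env's actual API: `run_sub_calls` and its default parallelism are hypothetical; only the `x-rlm-role: sub` header check mirrors the PR.

```python
import asyncio
from typing import Any

def is_sub_llm_request(intercept: dict[str, Any]) -> bool:
    # The OC plugin tags sub-agent calls with an explicit header;
    # header keys are assumed lowercased by the interception layer.
    return intercept.get("headers", {}).get("x-rlm-role") == "sub"

async def run_sub_calls(
    intercepts: list[dict[str, Any]], max_parallelism: int = 4
) -> list[str]:
    # Bound concurrent sub-LLM work with a semaphore, as the env does.
    sem = asyncio.Semaphore(max_parallelism)

    async def one(intercept: dict[str, Any]) -> str:
        async with sem:
            await asyncio.sleep(0)  # stand-in for the real sub-LLM call
            return intercept["headers"]["x-rlm-role"]

    subs = [i for i in intercepts if is_sub_llm_request(i)]
    return await asyncio.gather(*(one(i) for i in subs))
```

Main-agent requests (those without the header) would fall through to the normal rollout loop instead of entering the semaphore-guarded path.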

Written by Cursor Bugbot for commit 8bf7d0a.

snimu and others added 3 commits March 15, 2026 16:13
OpenCodeRLMEnv extends OpenCodeEnv with the snimu/oc RLM plugin for
recursive sub-LLM calls. Sets env vars so the plugin routes llm-subcall
and subagent calls through the interception proxy with model="sub",
enabling concurrent handling and separate token tracking.

Includes opencode-rlm-test environment with 3 tasks exercising basic
bash, llm-subcall, and subagent capabilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers constructor defaults, config generation (including shell
expansion), run command content, env var setup, sub-LLM detection,
state initialization, metrics tracking, and monitor rubric.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SWE-Bench Docker images use sh (dash) as default shell, which doesn't
support the bash-only `pipefail` option.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@cursor bot left a comment:

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: Fire-and-forget task may be garbage collected
    • Added _sub_llm_tasks set to track task references and prevent garbage collection, following the pattern used in parent class.
  • ✅ Fixed: New environment not listed in environments README
    • Added opencode_rlm_test to environments/README.md under experimental environments section and pattern reference section.
  • ✅ Fixed: Swallowing CancelledError prevents proper task cancellation
    • Added raise statement after exception handling in _handle_sub_llm_request to properly propagate CancelledError and other exceptions.

Preview (362a510ff0)
diff --git a/environments/README.md b/environments/README.md
--- a/environments/README.md
+++ b/environments/README.md
@@ -45,6 +45,9 @@
 - **RLMEnv (Recursive Language Model)**
   - **rlm_secrets**: Puzzle environment testing RLM functionality including root-level tools, sub-LLM tool use, and file operations.
 
+- **OpenCodeRLMEnv (OpenCode with RLM plugin)**
+  - **opencode_rlm_test**: Smoke-test environment for `OpenCodeRLMEnv` demonstrating concurrent sub-LLM handling with the RLM plugin.
+
 - **HarborEnv / CliAgentEnv (CLI agent sandboxes)**
   - **opencode_harbor**: Runs the OpenCode CLI agent on Harbor tasks with API interception via Prime Tunnel.
   - **terminus_harbor**: Runs the Terminus agent on Harbor tasks with API interception via Prime Tunnel.
@@ -75,6 +78,7 @@
 - **CLI agent sandboxes**: `opencode_harbor`, `terminus_harbor`
 - **MCP integration**: `mcp_search_env`
 - **RLM (recursive LLM)**: `rlm_secrets`
+- **OpenCode RLM integration**: `opencode_rlm_test`
 - **Environment and rubric composition**: `math_group`, `math_python`, `wiki_search`
 - **Procedural datasets**: `reasoning_gym_env`
 - **Multimodal**: `mmmu`

diff --git a/verifiers/envs/experimental/opencode_rlm_env.py b/verifiers/envs/experimental/opencode_rlm_env.py
--- a/verifiers/envs/experimental/opencode_rlm_env.py
+++ b/verifiers/envs/experimental/opencode_rlm_env.py
@@ -167,6 +167,7 @@
         self.sub_timeout_ms = sub_timeout_ms
         self.include_sub_llm_in_trajectory = include_sub_llm_in_trajectory
         self._sub_llm_semaphore = asyncio.Semaphore(max_sub_llm_parallelism)
+        self._sub_llm_tasks: set[asyncio.Task[None]] = set()
 
         kwargs.setdefault("run_command_template", RLM_RUN_COMMAND_TEMPLATE)
 
@@ -304,9 +305,11 @@
 
             if self._is_sub_llm_request(intercept):
                 # Fire-and-forget: handled concurrently outside the loop
-                asyncio.create_task(
+                task = asyncio.create_task(
                     self._handle_sub_llm_request(state, request_id, intercept)
                 )
+                self._sub_llm_tasks.add(task)
+                task.add_done_callback(self._sub_llm_tasks.discard)
                 continue
 
             # Main-agent request → return to rollout loop
@@ -349,6 +352,7 @@
             except BaseException as e:
                 error = e
                 logger.warning("Sub-LLM request %s failed: %s", request_id, e)
+                raise
             finally:
                 if intercept.get("stream"):
                     await synthesize_stream(intercept, response, error)


snimu and others added 3 commits March 15, 2026 20:49
Replaced by opencode-rlm-swe in research-environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Store asyncio.create_task references in state["_sub_llm_tasks"] set
and use done callbacks to clean up. Prevents Python from silently
dropping in-flight sub-LLM requests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows CancelledError and KeyboardInterrupt to propagate for proper
task cancellation during shutdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu and others added 2 commits March 15, 2026 21:03
…dler

Catch BaseException to always resolve the HTTP future (preventing
hangs), but re-raise non-Exception types (CancelledError, etc.) after
delivery so task cancellation still propagates correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
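The "always resolve the future, but re-raise non-Exception types" behavior from this commit can be sketched as follows. This is a simplified illustration under assumed names (`handle_request` is hypothetical); the key points are catching `BaseException` so the waiting HTTP future never hangs, and re-raising `CancelledError`/`KeyboardInterrupt` after delivery so cancellation still propagates.

```python
import asyncio
from typing import Any, Awaitable, Callable

async def handle_request(
    fut: asyncio.Future, work: Callable[[], Awaitable[Any]]
) -> None:
    try:
        fut.set_result(await work())
    except BaseException as e:
        if not fut.done():
            # Deliver the error so the waiter wakes up instead of hanging.
            fut.set_exception(e)
        if not isinstance(e, Exception):
            # CancelledError, KeyboardInterrupt, SystemExit: propagate
            # so task cancellation during shutdown works correctly.
            raise
```

Ordinary `Exception`s are swallowed after delivery (the waiter sees them via the future), while cancellation-style exceptions still unwind the task.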
Drain all pending sub-LLM tasks when the agent completes or times out,
ensuring metrics and trajectory updates are finalized before scoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu and others added 4 commits March 15, 2026 21:26
Use `prompt` instead of `prompt_messages`, and include all required
fields (completion, tokens, reward, advantage, is_truncated,
trajectory_id) to match the TrajectoryStep TypedDict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace model-name substring matching with an explicit X-RLM-Role: sub
HTTP header set by the OC plugin. The interception server now captures
all request headers in the intercept dict for general-purpose use.

Removes: RLM_SUB_MODEL_ID env var, sub_model_identifier param,
RLM_LLM_SUBCALL_VIA_PROXY env var (llm-subcall now routes through
OPENAI_BASE_URL automatically when set).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace model-name substring matching with an explicit X-RLM-Role: sub
HTTP header set by the OC plugin. The interception server now captures
all request headers (lowercased) for general-purpose use.

Removes:
- RLM_SUB_MODEL_ID env var and sub_model_identifier param
- RLM_LLM_SUBCALL_VIA_PROXY env var (llm-subcall now routes through
  OPENAI_BASE_URL automatically when set)
- Model-name substring matching

Headers are stored with lowercase keys to handle HTTP/2 case
normalization correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mikasenghaas (Member) left a comment:


nice, lgtm!

self.include_sub_llm_in_trajectory = include_sub_llm_in_trajectory
self._sub_llm_semaphore = asyncio.Semaphore(max_sub_llm_parallelism)

kwargs.setdefault("run_command_template", RLM_RUN_COMMAND_TEMPLATE)
Member:

ah, i actually like this as a pattern to distinguish env args from this env or parent envs

def _is_sub_llm_request(intercept: dict[str, Any]) -> bool:
    return intercept.get("headers", {}).get("x-rlm-role") == "sub"

# ------------------------------------------------------------------
Member:

i dont like all these section dividers

# Request routing
# ------------------------------------------------------------------

async def get_prompt_messages(self, state: State) -> Messages:
Member:

maybe put the first part of this method (which i think is shared with cli agent env) in a shared util so we dont dup code here?

)
# Only count non-empty turns (skip the synthetic agent-completed step)
if prompt:
    self._update_main_metrics(state, response)
Member:

what's the reason to compute metrics at rollout runtime as opposed to in the rubric reward fn?

@snimu (Contributor, Author):

Copied it from the RLMEnv where some time ago I think I was planning to live-update metrics bc the rollouts were so long, but then didn't; I can undo it.

@snimu (Contributor, Author):

oh wait there is a reason: the sub-LLM calls aren't guaranteed to be in the trajectory (and by default, they aren't), so we need to collect those metrics incrementally. I could do it afterward for the root-LLM but it saves ~1 line of code, so it's probably not worth it.

snimu and others added 2 commits March 16, 2026 14:20
Move the tunnel/completion/timeout polling loop from get_prompt_messages
into _poll_next_request on CliAgentEnv. OpenCodeRLMEnv now calls this
helper instead of duplicating the loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
snimu and others added 4 commits March 16, 2026 15:33
… in rubric

- Extract _poll_next_request into CliAgentEnv so the RLM env reuses
  the polling loop instead of duplicating it
- Move main-agent metric computation from get_model_response override
  to the rubric (computed from trajectory at scoring time)
- Remove get_model_response override and _update_main_metrics
- Add cleanup handler to cancel in-flight sub-LLM tasks on rollout end

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ${OPENAI_MODEL} shell expansion with a fixed "intercepted/model"
provider/model pair, matching the opencode_harbor pattern. The model
name doesn't matter since all API calls go through the interception
proxy. This fixes the ProviderModelNotFoundError when users pass model
names without a provider/ prefix (e.g. gpt-5-mini instead of
openai/gpt-5-mini).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use self.logger instead of module-level logger
- Remove stale get_model_response override and _update_main_metrics
  (main metrics are now computed in the rubric from trajectory)
- Remove unused imports (logging, MessageType, SamplingArgs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The get_model_response override and _update_main_metrics were removed
but the rubric was still reading main_* from state (always 0). Now
main_turns/main_prompt_tokens/main_completion_tokens are computed from
state["trajectory"] at scoring time. Sub-LLM metrics remain in state
(accumulated during rollout).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Filter out trajectory steps with extras.is_sub_llm_call=True when
computing main_turns/main_prompt_tokens/main_completion_tokens.
Prevents double-counting when include_sub_llm_in_trajectory is enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
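The double-counting fix in this commit can be sketched as a scoring-time pass over the trajectory. This is an illustrative sketch only: the helper name `main_metrics` and the `prompt_tokens`/`completion_tokens` step fields are assumptions (the PR only confirms that steps carry an `extras.is_sub_llm_call` flag and token counts).

```python
from typing import Any

def main_metrics(trajectory: list[dict[str, Any]]) -> dict[str, int]:
    # Skip steps flagged as sub-LLM calls so they are not double-counted
    # when include_sub_llm_in_trajectory is enabled.
    main = [
        step for step in trajectory
        if not step.get("extras", {}).get("is_sub_llm_call", False)
    ]
    return {
        "main_turns": len(main),
        "main_prompt_tokens": sum(s.get("prompt_tokens", 0) for s in main),
        "main_completion_tokens": sum(s.get("completion_tokens", 0) for s in main),
    }
```

Computing these from `state["trajectory"]` at scoring time (rather than during the rollout) keeps the main-agent metrics in one place, while sub-LLM metrics still accumulate in state during the rollout.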
snimu and others added 2 commits March 19, 2026 15:43
…king

When include_sub_llm_in_trajectory is enabled, sub-LLM steps can be
appended before the first main step, making len(trajectory) > 0. Use
has_main_step check instead so state["prompt"] is still set correctly
on the first main-agent turn.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace pipe (cat | opencode run | tee) with redirect + cat so the
script exits with opencode's actual exit code. The pipe masked failures
because set -e only checks the last command in a pipeline (tee).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@cursor bot left a comment:

Cursor Bugbot has reviewed your changes and found 1 potential issue.


set -e would exit the script before _oc_exit capture and log emission.
Temporarily disable with set +e, capture exit code, re-enable, then
cat logs and exit with the real code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@snimu snimu merged commit 5d84c1b into main Mar 19, 2026
6 checks passed
mikasenghaas added a commit that referenced this pull request Mar 19, 2026
…cli_agent_env

Revert opencode_env.py config builder to use shell variable expansion
instead of hardcoded "intercepted/model" (regression from #1023).

Move is_sub_llm_call-aware first-turn prompt check from CliAgentEnv
into OpenCodeRLMEnv where it belongs, restoring the simple
len(trajectory)==0 check in the base class.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mikasenghaas added a commit that referenced this pull request Mar 19, 2026
…cli_agent_env (#1042)

* fix: revert opencode_env config regression and move RLM logic out of cli_agent_env

Revert opencode_env.py config builder to use shell variable expansion
instead of hardcoded "intercepted/model" (regression from #1023).

Move is_sub_llm_call-aware first-turn prompt check from CliAgentEnv
into OpenCodeRLMEnv where it belongs, restoring the simple
len(trajectory)==0 check in the base class.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: update test to match reverted shell variable expansion in opencode config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
