Releases: PrimeIntellect-ai/verifiers

v0.1.13.dev7

24 Apr 02:36
b3c054d

Verifiers v0.1.13.dev7 Release Notes

Date: 04/24/2026

Highlights since v0.1.13.dev6

  • rlm_harness swaps turn-based context caps for token-based auto-compaction: new summarize_at_tokens: int | None kwarg maps to RLM_SUMMARIZE_AT_TOKENS, while rlm_max_turns_in_context / RLM_MAX_TURNS_IN_CONTEXT are removed to match upstream rlm. summarize also drops out of the default rlm_tools set. Invalid shapes fail at harness-build time instead of deep inside the sandbox.
  • Reverted TaskSet.filter / .take returning Self (originally #1232), since the change broke Python 3.10/3.11 compatibility. CI now exercises the 3.10 and 3.11 test matrices so the fix can be restored with confidence.

Changes included in v0.1.13.dev7 (since v0.1.13.dev6)

Features and enhancements

  • rlm_harness: add summarize_at_tokens, drop rlm_max_turns_in_context (#1236)

Fixes and maintenance

  • Revert "types: TaskSet.filter / .take return Self, not TaskSet (#1232)" (#1237)
  • ci: add Python 3.10 and 3.11 to the test matrix (#1237)

Full Changelog: v0.1.13.dev6...v0.1.13.dev7

v0.1.13.dev6

23 Apr 11:52
68c5382

Verifiers v0.1.13.dev6 Release Notes

Date: 04/23/2026

Highlights since v0.1.13.dev5

  • rlm_harness is now the single source of truth for RLM_* sandbox env vars. New kwargs rlm_max_turns, rlm_max_turns_in_context, rlm_exec_timeout map 1:1 onto the matching env vars on Harness.environment_vars and merge into the sandbox via ComposableEnv.build_env_vars (harness-wins). Research envs can stop setting these via ComposableEnv(environment_vars=…); pass them through as harness kwargs instead.
  • TaskSet.filter / .take now return Self, not TaskSet, so subclass types survive taskset chaining for downstream typed consumers.
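
A harness-wins merge is just a dict union with the harness mapping taking precedence on key conflicts. A minimal sketch; the function and both dicts are illustrative, not the actual ComposableEnv.build_env_vars implementation:

```python
def merge_env_vars(env_vars: dict[str, str], harness_vars: dict[str, str]) -> dict[str, str]:
    """Merge sandbox env vars, letting the harness win on conflicts."""
    # In a dict literal, later mappings override earlier keys.
    return {**env_vars, **harness_vars}

env_side = {"RLM_MAX_TURNS": "10", "SANDBOX_IMAGE": "base"}
harness_side = {"RLM_MAX_TURNS": "25", "RLM_EXEC_TIMEOUT": "120"}
merged = merge_env_vars(env_side, harness_side)
# harness value wins on the conflicting key; non-conflicting keys survive
```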

Changes included in v0.1.13.dev6 (since v0.1.13.dev5)

Features and enhancements

  • rlm_harness: own RLM_MAX_TURNS / _IN_CONTEXT / _EXEC_TIMEOUT env vars (#1229)

Fixes and maintenance

  • types: TaskSet.filter / .take return Self, not TaskSet (#1232)

Full Changelog: v0.1.13.dev5...v0.1.13.dev6

v0.1.13.dev5

22 Apr 21:21
07981de

Verifiers v0.1.13.dev5 Release Notes

Date: 04/22/2026

Highlights since v0.1.13.dev4

  • Made the interception proxy's streaming response resilient to upstream cuts: 10s SSE keepalive comments keep idle streams warm; a per-chunk asyncio.sleep(0) forces an event-loop yield so content and close can't race the transport flush under warmup-burst contention; and transport exceptions at prepare/write/write_eof are surfaced as StreamInterrupted into state["error"], so rollouts reschedule instead of looking like clean zero-turn completions.
  • Added a new experimental mini_swe_agent composable harness (pip/uv install with SHA256-verified wheel download), exported alongside existing rlm and opencode harnesses.
  • Extended SandboxMixin to cover VM sandboxes in addition to containers (including GPU VMs via CreateSandboxRequest), with documentation clarifying feature parity (file I/O, background jobs, cleanup) and container-only features (port exposure, SSH).
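
Two of the streaming techniques above generalize well: an explicit asyncio.sleep(0) per chunk as a scheduling point, and converting transport failures into a recorded error instead of a silent clean exit. The sketch below uses hypothetical names (StreamInterrupted here is a stand-in class, not an import from verifiers) and omits the SSE keepalive:

```python
import asyncio


class StreamInterrupted(Exception):
    """Illustrative stand-in for the proxy's transport-failure marker."""


async def relay(chunks, write, state):
    """Forward chunks to a transport, yielding to the event loop per chunk.

    await asyncio.sleep(0) forces a scheduling point so a pending close
    cannot race the flush; write failures land in state["error"] instead
    of looking like a clean zero-turn completion.
    """
    try:
        for chunk in chunks:
            write(chunk)
            await asyncio.sleep(0)  # explicit per-chunk event-loop yield
    except ConnectionError as exc:
        state["error"] = StreamInterrupted(str(exc))


def flaky_write(chunk):
    # Simulated transport that dies mid-stream.
    if chunk == "boom":
        raise ConnectionError("upstream cut")


state = {}
asyncio.run(relay(["a", "boom", "c"], flaky_write, state))
# state["error"] now holds a StreamInterrupted rather than nothing
```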

Changes included in v0.1.13.dev5 (since v0.1.13.dev4)

Latest changes from main

  • Includes the latest main changes through the interception proxy streaming resilience fix (#1194), along with the mini_swe_agent harness (#1219) and SandboxMixin VM sandbox support/docs (#1222).

Features and enhancements

  • Add mini-swe-agent harness (#1219)
  • Update SandboxMixin (#1222)

Fixes and maintenance

  • fix: make interception proxy streaming resilient to upstream cuts (#1194)

Full Changelog: v0.1.13.dev4...v0.1.13.dev5

v0.1.13.dev4

22 Apr 13:01
0d140a6

Verifiers v0.1.13.dev4 Release Notes

Date: 04/22/2026

Highlights since v0.1.13.dev3

  • RLM harness: new rlm_tools kwarg sets both Harness.tool_names (for ToolMonitorRubric) and the sandbox RLM_TOOLS env var from a single source, plus new Harness.environment_vars field merged harness-wins-on-conflict by ComposableEnv.
  • Refactored experimental RLM checkout caching; DEFAULT_RLM_BRANCH renamed to DEFAULT_RLM_REF and rlm_harness(..., rlm_branch=...) renamed to rlm_ref= to reflect that any git ref (branch, tag, sha) is accepted.
  • Added SandboxTimeouts dataclass centralizing per-operation sandbox HTTP timeouts.
  • Expanded task coverage with SWE-rebench-V2 and a multilingual SWESmith taskset, plus a filter_fn kwarg on all tasksets for ad-hoc row filtering.
  • vf-eval: renamed -d/--debug to --disable-tui and --tui to --fullscreen for clearer intent.
  • RLM rollout metrics (context tokens, programmatic tool calls) exposed to verifiers and auto-merged by the composable env.
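
A per-operation timeout bundle like SandboxTimeouts can be sketched as a frozen dataclass. The field names and defaults here are illustrative guesses, not the real class; the __post_init__ check enforces int-only values, anticipating a sidecar that deserializes them as u64 (see the #1210 fix below):

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SandboxTimeoutsSketch:
    """Illustrative per-operation sandbox HTTP timeouts, in seconds."""
    create: int = 120
    exec: int = 600
    read_file: int = 60
    delete: int = 30

    def __post_init__(self):
        # Floats would break a sidecar that deserializes these as u64,
        # so reject anything that is not a positive int.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, int) or value <= 0:
                raise TypeError(f"{f.name} must be a positive int, got {value!r}")
```

Centralizing the values in one dataclass means every call site pulls its timeout from a single validated source instead of scattering magic numbers.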

Changes included in v0.1.13.dev4 (since v0.1.13.dev3)

Features and enhancements

  • vf-eval: replace -d/--debug with --disable-tui, rename --tui to --fullscreen (#1183)
  • Expose RLM metrics to verifiers (#1195)
  • Add streaming observability + resume to TaskSet.validate() (#1169)
  • Refactor experimental RLM checkout caching (#1202)
  • feat: add filter_fn kwarg to all tasksets for ad-hoc row filtering (#1199)
  • feat: add multilingual SWESmithTaskSet (#1186)
  • feat: add SWE-rebench-V2 TaskSet (#1187)
  • HarborMCPMixin (#1146)
  • feat: SandboxTimeouts dataclass — centralize per-operation sandbox HTTP timeouts (#1207)
  • Run SWE-Lego eval via dataset's canonical test_cmd (#1205)
  • Authenticate interception server via INTERCEPTION_SECRET (#1180)
  • feat: revert agent test edits at grading (swe_lego, swe_rebench_v2) (#1212)
  • AgentError: rollout_id, sandbox_id, ... (#1218)
  • Remove RLM_DEFAULT_TOOL_NAMES, accept rlm_tools (#1223)
  • r2e_gym: add hide_tests_from_agent flag + expose instance_id/repo aliases (#1208)
  • feat(rlm): upload a /usr/local/bin/git shim, gated by allow_git (#1225)

Fixes and maintenance

  • Keep harness metrics merge inside experimental composable env (#1201)
  • Propagate typed exceptions from SWE/Harbor validate_instance (#1204)
  • fix: pass explicit 60s timeout to get_background_job in poll_job_completion (#1206)
  • fix: bump opencode harness default release to v1.1.63-rl2 (#1184)
  • validate(): extract resume-file parsing into a named helper (#1209)
  • fix: SandboxTimeouts fields must be int (sidecar deserializes as u64) (#1210)
  • fix: respect framework-injected OPENAI_API_KEY in RLM and opencode harnesses (#1213)
  • fix: offload composable _upload_dir tar build to thread (#1224)

Full Changelog: v0.1.13.dev3...v0.1.13.dev4

v0.1.13.dev3

19 Apr 14:57
3fda660

Verifiers v0.1.13.dev3 Release Notes

Date: 04/19/2026

Highlights since v0.1.13.dev2

  • Propagated interception-stream write failures into rollout state as StreamInterrupted so truncated agent streams no longer surface as silent clean exits.
  • Made RLM checkout resolution lazy in the composable harness, so loading RLM-based environments no longer clones the private checkout up front.
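
Lazy resolution of this kind is commonly a cached property that defers the expensive clone until first use. A generic sketch under that assumption, not the harness's actual code:

```python
from functools import cached_property


class CheckoutSketch:
    """Defers an expensive clone until the checkout path is actually needed."""

    def __init__(self, ref: str):
        self.ref = ref
        self.clone_calls = 0  # instrumentation for the example

    @cached_property
    def path(self) -> str:
        # Stand-in for the real `git clone`; runs at most once per instance.
        self.clone_calls += 1
        return f"/tmp/checkouts/{self.ref}"


checkout = CheckoutSketch("main")
assert checkout.clone_calls == 0  # constructing performs no clone
_ = checkout.path
_ = checkout.path
assert checkout.clone_calls == 1  # first access cloned; second hit the cache
```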

Changes included in v0.1.13.dev3 (since v0.1.13.dev2)

Features and enhancements

  • make download of rlm lazy (#1192)

Fixes and maintenance

  • fix: propagate interception-stream cuts into rollout state (#1191)

Full Changelog: v0.1.13.dev2...v0.1.13.dev3

v0.1.13.dev2

19 Apr 14:07
da787bd

Verifiers v0.1.13.dev2 Release Notes

Date: 04/19/2026

Highlights since v0.1.13.dev1

  • Added richer token usage reporting, including final-context token metrics, updated displays, and API docs.
  • Expanded vf-tui compare mode so you can inspect any numeric metric with inline selection and adaptive bucketing.
  • Improved composable/RLM harness integration with harness-owned upload dirs, cached local RLM checkouts, and auto-registered tool monitoring from harness.tool_names.
  • Surfaced CLI agent crashes as infra errors even after prior turns, and now include full traces in agent error logs for debugging.
  • Removed dead RLM tool config constants from the composable harness exports.

Changes included in v0.1.13.dev2 (since v0.1.13.dev1)

Features and enhancements

  • Better token count metrics (#1108)
  • vf-tui: compare for all metrics (#1117)
  • harness.get_upload_dirs; reduce rlm github requests (#1178)
  • tool-env: ToolMonitorRubric takes tool_names instead of tools (#1179)
  • feat: auto-register ToolMonitorRubric from harness.tool_names (#1181)
  • feat: include full trace in agent error logs (#1185)

Fixes and maintenance

  • cli-agent: surface agent crashes as infra errors after any turn (#1177)
  • Remove dead RLM tool config constants (#1189)

Full Changelog: v0.1.13.dev1...v0.1.13.dev2

v0.1.13.dev1

18 Apr 12:31
4458c0c

Verifiers v0.1.13.dev1 Release Notes

Date: 04/18/2026

Highlights since v0.1.12

  • Added the SWELego-Real TaskSet (PrimeIntellect fork, filtered upstream) for broader SWE benchmark coverage.
  • Added timeout_minutes kwarg to R2E, SWEBench, Multi-SWE, and OpenSWE tasksets for finer-grained per-task timeout control.
  • Surfaced agent timeout as state['error'] in CliAgentEnv so timeouts are visible in eval results.
  • Fixed CliAgentEnv poll loop to honor self.poll_interval consistently.
  • Bumped prime-sandboxes to 0.2.20.
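
Surfacing a timeout into state['error'] rather than raising is what makes it visible in saved eval results. A minimal asyncio sketch with hypothetical names (run_with_timeout is not a verifiers API):

```python
import asyncio


async def run_with_timeout(agent, state, timeout_minutes: float):
    """Run an agent coroutine; on timeout, record the error instead of raising."""
    try:
        await asyncio.wait_for(agent(), timeout=timeout_minutes * 60)
    except asyncio.TimeoutError:
        # Recording, not raising, keeps the timeout visible in eval results.
        state["error"] = f"agent timed out after {timeout_minutes} min"


async def slow_agent():
    await asyncio.sleep(10)  # far longer than the budget below


state = {}
asyncio.run(run_with_timeout(slow_agent, state, timeout_minutes=0.001))
# state["error"] now records the timeout
```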

Changes included in v0.1.13.dev1 (since v0.1.12)

Features and enhancements

  • feat: add SWELego-Real TaskSet (PrimeIntellect fork, filtered upstream) (#1149)
  • feat: add timeout_minutes kwarg to R2E / SWEBench / Multi-SWE / OpenSWE tasksets (#1171)

Fixes and maintenance

  • fix: surface agent timeout as state['error'] in CliAgentEnv (#1170)
  • fix: honor self.poll_interval in CliAgentEnv poll loop (#1173)
  • chore: bump prime-sandboxes to 0.2.20 (#1174)

Full Changelog: v0.1.12...v0.1.13.dev1

v0.1.12

17 Apr 17:35
1979705

Verifiers v0.1.12 Release Notes

Date: 04/17/2026

Full Changelog: v0.1.11...v0.1.12

Highlights since v0.1.11

  • Landed a new composable Task/Agent/Environment architecture and upstreamed opencode/RLM harnesses and swe/lean/math/cp/harbor tasksets into verifiers.envs.experimental.composable, so downstream environments can depend on them directly instead of via research-environments.
  • Major RLMEnv overhaul: new RLMPromptBuilder, context dropping with summarize_turns, max_turns_in_context, sub-LLM toggle, removed RLM-internal branding from model-visible prompts, richer metrics, hardened root-tool transport (no unsafe pickle), and a reworked harness install flow that runs from a uv workspace checkout.
  • Runtime performance and reliability improvements including executor autoscaling, incremental metrics, threaded file I/O, event loop lag monitoring, multi-worker env server support, GC tuning before accepting requests, setproctitle labels, dead-tunnel auto-recovery in CliAgentEnv, and safer task cancellation paths.
  • Richer vf-tui with a log viewer, run comparison mode, toggleable markdown/reasoning rendering, rollout and unique-prompt counts with responsive layout, and saved-state columns in the info view.
  • Expanded evaluation ergonomics with configurable output_dir, [[ablation]] model/endpoint overrides, max_total_tokens for MultiTurnEnv, extra_headers_from_state and headers/extra_headers support in endpoints.toml, X-Session-ID for DP-aware routing, preserved multimodal media in saved results, and exported eval parser/normalization helpers for Prime CLI reuse.
  • New Hosted Evaluations docs plus an environment performance guide, refreshed BrowserEnv README, and updated Secrets/Hub guidance across docs and agent skills.

Changes included in v0.1.12 (since v0.1.11)

Clients, RLM, and rollout execution

  • feat: composable Task/Agent/Environment architecture (#1067)
  • upstream RlmComposableEnv into ComposableEnv/TaskSet/Harness (#1158)
  • move harnesses (rlm, opencode) and tasksets (swe, lean, math, cp, harbor) into verifiers.envs.experimental.composable (#1131)
  • RLM: RLMPromptBuilder (#1070)
  • RLM: context dropping & summarization (#1072)
  • replace remove_conversation_turns with summarize_turns standard tool (#1095)
  • add max_turns_in_context, fix answer extraction, document metrics (#1099)
  • RLM: inform model about max_turns_in_context limit in scaffolding (#1111)
  • remove RLM branding from model-visible prompts and messages (#1089)
  • change tools arg to pass standard tools to root LLM (#1087)
  • add enable_sub_llms toggle to RLMEnv (#1085)
  • simplify RLM message transcript handling (#1116)
  • RLM: improve prompts and metrics (#1102)
  • refactor: rename RLM metrics for consistency (#1086)
  • remove token/timing info from llm_batch output and add max_turns metric (#1098)
  • replace timing info in RLM REPL output with root tool time metrics (#1097)
  • harden RLM root-tool transport to remove unsafe pickle deserialization (#1104)
  • RLM: remove dead code, harden tunnels (#1107)
  • run RLM harness from a uv workspace checkout (#1139)
  • revert inline install, use rlm's install.sh (#1144)
  • clone via git protocol instead of fetching install.sh (#1159)
  • update RLM harness test to match git-clone install script (#1160)
  • RLM harness: install from arbitrary branch (#1153)
  • fix RLM harness to use per-example AGENT_WORKDIR (#1143)
  • port rlm harness dedup install script fix (#1133)
  • set RLM_KERNEL_PYTHON to sandbox .venv for inline imports (#1145)
  • guard RLM_KERNEL_PYTHON on successful ipykernel install (#1150)
  • pin ipykernel<7 for older sandbox Pythons (#1151)
  • fix RLM bash timeout (#1079)
  • fix: handle CommandTimeoutError in RLMEnv (#1069)
  • deprecate RolloutGatewayMixin (#1017)
  • add NeMoRLChatCompletionsClient for NeMo Gym model servers (#1141)
  • feat: send X-Session-ID header during eval for DP-aware routing (#1137)
  • feat: add extra_headers_from_state to ClientConfig (#1048)
  • fix: handle None prompt/completion token ids in parse_tokens (#1066)
  • fix tool args passing (#1106)
  • fix TITO in opencode envs: bridge extraction, truncation gate, tool-call handling (#1005)
  • fix: remove content rstrip in normalize_response to preserve TITO prefix match (#1081)

Env server, sandbox, and runtime reliability

  • feat: multi env worker (#1055)
  • fix: propagate json_logging to env workers (#1138)
  • tune GC on env server before accepting requests (#1022)
  • feat: set process titles on env server and workers (#1082)
  • perf: executor autoscaling (#1039)
  • perf: incremental metrics (#1036)
  • perf: offload file I/O to thread pool (#1037)
  • feat: improve event loop lag monitor (#1038)
  • fix get_free_port_pair() TOCTOU race condition (#1013)
  • fix: task cancellation race + RLM sandbox workers (#1035)
  • fix: call uncancel() after catching CancelledError in process_request (#1047)
  • fix cancelled + serialize error (#1044)
  • detect server-side tunnel death and auto-recreate in CliAgentEnv (#1127)
  • fix AgentError double-wrapping in poll_job_completion (#1130)
  • fix: clear root logger handlers hijacked by swebench import (#1163)
  • use SDK read-file endpoint and bg job handling in SandboxMixin (#1084)

Evaluation UX, metrics, and configuration

  • vf-tui: log viewer (#1075)
  • vf-tui: fixes & features (including comparison mode and markdown/reasoning toggles) (#1007)
  • vf-tui: show rollouts and unique prompts, better dynamic width (#1060)
  • show saved state columns in TUI info view (#1091)
  • make output_dir configurable in evals (#1029)
  • handle ablation model and endpoint_id overrides (#1135)
  • export eval parser and normalization helpers for Prime CLI reuse (#1135)
  • feat: add max_total_tokens parameter to MultiTurnEnv (#1101)
  • support headers/extra_headers in endpoints.toml (#1051)
  • preserve multimodal media in saved eval results (#1015)
  • fix display of custom sampling args (#1025)
  • fix output dir logging (#1041)
  • fix host-side eval in composable CP wrapper parsing (#1165)
  • fix composable mkdir path quoting (#1110)

Environments, multimodality, and integrations

  • add BrowserEnv integration README (#1020)
  • opencode_rlm_env (#1023)
  • misc improvements to opencode envs (#999)
  • perf improvs for opencode envs + math rubric (#1034)
  • opencode envs (including CliAgentEnv hardening, hybrid math rubric overhaul, and log capture) (#1005)
  • fix: revert opencode_env config regression and move RLM logic out of cli_agent_env (#1042)
  • fix opencode config for model names without slash (#1114)
  • feat: dataset builder pattern for lazy loading in all environments (#1064)
  • add cleanup and teardown lifecycle hooks to Rubric (#1026)
  • remove redundant msg normalization + align env_response API (#1027)
  • chore: reuse math rubric in hybrid math rubric (#1043)
  • perf: math rubric skip overlong answers (#1046)
  • fix math rubric timeout (#1096)
  • lazily import packages (#1019)
  • fix: env tests (#1061)

Docs, CLI, and tooling

  • docs: add performance guide for environments (#1045)
  • add hosted evaluations section to eval docs (#1040)
  • update Secrets guidance (BrowserBase README) (#1056)
  • docs: prefix prime eval models (#1125)
  • tomllib/tomli guard for Python 3.10 (#1136)
  • pin regex<2026.4.4 (missing cp312/cp313 wheels) (#1109)
  • pin uv <0.11.0 to fix flash-attn resolution (#1057)
  • bump uv requirement to >=0.11.1 (#1112)

v0.1.12.dev6

15 Apr 14:58
89fb727

Verifiers v0.1.12.dev6 Release Notes

Date: 04/15/2026

Highlights since v0.1.12.dev5

  • Exported the eval parser and normalization helpers so Prime CLI can reuse Verifiers eval parsing and config loading.
  • Fixed eval ablation config overrides so explicit model and endpoint_id settings behave correctly.
  • Propagated json_logging from EnvServer to env workers so worker logs stay structured when JSON logging is enabled.
  • Fixed the RLM harness install flow to avoid duplicate installs and restored Harbor TOML loading on Python 3.10.
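
The Python 3.10 fallback mentioned above is the standard tomllib/tomli guard: tomllib entered the stdlib in Python 3.11, and the tomli backport exposes the same API for older interpreters:

```python
try:
    import tomllib  # stdlib on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # Python 3.10 fallback; API-compatible backport

# tomllib.loads parses a TOML string into plain Python containers.
config = tomllib.loads('[harbor]\nname = "example"\ntimeout = 60\n')
# config == {"harbor": {"name": "example", "timeout": 60}}
```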

Changes included in v0.1.12.dev6 (since v0.1.12.dev5)

Features and enhancements

  • feat: export eval parser and normalization helpers for Prime CLI reuse (#1135)
  • fix: handle ablation model and endpoint_id overrides in eval configs (#1135)

Fixes and maintenance

  • fix: propagate json_logging to env workers (#1138)
  • fix: port RLM harness dedup install script update (#1133)
  • fix: add tomli fallback for Harbor on Python 3.10 (#1136)

Full Changelog: v0.1.12.dev5...v0.1.12.dev6

v0.1.12.dev5

14 Apr 16:26
f9fa84e

Verifiers v0.1.12.dev5 Release Notes

Date: 04/14/2026

Highlights since v0.1.12.dev4

  • Moved all harnesses and tasksets from research-environments into verifiers proper. Environments can now import directly from verifiers.envs.experimental.composable instead of relying on relative path dependencies.

Changes included in v0.1.12.dev5 (since v0.1.12.dev4)

Features and enhancements

  • feat: move harnesses (rlm, opencode) and tasksets (swe, lean, math, cp, harbor) into verifiers.envs.experimental.composable (#1131)

Full Changelog: v0.1.12.dev4...v0.1.12.dev5