Releases: PrimeIntellect-ai/verifiers
v0.1.13.dev7
Verifiers v0.1.13.dev7 Release Notes
Date: 04/24/2026
Highlights since v0.1.13.dev6
- `rlm_harness` swaps turn-based context caps for token-based auto-compaction: the new `summarize_at_tokens: int | None` kwarg maps to `RLM_SUMMARIZE_AT_TOKENS`, while `rlm_max_turns_in_context`/`RLM_MAX_TURNS_IN_CONTEXT` are removed to match upstream rlm. `summarize` also drops out of the default `rlm_tools` set. Invalid shapes fail at harness-build time instead of deep inside the sandbox.
- Reverted `TaskSet.filter`/`.take` returning `Self` (originally #1232); the change broke Python 3.10/3.11 compatibility. CI now exercises the 3.10 and 3.11 test matrices so the fix can be restored with confidence.
Changes included in v0.1.13.dev7 (since v0.1.13.dev6)
Features and enhancements
- rlm_harness: add `summarize_at_tokens`, drop `rlm_max_turns_in_context` (#1236)
Fixes and maintenance
- Revert "types: TaskSet.filter / .take return Self, not TaskSet (#1232)" (#1237)
- ci: add Python 3.10 and 3.11 to the test matrix (#1237)
Full Changelog: v0.1.13.dev6...v0.1.13.dev7
v0.1.13.dev6
Verifiers v0.1.13.dev6 Release Notes
Date: 04/23/2026
Highlights since v0.1.13.dev5
- `rlm_harness` is now the single source of truth for RLM_* sandbox env vars. New kwargs `rlm_max_turns`, `rlm_max_turns_in_context`, and `rlm_exec_timeout` map 1:1 onto the matching env vars on `Harness.environment_vars` and merge into the sandbox via `ComposableEnv.build_env_vars` (harness wins). Research envs can stop setting these via `ComposableEnv(environment_vars=…)` and pass them through as harness kwargs instead.
- `TaskSet.filter`/`.take` now return `Self`, not `TaskSet`, so subclass types survive taskset chaining for downstream typed consumers.
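The "harness wins" merge semantics can be sketched with plain dicts; the function name here is an assumption for illustration, not the actual `ComposableEnv.build_env_vars` implementation:

```python
def merge_env_vars(env_level: dict[str, str], harness_level: dict[str, str]) -> dict[str, str]:
    """Hypothetical sketch: merge env-level and harness-level env vars.

    On conflict, the harness-provided value wins, matching the
    harness-wins behavior described in the release notes.
    """
    merged = dict(env_level)
    merged.update(harness_level)  # later update wins on shared keys
    return merged
```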
Changes included in v0.1.13.dev6 (since v0.1.13.dev5)
Features and enhancements
- rlm_harness: own RLM_MAX_TURNS / _IN_CONTEXT / _EXEC_TIMEOUT env vars (#1229)
Fixes and maintenance
- types: TaskSet.filter / .take return Self, not TaskSet (#1232)
Full Changelog: v0.1.13.dev5...v0.1.13.dev6
v0.1.13.dev5
Verifiers v0.1.13.dev5 Release Notes
Date: 04/22/2026
Highlights since v0.1.13.dev4
- Made the interception proxy's streaming response resilient to upstream cuts: 10s SSE keepalive comments keep idle streams warm, a per-chunk `asyncio.sleep(0)` forces an event-loop yield so content and close can't race the transport flush under warmup-burst contention, and transport exceptions at prepare/write/write_eof are surfaced as `StreamInterrupted` into `state["error"]` so rollouts reschedule instead of looking like clean zero-turn completions.
- Added a new experimental `mini_swe_agent` composable harness (pip/uv install with SHA256-verified wheel download), exported alongside the existing `rlm` and `opencode` harnesses.
- Extended `SandboxMixin` to cover VM sandboxes in addition to containers (including GPU VMs via `CreateSandboxRequest`), with documentation clarifying feature parity (file I/O, background jobs, cleanup) and container-only features (port exposure, SSH).
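The per-chunk yield and error-surfacing pattern in the first bullet can be sketched as follows. This is a simplified stand-in, assuming a callable `write` and a mutable `state` dict; the real proxy's transport handling differs.

```python
import asyncio


class StreamInterrupted(Exception):
    """Marker for an upstream transport cut, recorded in state['error']."""


async def relay(chunks, write, state):
    """Hypothetical sketch: forward chunks, yielding after each write.

    asyncio.sleep(0) forces an event-loop yield per chunk so the final
    content write and the stream close cannot race the transport flush.
    Transport failures become StreamInterrupted in state['error'] instead
    of looking like a clean zero-turn completion.
    """
    try:
        for chunk in chunks:
            write(chunk)            # may raise if the transport died mid-stream
            await asyncio.sleep(0)  # cooperative yield to the event loop
    except ConnectionError as exc:
        state["error"] = StreamInterrupted(str(exc))
```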
Changes included in v0.1.13.dev5 (since v0.1.13.dev4)
Latest changes from main
- Includes the latest `main` changes through the interception proxy streaming resilience fix (#1194), along with the `mini_swe_agent` harness (#1219) and `SandboxMixin` VM sandbox support/docs (#1222).
Features and enhancements
Fixes and maintenance
- fix: make interception proxy streaming resilient to upstream cuts (#1194)
Full Changelog: v0.1.13.dev4...v0.1.13.dev5
v0.1.13.dev4
Verifiers v0.1.13.dev4 Release Notes
Date: 04/22/2026
Highlights since v0.1.13.dev3
- RLM harness: new `rlm_tools` kwarg sets both `Harness.tool_names` (for `ToolMonitorRubric`) and the sandbox `RLM_TOOLS` env var from a single source, plus a new `Harness.environment_vars` field merged harness-wins-on-conflict by `ComposableEnv`.
- Refactored experimental RLM checkout caching; `DEFAULT_RLM_BRANCH` renamed to `DEFAULT_RLM_REF` and `rlm_harness(..., rlm_branch=...)` renamed to `rlm_ref=` to reflect that any git ref (branch, tag, sha) is accepted.
- Added a `SandboxTimeouts` dataclass centralizing per-operation sandbox HTTP timeouts.
- Expanded task coverage with SWE-rebench-V2 and a multilingual SWESmith taskset, plus a `filter_fn` kwarg on all tasksets for ad-hoc row filtering.
- vf-eval: renamed `-d`/`--debug` to `--disable-tui` and `--tui` to `--fullscreen` for clearer intent.
- RLM rollout metrics (context tokens, programmatic tool calls) exposed to verifiers and auto-merged by the composable env.
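The `filter_fn` kwarg described above is a row predicate applied at load time. A minimal sketch of the idea, with an assumed helper name and plain-dict rows rather than the library's actual taskset types:

```python
def load_rows(rows, filter_fn=None):
    """Hypothetical sketch: apply an ad-hoc row predicate at load time.

    filter_fn takes one row (here a plain dict) and returns True to keep it,
    mirroring the filtering the taskset kwarg enables.
    """
    if filter_fn is None:
        return list(rows)
    return [row for row in rows if filter_fn(row)]
```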
Changes included in v0.1.13.dev4 (since v0.1.13.dev3)
Features and enhancements
- vf-eval: replace -d/--debug with --disable-tui, rename --tui to --fullscreen (#1183)
- Expose RLM metrics to verifiers (#1195)
- Add streaming observability + resume to TaskSet.validate() (#1169)
- Refactor experimental RLM checkout caching (#1202)
- feat: add filter_fn kwarg to all tasksets for ad-hoc row filtering (#1199)
- feat: add multilingual SWESmithTaskSet (#1186)
- feat: add SWE-rebench-V2 TaskSet (#1187)
- HarborMCPMixin (#1146)
- feat: SandboxTimeouts dataclass — centralize per-operation sandbox HTTP timeouts (#1207)
- Run SWE-Lego eval via dataset's canonical test_cmd (#1205)
- Authenticate interception server via INTERCEPTION_SECRET (#1180)
- feat: revert agent test edits at grading (swe_lego, swe_rebench_v2) (#1212)
- AgentError: rollout_id, sandbox_id, ... (#1218)
- Remove RLM_DEFAULT_TOOL_NAMES, accept rlm_tools (#1223)
- r2e_gym: add hide_tests_from_agent flag + expose instance_id/repo aliases (#1208)
- feat(rlm): upload a /usr/local/bin/git shim, gated by `allow_git` (#1225)
Fixes and maintenance
- Keep harness metrics merge inside experimental composable env (#1201)
- Propagate typed exceptions from SWE/Harbor validate_instance (#1204)
- fix: pass explicit 60s timeout to get_background_job in poll_job_completion (#1206)
- fix: bump opencode harness default release to v1.1.63-rl2 (#1184)
- validate(): extract resume-file parsing into a named helper (#1209)
- fix: SandboxTimeouts fields must be int (sidecar deserializes as u64) (#1210)
- fix: respect framework-injected OPENAI_API_KEY in RLM and opencode harnesses (#1213)
- fix: offload composable _upload_dir tar build to thread (#1224)
Full Changelog: v0.1.13.dev3...v0.1.13.dev4
v0.1.13.dev3
Verifiers v0.1.13.dev3 Release Notes
Date: 04/19/2026
Highlights since v0.1.13.dev2
- Propagated interception-stream write failures into rollout state as `StreamInterrupted` so truncated agent streams no longer surface as silent clean exits.
- Made RLM checkout resolution lazy in the composable harness, so loading RLM-based environments no longer clones the private checkout up front.
Changes included in v0.1.13.dev3 (since v0.1.13.dev2)
Features and enhancements
- make download of rlm lazy (#1192)
Fixes and maintenance
- fix: propagate interception-stream cuts into rollout state (#1191)
Full Changelog: v0.1.13.dev2...v0.1.13.dev3
v0.1.13.dev2
Verifiers v0.1.13.dev2 Release Notes
Date: 04/19/2026
Highlights since v0.1.13.dev1
- Added richer token usage reporting, including final-context token metrics, updated displays, and API docs.
- Expanded `vf-tui` compare mode so you can inspect any numeric metric with inline selection and adaptive bucketing.
- Improved composable/RLM harness integration with harness-owned upload dirs, cached local RLM checkouts, and auto-registered tool monitoring from `harness.tool_names`.
- Surfaced CLI agent crashes as infra errors even after prior turns, and now include full traces in agent error logs for debugging.
- Removed dead RLM tool config constants from the composable harness exports.
Changes included in v0.1.13.dev2 (since v0.1.13.dev1)
Features and enhancements
- Better token count metrics (#1108)
- vf-tui: compare for all metrics (#1117)
- harness.get_upload_dirs; reduce rlm github requests (#1178)
- tool-env: ToolMonitorRubric takes tool_names instead of tools (#1179)
- feat: auto-register ToolMonitorRubric from harness.tool_names (#1181)
- feat: include full trace in agent error logs (#1185)
Fixes and maintenance
- cli-agent: surface agent crashes as infra errors after any turn (#1177)
- Remove dead RLM tool config constants (#1189)
Full Changelog: v0.1.13.dev1...v0.1.13.dev2
v0.1.13.dev1
Verifiers v0.1.13.dev1 Release Notes
Date: 04/18/2026
Highlights since v0.1.12
- Added the SWELego-Real TaskSet (PrimeIntellect fork, filtered upstream) for broader SWE benchmark coverage.
- Added a `timeout_minutes` kwarg to the R2E, SWEBench, Multi-SWE, and OpenSWE tasksets for finer-grained per-task timeout control.
- Surfaced agent timeout as `state['error']` in `CliAgentEnv` so timeouts are visible in eval results.
- Fixed the `CliAgentEnv` poll loop to honor `self.poll_interval` consistently.
- Bumped `prime-sandboxes` to 0.2.20.
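The interaction of a per-task timeout with a poll interval can be sketched generically; this is an illustrative helper, not `CliAgentEnv`'s actual loop, and the function name is an assumption:

```python
import time


def poll_until(check, timeout_minutes: float, poll_interval: float) -> bool:
    """Hypothetical sketch: poll `check` until it succeeds or time runs out.

    Honors poll_interval between checks (the dev1 fix) and gives up after
    timeout_minutes (the new taskset kwarg), returning False on timeout so
    the caller can surface it as an error.
    """
    deadline = time.monotonic() + timeout_minutes * 60
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(poll_interval)  # wait the configured interval, not a hardcoded one
    return False
```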
Changes included in v0.1.13.dev1 (since v0.1.12)
Features and enhancements
- feat: add SWELego-Real TaskSet (PrimeIntellect fork, filtered upstream) (#1149)
- feat: add `timeout_minutes` kwarg to R2E / SWEBench / Multi-SWE / OpenSWE tasksets (#1171)
Fixes and maintenance
- fix: surface agent timeout as `state['error']` in `CliAgentEnv` (#1170)
- fix: honor `self.poll_interval` in `CliAgentEnv` poll loop (#1173)
- chore: bump prime-sandboxes to 0.2.20 (#1174)
Full Changelog: v0.1.12...v0.1.13.dev1
v0.1.12
Verifiers v0.1.12 Release Notes
Date: 04/17/2026
Full Changelog: v0.1.11...v0.1.12
Highlights since v0.1.11
- Landed a new composable Task/Agent/Environment architecture and upstreamed the opencode/RLM harnesses and swe/lean/math/cp/harbor tasksets into `verifiers.envs.experimental.composable`, so downstream environments can depend on them directly instead of via research-environments.
- Major `RLMEnv` overhaul: new `RLMPromptBuilder`, context dropping with `summarize_turns` and `max_turns_in_context`, a sub-LLM toggle, removed RLM-internal branding from model-visible prompts, richer metrics, hardened root-tool transport (no unsafe pickle), and a reworked harness install flow that runs from a uv workspace checkout.
- Runtime performance and reliability improvements including executor autoscaling, incremental metrics, threaded file I/O, event loop lag monitoring, multi-worker env server support, GC tuning before accepting requests, `setproctitle` labels, dead-tunnel auto-recovery in `CliAgentEnv`, and safer task cancellation paths.
- Richer `vf-tui` with a log viewer, run comparison mode, toggleable markdown/reasoning rendering, rollout and unique-prompt counts with responsive layout, and saved-state columns in the info view.
- Expanded evaluation ergonomics with a configurable `output_dir`, `[[ablation]]` model/endpoint overrides, `max_total_tokens` for `MultiTurnEnv`, `extra_headers_from_state` and `headers`/`extra_headers` support in `endpoints.toml`, `X-Session-ID` for DP-aware routing, preserved multimodal media in saved results, and exported eval parser/normalization helpers for Prime CLI reuse.
- New Hosted Evaluations docs plus an environment performance guide, a refreshed BrowserEnv README, and updated Secrets/Hub guidance across docs and agent skills.
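The header-related ergonomics above (per-rollout extra headers plus a session id for DP-aware routing) can be sketched as a simple merge; the function name and state layout here are assumptions, not the library's actual API:

```python
def build_request_headers(base, state, session_id=None):
    """Hypothetical sketch: assemble request headers for one rollout.

    Merges extra headers carried in rollout state over the base headers,
    then attaches an X-Session-ID for DP-aware routing when provided.
    """
    headers = dict(base)
    headers.update(state.get("extra_headers", {}))  # per-rollout overrides win
    if session_id is not None:
        headers["X-Session-ID"] = session_id
    return headers
```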
Changes included in v0.1.12 (since v0.1.11)
Clients, RLM, and rollout execution
- feat: composable Task/Agent/Environment architecture (#1067)
- upstream `RlmComposableEnv` into `ComposableEnv`/`TaskSet`/`Harness` (#1158)
- move harnesses (rlm, opencode) and tasksets (swe, lean, math, cp, harbor) into `verifiers.envs.experimental.composable` (#1131)
- RLM: `RLMPromptBuilder` (#1070)
- RLM: context dropping & summarization (#1072)
- replace `remove_conversation_turns` with `summarize_turns` standard tool (#1095)
- add `max_turns_in_context`, fix answer extraction, document metrics (#1099)
- RLM: inform model about `max_turns_in_context` limit in scaffolding (#1111)
- remove RLM branding from model-visible prompts and messages (#1089)
- change `tools` arg to pass standard tools to root LLM (#1087)
- add `enable_sub_llms` toggle to `RLMEnv` (#1085)
- simplify RLM message transcript handling (#1116)
- RLM: improve prompts and metrics (#1102)
- refactor: rename RLM metrics for consistency (#1086)
- remove token/timing info from `llm_batch` output and add `max_turns` metric (#1098)
- replace timing info in RLM REPL output with root tool time metrics (#1097)
- harden RLM root-tool transport to remove unsafe pickle deserialization (#1104)
- RLM: remove dead code, harden tunnels (#1107)
- run RLM harness from a uv workspace checkout (#1139)
- revert inline install, use rlm's `install.sh` (#1144)
- clone via git protocol instead of fetching `install.sh` (#1159)
- update RLM harness test to match git-clone install script (#1160)
- RLM harness: install from arbitrary branch (#1153)
- fix RLM harness to use per-example `AGENT_WORKDIR` (#1143)
- port rlm harness dedup install script fix (#1133)
- set `RLM_KERNEL_PYTHON` to sandbox `.venv` for inline imports (#1145)
- guard `RLM_KERNEL_PYTHON` on successful ipykernel install (#1150)
- pin `ipykernel<7` for older sandbox Pythons (#1151)
- fix RLM bash timeout (#1079)
- fix: handle `CommandTimeoutError` in `RLMEnv` (#1069)
- deprecate `RolloutGatewayMixin` (#1017)
- add `NeMoRLChatCompletionsClient` for NeMo Gym model servers (#1141)
- feat: send `X-Session-ID` header during eval for DP-aware routing (#1137)
- feat: add `extra_headers_from_state` to `ClientConfig` (#1048)
- fix: handle `None` prompt/completion token ids in `parse_tokens` (#1066)
- fix tool args passing (#1106)
- fix TITO in opencode envs: bridge extraction, truncation gate, tool-call handling (#1005)
- fix: remove content `rstrip` in `normalize_response` to preserve TITO prefix match (#1081)
Env server, sandbox, and runtime reliability
- feat: multi env worker (#1055)
- fix: propagate `json_logging` to env workers (#1138)
- tune GC on env server before accepting requests (#1022)
- feat: set process titles on env server and workers (#1082)
- perf: executor autoscaling (#1039)
- perf: incremental metrics (#1036)
- perf: offload file I/O to thread pool (#1037)
- feat: improve event loop lag monitor (#1038)
- fix `get_free_port_pair()` TOCTOU race condition (#1013)
- fix: task cancellation race + RLM sandbox workers (#1035)
- fix: call `uncancel()` after catching `CancelledError` in `process_request` (#1047)
- fix cancelled + serialize error (#1044)
- detect server-side tunnel death and auto-recreate in `CliAgentEnv` (#1127)
- fix `AgentError` double-wrapping in `poll_job_completion` (#1130)
- fix: clear root logger handlers hijacked by swebench import (#1163)
- use SDK read-file endpoint and bg job handling in `SandboxMixin` (#1084)
Evaluation UX, metrics, and configuration
- `vf-tui`: log viewer (#1075)
- `vf-tui`: fixes & features (including comparison mode and markdown/reasoning toggles) (#1007)
- `vf-tui`: show rollouts and unique prompts, better dynamic width (#1060)
- show saved state columns in TUI info view (#1091)
- make `output_dir` configurable in evals (#1029)
- handle ablation `model` and `endpoint_id` overrides (#1135)
- export eval parser and normalization helpers for Prime CLI reuse (#1135)
- feat: add `max_total_tokens` parameter to `MultiTurnEnv` (#1101)
- support `headers`/`extra_headers` in `endpoints.toml` (#1051)
- preserve multimodal media in saved eval results (#1015)
- fix display of custom sampling args (#1025)
- fix output dir logging (#1041)
- fix host-side eval in composable CP wrapper parsing (#1165)
- fix composable `mkdir` path quoting (#1110)
Environments, multimodality, and integrations
- add `BrowserEnv` integration README (#1020)
- `opencode_rlm_env` (#1023)
- misc improvements to opencode envs (#999)
- perf improvs for opencode envs + math rubric (#1034)
- opencode envs (including `CliAgentEnv` hardening, hybrid math rubric overhaul, and log capture) (#1005)
- fix: revert `opencode_env` config regression and move RLM logic out of `cli_agent_env` (#1042)
- fix opencode config for model names without slash (#1114)
- feat: dataset builder pattern for lazy loading in all environments (#1064)
- add cleanup and teardown lifecycle hooks to `Rubric` (#1026)
- remove redundant msg normalization + align `env_response` API (#1027)
- chore: reuse math rubric in hybrid math rubric (#1043)
- perf: math rubric skip overlong answers (#1046)
- fix math rubric timeout (#1096)
- lazily import packages (#1019)
- fix: env tests (#1061)
Docs, CLI, and tooling
- docs: add performance guide for environments (#1045)
- add hosted evaluations section to eval docs (#1040)
- update Secrets guidance (BrowserBase README) (#1056)
- docs: prefix `prime eval` models (#1125)
- `tomllib`/`tomli` guard for Python 3.10 (#1136)
- pin `regex<2026.4.4` (missing cp312/cp313 wheels) (#1109)
- pin uv `<0.11.0` to fix flash-attn resolution (#1057)
- bump uv requirement to `>=0.11.1` (#1112)
v0.1.12.dev6
Verifiers v0.1.12.dev6 Release Notes
Date: 04/15/2026
Highlights since v0.1.12.dev5
- Exported the eval parser and normalization helpers so Prime CLI can reuse Verifiers eval parsing and config loading.
- Fixed eval ablation config overrides so explicit `model` and `endpoint_id` settings behave correctly.
- Propagated `json_logging` from EnvServer to env workers so worker logs stay structured when JSON logging is enabled.
- Fixed the RLM harness install flow to avoid duplicate installs and restored Harbor TOML loading on Python 3.10.
Changes included in v0.1.12.dev6 (since v0.1.12.dev5)
Features and enhancements
- feat: export eval parser and normalization helpers for Prime CLI reuse (#1135)
- fix: handle ablation `model` and `endpoint_id` overrides in eval configs (#1135)
Fixes and maintenance
- fix: propagate `json_logging` to env workers (#1138)
- fix: port RLM harness dedup install script update (#1133)
- fix: add `tomli` fallback for Harbor on Python 3.10 (#1136)
Full Changelog: v0.1.12.dev5...v0.1.12.dev6
v0.1.12.dev5
Verifiers v0.1.12.dev5 Release Notes
Date: 04/14/2026
Highlights since v0.1.12.dev4
- Moved all harnesses and tasksets from research-environments into verifiers proper. Environments can now import directly from `verifiers.envs.experimental.composable` instead of relying on relative path dependencies.
Changes included in v0.1.12.dev5 (since v0.1.12.dev4)
Features and enhancements
- feat: move harnesses (rlm, opencode) and tasksets (swe, lean, math, cp, harbor) into `verifiers.envs.experimental.composable` (#1131)
Full Changelog: v0.1.12.dev4...v0.1.12.dev5