Releases: PrimeIntellect-ai/verifiers
v0.1.13.dev7
Verifiers v0.1.13.dev7 Release Notes
Date: 04/24/2026
Highlights since v0.1.13.dev6
- `rlm_harness` swaps turn-based context caps for token-based auto-compaction: the new `summarize_at_tokens: int | None` kwarg maps to `RLM_SUMMARIZE_AT_TOKENS`, while `rlm_max_turns_in_context`/`RLM_MAX_TURNS_IN_CONTEXT` are removed to match upstream rlm. `summarize` also drops out of the default `rlm_tools` set. Invalid shapes fail at harness-build time instead of deep inside the sandbox.
- Reverted `TaskSet.filter`/`.take` returning `Self` (originally #1232); the change broke Python 3.10/3.11 compatibility. CI now exercises the 3.10 and 3.11 test matrices so the fix can be restored with confidence.
Changes included in v0.1.13.dev7 (since v0.1.13.dev6)
Features and enhancements
- rlm_harness: add `summarize_at_tokens`, drop `rlm_max_turns_in_context` (#1236)
Fixes and maintenance
- Revert "types: TaskSet.filter / .take return Self, not TaskSet (#1232)" (#1237)
- ci: add Python 3.10 and 3.11 to the test matrix (#1237)
Full Changelog: v0.1.13.dev6...v0.1.13.dev7
v0.1.13.dev6
Verifiers v0.1.13.dev6 Release Notes
Date: 04/23/2026
Highlights since v0.1.13.dev5
- `rlm_harness` is now the single source of truth for RLM_* sandbox env vars. New kwargs `rlm_max_turns`, `rlm_max_turns_in_context`, and `rlm_exec_timeout` map 1:1 onto the matching env vars on `Harness.environment_vars` and merge into the sandbox via `ComposableEnv.build_env_vars` (harness wins). Research envs can stop setting these via `ComposableEnv(environment_vars=…)` and pass them through as harness kwargs instead.
- `TaskSet.filter`/`.take` now return `Self`, not `TaskSet`, so subclass types survive taskset chaining for downstream typed consumers.
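The "harness wins" merge semantics can be sketched with plain dicts; the function name here is an assumption for illustration, not the actual `ComposableEnv.build_env_vars` implementation:

```python
def merge_env_vars(env_level: dict[str, str], harness_level: dict[str, str]) -> dict[str, str]:
    """Hypothetical sketch: merge env-level and harness-level env vars.

    On conflict, the harness-provided value wins, matching the
    harness-wins behavior described in the release notes.
    """
    merged = dict(env_level)
    merged.update(harness_level)  # later update wins on shared keys
    return merged
```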
Changes included in v0.1.13.dev6 (since v0.1.13.dev5)
Features and enhancements
- rlm_harness: own RLM_MAX_TURNS / _IN_CONTEXT / _EXEC_TIMEOUT env vars (#1229)
Fixes and maintenance
- types: TaskSet.filter / .take return Self, not TaskSet (#1232)
Full Changelog: v0.1.13.dev5...v0.1.13.dev6
v0.1.13.dev5
Verifiers v0.1.13.dev5 Release Notes
Date: 04/22/2026
Highlights since v0.1.13.dev4
- Made the interception proxy's streaming response resilient to upstream cuts: 10s SSE keepalive comments keep idle streams warm, a per-chunk `asyncio.sleep(0)` forces an event-loop yield so content and close can't race the transport flush under warmup-burst contention, and transport exceptions at prepare/write/write_eof are surfaced as `StreamInterrupted` into `state["error"]` so rollouts reschedule instead of looking like clean zero-turn completions.
- Added a new experimental `mini_swe_agent` composable harness (pip/uv install with SHA256-verified wheel download), exported alongside the existing `rlm` and `opencode` harnesses.
- Extended `SandboxMixin` to cover VM sandboxes in addition to containers (including GPU VMs via `CreateSandboxRequest`), with documentation clarifying feature parity (file I/O, background jobs, cleanup) and container-only features (port exposure, SSH).
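The per-chunk yield and error-surfacing pattern in the first bullet can be sketched as follows. This is a simplified stand-in, assuming a callable `write` and a mutable `state` dict; the real proxy's transport handling differs.

```python
import asyncio


class StreamInterrupted(Exception):
    """Marker for an upstream transport cut, recorded in state['error']."""


async def relay(chunks, write, state):
    """Hypothetical sketch: forward chunks, yielding after each write.

    asyncio.sleep(0) forces an event-loop yield per chunk so the final
    content write and the stream close cannot race the transport flush.
    Transport failures become StreamInterrupted in state['error'] instead
    of looking like a clean zero-turn completion.
    """
    try:
        for chunk in chunks:
            write(chunk)            # may raise if the transport died mid-stream
            await asyncio.sleep(0)  # cooperative yield to the event loop
    except ConnectionError as exc:
        state["error"] = StreamInterrupted(str(exc))
```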
Changes included in v0.1.13.dev5 (since v0.1.13.dev4)
Latest changes from main
- Includes the latest `main` changes through the interception proxy streaming resilience fix (#1194), along with the `mini_swe_agent` harness (#1219) and `SandboxMixin` VM sandbox support/docs (#1222).
Features and enhancements
Fixes and maintenance
- fix: make interception proxy streaming resilient to upstream cuts (#1194)
Full Changelog: v0.1.13.dev4...v0.1.13.dev5
v0.1.13.dev4
Verifiers v0.1.13.dev4 Release Notes
Date: 04/22/2026
Highlights since v0.1.13.dev3
- RLM harness: new `rlm_tools` kwarg sets both `Harness.tool_names` (for `ToolMonitorRubric`) and the sandbox `RLM_TOOLS` env var from a single source, plus a new `Harness.environment_vars` field merged harness-wins-on-conflict by `ComposableEnv`.
- Refactored experimental RLM checkout caching; `DEFAULT_RLM_BRANCH` renamed to `DEFAULT_RLM_REF` and `rlm_harness(..., rlm_branch=...)` renamed to `rlm_ref=` to reflect that any git ref (branch, tag, sha) is accepted.
- Added a `SandboxTimeouts` dataclass centralizing per-operation sandbox HTTP timeouts.
- Expanded task coverage with SWE-rebench-V2 and a multilingual SWESmith taskset, plus a `filter_fn` kwarg on all tasksets for ad-hoc row filtering.
- vf-eval: renamed `-d`/`--debug` to `--disable-tui` and `--tui` to `--fullscreen` for clearer intent.
- RLM rollout metrics (context tokens, programmatic tool calls) exposed to verifiers and auto-merged by the composable env.
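The `filter_fn` kwarg described above is a row predicate applied at load time. A minimal sketch of the idea, with an assumed helper name and plain-dict rows rather than the library's actual taskset types:

```python
def load_rows(rows, filter_fn=None):
    """Hypothetical sketch: apply an ad-hoc row predicate at load time.

    filter_fn takes one row (here a plain dict) and returns True to keep it,
    mirroring the filtering the taskset kwarg enables.
    """
    if filter_fn is None:
        return list(rows)
    return [row for row in rows if filter_fn(row)]
```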
Changes included in v0.1.13.dev4 (since v0.1.13.dev3)
Features and enhancements
- vf-eval: replace -d/--debug with --disable-tui, rename --tui to --fullscreen (#1183)
- Expose RLM metrics to verifiers (#1195)
- Add streaming observability + resume to TaskSet.validate() (#1169)
- Refactor experimental RLM checkout caching (#1202)
- feat: add filter_fn kwarg to all tasksets for ad-hoc row filtering (#1199)
- feat: add multilingual SWESmithTaskSet (#1186)
- feat: add SWE-rebench-V2 TaskSet (#1187)
- HarborMCPMixin (#1146)
- feat: SandboxTimeouts dataclass — centralize per-operation sandbox HTTP timeouts (#1207)
- Run SWE-Lego eval via dataset's canonical test_cmd (#1205)
- Authenticate interception server via INTERCEPTION_SECRET (#1180)
- feat: revert agent test edits at grading (swe_lego, swe_rebench_v2) (#1212)
- AgentError: rollout_id, sandbox_id, ... (#1218)
- Remove RLM_DEFAULT_TOOL_NAMES, accept rlm_tools (#1223)
- r2e_gym: add hide_tests_from_agent flag + expose instance_id/repo aliases (#1208)
- feat(rlm): upload a /usr/local/bin/git shim, gated by `allow_git` (#1225)
Fixes and maintenance
- Keep harness metrics merge inside experimental composable env (#1201)
- Propagate typed exceptions from SWE/Harbor validate_instance (#1204)
- fix: pass explicit 60s timeout to get_background_job in poll_job_completion (#1206)
- fix: bump opencode harness default release to v1.1.63-rl2 (#1184)
- validate(): extract resume-file parsing into a named helper (#1209)
- fix: SandboxTimeouts fields must be int (sidecar deserializes as u64) (#1210)
- fix: respect framework-injected OPENAI_API_KEY in RLM and opencode harnesses (#1213)
- fix: offload composable _upload_dir tar build to thread (#1224)
Full Changelog: v0.1.13.dev3...v0.1.13.dev4
v0.1.13.dev3
Verifiers v0.1.13.dev3 Release Notes
Date: 04/19/2026
Highlights since v0.1.13.dev2
- Propagated interception-stream write failures into rollout state as `StreamInterrupted` so truncated agent streams no longer surface as silent clean exits.
- Made RLM checkout resolution lazy in the composable harness, so loading RLM-based environments no longer clones the private checkout up front.
Changes included in v0.1.13.dev3 (since v0.1.13.dev2)
Features and enhancements
- make download of rlm lazy (#1192)
Fixes and maintenance
- fix: propagate interception-stream cuts into rollout state (#1191)
Full Changelog: v0.1.13.dev2...v0.1.13.dev3
v0.1.13.dev2
Verifiers v0.1.13.dev2 Release Notes
Date: 04/19/2026
Highlights since v0.1.13.dev1
- Added richer token usage reporting, including final-context token metrics, updated displays, and API docs.
- Expanded `vf-tui` compare mode so you can inspect any numeric metric with inline selection and adaptive bucketing.
- Improved composable/RLM harness integration with harness-owned upload dirs, cached local RLM checkouts, and auto-registered tool monitoring from `harness.tool_names`.
- Surfaced CLI agent crashes as infra errors even after prior turns, and now include full traces in agent error logs for debugging.
- Removed dead RLM tool config constants from the composable harness exports.
Changes included in v0.1.13.dev2 (since v0.1.13.dev1)
Features and enhancements
- Better token count metrics (#1108)
- vf-tui: compare for all metrics (#1117)
- harness.get_upload_dirs; reduce rlm github requests (#1178)
- tool-env: ToolMonitorRubric takes tool_names instead of tools (#1179)
- feat: auto-register ToolMonitorRubric from harness.tool_names (#1181)
- feat: include full trace in agent error logs (#1185)
Fixes and maintenance
- cli-agent: surface agent crashes as infra errors after any turn (#1177)
- Remove dead RLM tool config constants (#1189)
Full Changelog: v0.1.13.dev1...v0.1.13.dev2
v0.1.13.dev1
Verifiers v0.1.13.dev1 Release Notes
Date: 04/18/2026
Highlights since v0.1.12
- Added the SWELego-Real TaskSet (PrimeIntellect fork, filtered upstream) for broader SWE benchmark coverage.
- Added a `timeout_minutes` kwarg to the R2E, SWEBench, Multi-SWE, and OpenSWE tasksets for finer-grained per-task timeout control.
- Surfaced agent timeout as `state['error']` in `CliAgentEnv` so timeouts are visible in eval results.
- Fixed the `CliAgentEnv` poll loop to honor `self.poll_interval` consistently.
- Bumped `prime-sandboxes` to 0.2.20.
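The interaction of a per-task timeout with a poll interval can be sketched generically; this is an illustrative helper, not `CliAgentEnv`'s actual loop, and the function name is an assumption:

```python
import time


def poll_until(check, timeout_minutes: float, poll_interval: float) -> bool:
    """Hypothetical sketch: poll `check` until it succeeds or time runs out.

    Honors poll_interval between checks (the dev1 fix) and gives up after
    timeout_minutes (the new taskset kwarg), returning False on timeout so
    the caller can surface it as an error.
    """
    deadline = time.monotonic() + timeout_minutes * 60
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(poll_interval)  # wait the configured interval, not a hardcoded one
    return False
```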
Changes included in v0.1.13.dev1 (since v0.1.12)
Features and enhancements
- feat: add SWELego-Real TaskSet (PrimeIntellect fork, filtered upstream) (#1149)
- feat: add `timeout_minutes` kwarg to R2E / SWEBench / Multi-SWE / OpenSWE tasksets (#1171)
Fixes and maintenance
- fix: surface agent timeout as `state['error']` in `CliAgentEnv` (#1170)
- fix: honor `self.poll_interval` in `CliAgentEnv` poll loop (#1173)
- chore: bump prime-sandboxes to 0.2.20 (#1174)
Full Changelog: v0.1.12...v0.1.13.dev1
v0.1.12
Verifiers v0.1.12 Release Notes
Date: 04/17/2026
Full Changelog: v0.1.11...v0.1.12
Highlights since v0.1.11
- Landed a new composable Task/Agent/Environment architecture and upstreamed the opencode/RLM harnesses and swe/lean/math/cp/harbor tasksets into `verifiers.envs.experimental.composable`, so downstream environments can depend on them directly instead of via research-environments.
- Major `RLMEnv` overhaul: new `RLMPromptBuilder`, context dropping with `summarize_turns` and `max_turns_in_context`, a sub-LLM toggle, removed RLM-internal branding from model-visible prompts, richer metrics, hardened root-tool transport (no unsafe pickle), and a reworked harness install flow that runs from a uv workspace checkout.
- Runtime performance and reliability improvements including executor autoscaling, incremental metrics, threaded file I/O, event loop lag monitoring, multi-worker env server support, GC tuning before accepting requests, `setproctitle` labels, dead-tunnel auto-recovery in `CliAgentEnv`, and safer task cancellation paths.
- Richer `vf-tui` with a log viewer, run comparison mode, toggleable markdown/reasoning rendering, rollout and unique-prompt counts with responsive layout, and saved-state columns in the info view.
- Expanded evaluation ergonomics with a configurable `output_dir`, `[[ablation]]` model/endpoint overrides, `max_total_tokens` for `MultiTurnEnv`, `extra_headers_from_state` and `headers`/`extra_headers` support in `endpoints.toml`, `X-Session-ID` for DP-aware routing, preserved multimodal media in saved results, and exported eval parser/normalization helpers for Prime CLI reuse.
- New Hosted Evaluations docs plus an environment performance guide, a refreshed BrowserEnv README, and updated Secrets/Hub guidance across docs and agent skills.
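The header-related ergonomics above (per-rollout extra headers plus a session id for DP-aware routing) can be sketched as a simple merge; the function name and state layout here are assumptions, not the library's actual API:

```python
def build_request_headers(base, state, session_id=None):
    """Hypothetical sketch: assemble request headers for one rollout.

    Merges extra headers carried in rollout state over the base headers,
    then attaches an X-Session-ID for DP-aware routing when provided.
    """
    headers = dict(base)
    headers.update(state.get("extra_headers", {}))  # per-rollout overrides win
    if session_id is not None:
        headers["X-Session-ID"] = session_id
    return headers
```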
Changes included in v0.1.12 (since v0.1.11)
Clients, RLM, and rollout execution
- feat: composable Task/Agent/Environment architecture (#1067)
- upstream `RlmComposableEnv` into `ComposableEnv`/`TaskSet`/`Harness` (#1158)
- move harnesses (rlm, opencode) and tasksets (swe, lean, math, cp, harbor) into `verifiers.envs.experimental.composable` (#1131)
- RLM: `RLMPromptBuilder` (#1070)
- RLM: context dropping & summarization (#1072)
- replace `remove_conversation_turns` with `summarize_turns` standard tool (#1095)
- add `max_turns_in_context`, fix answer extraction, document metrics (#1099)
- RLM: inform model about `max_turns_in_context` limit in scaffolding (#1111)
- remove RLM branding from model-visible prompts and messages (#1089)
- change `tools` arg to pass standard tools to root LLM (#1087)
- add `enable_sub_llms` toggle to `RLMEnv` (#1085)
- simplify RLM message transcript handling (#1116)
- RLM: improve prompts and metrics (#1102)
- refactor: rename RLM metrics for consistency (#1086)
- remove token/timing info from `llm_batch` output and add `max_turns` metric (#1098)
- replace timing info in RLM REPL output with root tool time metrics (#1097)
- harden RLM root-tool transport to remove unsafe pickle deserialization (#1104)
- RLM: remove dead code, harden tunnels (#1107)
- run RLM harness from a uv workspace checkout (#1139)
- revert inline install, use rlm's `install.sh` (#1144)
- clone via git protocol instead of fetching `install.sh` (#1159)
- update RLM harness test to match git-clone install script (#1160)
- RLM harness: install from arbitrary branch (#1153)
- fix RLM harness to use per-example `AGENT_WORKDIR` (#1143)
- port rlm harness dedup install script fix (#1133)
- set `RLM_KERNEL_PYTHON` to sandbox `.venv` for inline imports (#1145)
- guard `RLM_KERNEL_PYTHON` on successful ipykernel install (#1150)
- pin `ipykernel<7` for older sandbox Pythons (#1151)
- fix RLM bash timeout (#1079)
- fix: handle `CommandTimeoutError` in `RLMEnv` (#1069)
- deprecate `RolloutGatewayMixin` (#1017)
- add `NeMoRLChatCompletionsClient` for NeMo Gym model servers (#1141)
- feat: send `X-Session-ID` header during eval for DP-aware routing (#1137)
- feat: add `extra_headers_from_state` to `ClientConfig` (#1048)
- fix: handle `None` prompt/completion token ids in `parse_tokens` (#1066)
- fix tool args passing (#1106)
- fix TITO in opencode envs: bridge extraction, truncation gate, tool-call handling (#1005)
- fix: remove content `rstrip` in `normalize_response` to preserve TITO prefix match (#1081)
Env server, sandbox, and runtime reliability
- feat: multi env worker (#1055)
- fix: propagate `json_logging` to env workers (#1138)
- tune GC on env server before accepting requests (#1022)
- feat: set process titles on env server and workers (#1082)
- perf: executor autoscaling (#1039)
- perf: incremental metrics (#1036)
- perf: offload file I/O to thread pool (#1037)
- feat: improve event loop lag monitor (#1038)
- fix `get_free_port_pair()` TOCTOU race condition (#1013)
- fix: task cancellation race + RLM sandbox workers (#1035)
- fix: call `uncancel()` after catching `CancelledError` in `process_request` (#1047)
- fix cancelled + serialize error (#1044)
- detect server-side tunnel death and auto-recreate in `CliAgentEnv` (#1127)
- fix `AgentError` double-wrapping in `poll_job_completion` (#1130)
- fix: clear root logger handlers hijacked by swebench import (#1163)
- use SDK read-file endpoint and bg job handling in `SandboxMixin` (#1084)
Evaluation UX, metrics, and configuration
- `vf-tui`: log viewer (#1075)
- `vf-tui`: fixes & features (including comparison mode and markdown/reasoning toggles) (#1007)
- `vf-tui`: show rollouts and unique prompts, better dynamic width (#1060)
- show saved state columns in TUI info view (#1091)
- make `output_dir` configurable in evals (#1029)
- handle ablation `model` and `endpoint_id` overrides (#1135)
- export eval parser and normalization helpers for Prime CLI reuse (#1135)
- feat: add `max_total_tokens` parameter to `MultiTurnEnv` (#1101)
- support `headers`/`extra_headers` in `endpoints.toml` (#1051)
- preserve multimodal media in saved eval results (#1015)
- fix display of custom sampling args (#1025)
- fix output dir logging (#1041)
- fix host-side eval in composable CP wrapper parsing (#1165)
- fix composable `mkdir` path quoting (#1110)
Environments, multimodality, and integrations
- add `BrowserEnv` integration README (#1020)
- `opencode_rlm_env` (#1023)
- misc improvements to opencode envs (#999)
- perf improvs for opencode envs + math rubric (#1034)
- opencode envs (including `CliAgentEnv` hardening, hybrid math rubric overhaul, and log capture) (#1005)
- fix: revert `opencode_env` config regression and move RLM logic out of `cli_agent_env` (#1042)
- fix opencode config for model names without slash (#1114)
- feat: dataset builder pattern for lazy loading in all environments (#1064)
- add cleanup and teardown lifecycle hooks to `Rubric` (#1026)
- remove redundant msg normalization + align `env_response` API (#1027)
- chore: reuse math rubric in hybrid math rubric (#1043)
- perf: math rubric skip overlong answers (#1046)
- fix math rubric timeout (#1096)
- lazily import packages (#1019)
- fix: env tests (#1061)
Docs, CLI, and tooling
- docs: add performance guide for environments (#1045)
- add hosted evaluations section to eval docs (#1040)
- update Secrets guidance (BrowserBase README) (#1056)
- docs: prefix `prime eval` models (#1125)
- `tomllib`/`tomli` guard for Python 3.10 (#1136)
- pin `regex<2026.4.4` (missing cp312/cp313 wheels) (#1109)
- pin uv `<0.11.0` to fix flash-attn resolution (#1057)
- bump uv requirement to `>=0.11.1` (#1112)
v0.1.12.dev6
Verifiers v0.1.12.dev6 Release Notes
Date: 04/15/2026
Highlights since v0.1.12.dev5
- Exported the eval parser and normalization helpers so Prime CLI can reuse Verifiers eval parsing and config loading.
- Fixed eval ablation config overrides so explicit `model` and `endpoint_id` settings behave correctly.
- Propagated `json_logging` from EnvServer to env workers so worker logs stay structured when JSON logging is enabled.
- Fixed the RLM harness install flow to avoid duplicate installs and restored Harbor TOML loading on Python 3.10.
Changes included in v0.1.12.dev6 (since v0.1.12.dev5)
Features and enhancements
- feat: export eval parser and normalization helpers for Prime CLI reuse (#1135)
- fix: handle ablation `model` and `endpoint_id` overrides in eval configs (#1135)
Fixes and maintenance
- fix: propagate `json_logging` to env workers (#1138)
- fix: port RLM harness dedup install script update (#1133)
- fix: add `tomli` fallback for Harbor on Python 3.10 (#1136)
Full Changelog: v0.1.12.dev5...v0.1.12.dev6
v0.1.12.dev5
Verifiers v0.1.12.dev5 Release Notes
Date: 04/14/2026
Highlights since v0.1.12.dev4
- Moved all harnesses and tasksets from research-environments into verifiers proper. Environments can now import directly from `verifiers.envs.experimental.composable` instead of relying on relative path dependencies.
Changes included in v0.1.12.dev5 (since v0.1.12.dev4)
Features and enhancements
- feat: move harnesses (rlm, opencode) and tasksets (swe, lean, math, cp, harbor) into `verifiers.envs.experimental.composable` (#1131)
Full Changelog: v0.1.12.dev4...v0.1.12.dev5