Fix grader compatibility with OpenClaw transcripts #86
jijivski wants to merge 3 commits into pinchbench:main
ScuttleBot
left a comment
ScuttleBot review 🦀
Solid defensive fix. The grader was too rigid about transcript formats, causing false negatives on valid runs.
What's good:
- `_coerce_score_value()` handles the full zoo of judge response formats (nested dicts, string numbers, boolean rejection)
- Supporting `file` alongside `path`/`file_path` aligns with how OpenClaw actually emits tool calls
- The refactor into `_extract_named_scores()` and `_extract_total_score()` is cleaner than the previous inline conditionals
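To make the first point concrete, here is a minimal sketch of what this kind of score coercion could look like. It is not the PR's actual `_coerce_score_value()`; the function name `coerce_score`, the nested keys `score` and `value`, and the reading of "boolean rejection" as "booleans are never treated as numbers" are all assumptions made for illustration.

```python
from typing import Any, Optional


def coerce_score(value: Any) -> Optional[float]:
    """Hypothetical helper: turn a judge-supplied value into a float, or None."""
    # bool is a subclass of int in Python, so reject it before the numeric check.
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Judges sometimes return numbers as strings, e.g. "0.75".
        try:
            return float(value.strip())
        except ValueError:
            return None
    if isinstance(value, dict):
        # Assumed nesting keys; the real grader may look for different ones.
        for key in ("score", "value"):
            if key in value:
                return coerce_score(value[key])
    return None
```

Returning `None` rather than `0.0` for unparseable input keeps "the judge gave no usable score" distinct from "the judge scored zero", which matters for the false-negative cases described above.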
One question:
- Task file changes (task_08, task_10, task_18): are these tested against transcripts from multiple agents? The `file` param support looks correct, but I want to confirm this doesn't break Cursor/Windsurf/Claude Code grading.
Otherwise LGTM. This will reduce the "score 0 but the agent clearly did the work" cases.
Merge conflict resolution available
I've rebased this PR onto main and resolved the conflict in
Resolution: Keep both
@jijivski, could you rebase your branch onto main? The resolution is straightforward:
    git fetch upstream
    git rebase upstream/main
    # Resolve lib_grading.py by keeping both function sets
    git add scripts/lib_grading.py
    git rebase --continue
    git push --force-with-lease
Alternatively, @olearycrew has admin access and can use GitHub's "Update branch" button if the repo allows maintainer edits on this PR.
@jijivski can you take a look at the conflicts here?
olearycrew
left a comment
@jijivski can you fix the issues from the linter as well as fix the conflicts with main? Thanks!
Code Review Summary
Status: No Issues Found | Recommendation: Merge
This PR is well-implemented. The refactoring of
Files Reviewed (4 files)
Reviewed by claude-4.6-sonnet-20260217 · 157,987 tokens
@olearycrew Hi, I've synced with latest main in one commit, then fixed judge transcript parsing in a follow-up commit. For Claude Opus 4.5 multi-part judge transcripts, we now prefer the final assistant judgment JSON over earlier echoed tool JSON / waiting messages. Verified locally on the previously failing transcripts.
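One way to picture "prefer the final assistant judgment JSON" is the rough sketch below. It is not the code from the follow-up commit: the function name `find_final_judgment`, the input shape (a list of assistant message texts in transcript order), and the keys `scores` and `total` used to recognize a judgment are all assumptions.

```python
import json
from typing import Any, Dict, List, Optional

# Assumed markers of a judgment object; the real grader may use other keys.
JUDGMENT_KEYS = ("scores", "total")


def find_final_judgment(assistant_texts: List[str]) -> Optional[Dict[str, Any]]:
    """Return the last judgment-like JSON object, scanning newest message first."""
    decoder = json.JSONDecoder()
    for text in reversed(assistant_texts):
        found = None
        idx = text.find("{")
        while idx != -1:
            try:
                # raw_decode parses one JSON value starting at idx and
                # returns it together with the index where it ended.
                obj, end = decoder.raw_decode(text, idx)
            except json.JSONDecodeError:
                idx = text.find("{", idx + 1)
                continue
            if isinstance(obj, dict) and any(k in obj for k in JUDGMENT_KEYS):
                found = obj  # keep the last match within this message
            idx = text.find("{", end)
        if found is not None:
            return found
    return None
```

Scanning in reverse means echoed tool JSON and "waiting" messages earlier in the transcript are only reached if no later message carries a judgment, and even then they are skipped unless they contain one of the assumed keys.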
Improve grader compatibility with current OpenClaw transcripts
The grader currently assumes a narrower transcript format than the one produced by the current OpenClaw runtime, which can lead to false negatives.
Changes:
- `toolCall.arguments`
- `file` alongside `path`/`file_path`
These changes do not alter task requirements; they only make grading align with real transcript output.
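As an illustration of the `file` alias, a task grader could resolve the target path roughly as follows. This is a sketch under assumptions, not the code in the task files: the helper name `extract_target_path`, the lookup order, and the idea that a tool call is a dict with an `arguments` dict are hypothetical.

```python
from typing import Any, Dict, Optional

# Keys named in the PR description; the lookup order here is an assumption.
PATH_KEYS = ("path", "file_path", "file")


def extract_target_path(tool_call: Dict[str, Any]) -> Optional[str]:
    """Hypothetical helper: find the file a tool call operates on."""
    # Assumed shape: the call's parameters live in a dict under "arguments".
    args = tool_call.get("arguments")
    if not isinstance(args, dict):
        return None
    for key in PATH_KEYS:
        value = args.get(key)
        if isinstance(value, str) and value:
            return value
    return None
```

Because `path` and `file_path` are still checked first, a transcript that already graded correctly with those keys would resolve to the same path under this sketch; only calls that carry `file` alone gain a match.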