feat: implement new_session:true support for multi-turn task session β¦#330
Open
HaotianChen616 wants to merge 1 commit intopinchbench:mainfrom
Open
Conversation
β¦isolation - Implement new_session: true in lib_agent.py: when a session entry has new_session: true, archive the current transcript, clean up the agent's session state, and generate a new session_id so the agent starts with no conversation history (workspace files are preserved) - Add _archive_transcript() helper for per-session transcript preservation before session resets - Merge all archived session transcripts after all sessions complete so the grading engine can inspect the full conversation history - Document sessions and new_session fields in TASK_TEMPLATE.md with a dedicated Multi-Session Tasks section and updated author checklist - Add task_iterative_code_refine: a 3-session iterative refinement task demonstrating multi-turn conversation and new_session isolation - Add test_multi_session.py with 11 tests covering frontmatter parsing, new_session flag detection, transcript archiving, and integration
Author
Contributor
Code Review SummaryStatus: No Issues Found | Recommendation: Merge This PR is well-structured and correctly implements Files Reviewed (6 files)
Reviewed by claude-4.6-sonnet-20260217 Β· 153,273 tokens |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
feat: implement full multi-turn user prompt support with session isolation
Summary
PinchBench currently has partial multi-session support β the
sessionsfield in task frontmatter is processed to send sequential prompts, but thenew_session: trueflag (already defined intask_second_brain.md) is completely ignored by the Python codebase. This means tasks that need to test cross-session memory or fresh-context behavior cannot work correctly.This PR implements full multi-turn user prompt support by:
Implementing
new_session: truesession isolation inlib_agent.pyβ when a session entry hasnew_session: true, the agent's session state is cleaned up and a newsession_idis generated, simulating a user returning after closing the agent. The workspace (and any files created) is preserved across sessions.Adding transcript archiving per session β before starting a new session, the current session's transcript is archived. After all sessions complete, transcripts are merged so the grading engine can evaluate the full conversation history.
Documenting multi-session tasks in
TASK_TEMPLATE.mdβ adds a complete "Multi-Session Tasks" section with field descriptions, usage guidelines, and YAML examples.Adding two new multi-turn tasks:
task_iterative_code_refine.mdβ tests iterative code refinement across 3 sessions, with the final session usingnew_session: trueto verify the agent can work from file-based context alonetask_session_chain_analysis.mdβ tests 4-turn structured code analysis (single-session multi-turn) where the agent reads TypeScript source files, produces JSON chain maps, designs minimal changes, extracts code evidence, and writes a delivery summaryAdding tests β
test_multi_session.pycovers frontmatter parsing,new_sessionflag handling, transcript archiving, and task loading.Changes
scripts/lib_agent.py_archive_transcript()helper to save per-session transcripts before session resetsexecute_openclaw_task()multi-session loop to:current_session_idseparately (changes whennew_session: true)cleanup_agent_sessions()and generate a new session ID whennew_session: truenew_sessionwas usedtasks/TASK_TEMPLATE.mdsessionsandnew_sessionYAML examples to the frontmatter templatetasks/task_iterative_code_refine.md(new)calculator.pyandreview.txttasks/task_session_chain_analysis.md(new)new_session): structured ingest β minimal design β evidence extraction β delivery summarysession.ts,session-store.ts,delivery.ts,agent-command.ts) copied totasks/assets/session_chain/tasks/assets/session_chain/(new)session.tsβ OpenClaw session resolution logic (resolveSession,resolveSessionKeyForRequest)session-store.tsβ Session store update logic (updateSessionStoreAfterAgentRun)delivery.tsβ Agent command delivery logic (deliverAgentCommandResult)agent-command.tsβ Agent command execution pipeline (prepareAgentCommandExecution,agentCommandInternal)tasks/manifest.yamltask_iterative_code_refineandtask_session_chain_analysisto the task listtests/test_multi_session.py(new)TestMultiSessionFrontmatterParsingβ sessions list, new_session flag, string entriesTestNewSessionHandlingβ archive-before-cleanup behavior verificationTestArchiveTranscriptβ transcript archiving with/without transcript pathTestMultiSessionTaskLoadingβ integration test verifying task_second_brain and task_iterative_code_refine load correctlyMotivation
Real-world AI coding agents are used in multi-turn conversations. Users:
PinchBench should test all of these scenarios. The existing
task_second_brain.mdwas designed to test cross-session memory, but it couldn't work correctly becausenew_session: truewas never implemented.Testing
Backward Compatibility
new_session: No change in behavior (all prompts share one session, same as before)new_session: true: Now correctly isolates sessions. Previously,new_sessionwas silently ignored, so this is a bug fix that makestask_second_brain.mdwork as originally intended.