feat: implement new_session:true support for multi-turn task session …#330

Open
HaotianChen616 wants to merge 1 commit into pinchbench:main from
HaotianChen616:feat/multi-session-new-session-support
@HaotianChen616

feat: implement full multi-turn user prompt support with session isolation

Summary

PinchBench currently has partial multi-session support: the sessions field in task frontmatter is processed to send sequential prompts, but the new_session: true flag (already defined in task_second_brain.md) is completely ignored by the Python codebase. As a result, tasks that need to test cross-session memory or fresh-context behavior cannot work correctly.

This PR implements full multi-turn user prompt support by:

  1. Implementing new_session: true session isolation in lib_agent.py. When a session entry has new_session: true, the agent's session state is cleaned up and a new session_id is generated, simulating a user returning after closing the agent. The workspace (and any files created) is preserved across sessions.

  2. Adding transcript archiving per session. Before starting a new session, the current session's transcript is archived. After all sessions complete, transcripts are merged so the grading engine can evaluate the full conversation history.

  3. Documenting multi-session tasks in TASK_TEMPLATE.md, which gains a complete "Multi-Session Tasks" section with field descriptions, usage guidelines, and YAML examples.

  4. Adding two new multi-turn tasks:

    • task_iterative_code_refine.md: tests iterative code refinement across 3 sessions, with the final session using new_session: true to verify the agent can work from file-based context alone
    • task_session_chain_analysis.md: tests 4-turn structured code analysis (single-session multi-turn) where the agent reads TypeScript source files, produces JSON chain maps, designs minimal changes, extracts code evidence, and writes a delivery summary
  5. Adding tests: test_multi_session.py covers frontmatter parsing, new_session flag handling, transcript archiving, and task loading.
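
The session-isolation behavior described in item 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in lib_agent.py: run_sessions, send_prompt, and reset_agent are hypothetical names, and the entry shape (a dict with a prompt key and an optional new_session key) is assumed for the example.

```python
import uuid
from typing import Callable


def run_sessions(
    sessions: list[dict],
    send_prompt: Callable[[str, str], None],
    reset_agent: Callable[[], None],
) -> list[str]:
    """Send prompts in order, switching to a fresh session ID when requested.

    Hypothetical sketch: send_prompt(session_id, prompt) and reset_agent()
    stand in for whatever the real harness calls (for example, the
    cleanup_agent_sessions() step named in this PR, whose signature is
    not shown here).
    """
    current_session_id = str(uuid.uuid4())
    used_ids: list[str] = []
    for entry in sessions:
        if entry.get("new_session"):
            # Drop conversation state only; workspace files are not touched.
            reset_agent()
            current_session_id = str(uuid.uuid4())
        send_prompt(current_session_id, entry["prompt"])
        used_ids.append(current_session_id)
    return used_ids
```

Entries without new_session keep reusing the same ID, which matches the shared-session behavior that multi-session tasks had before this change.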

Changes

scripts/lib_agent.py

  • Added _archive_transcript() helper to save per-session transcripts before session resets
  • Modified execute_openclaw_task() multi-session loop to:
    • Track current_session_id separately (changes when new_session: true)
    • Call cleanup_agent_sessions() and generate a new session ID when new_session: true
    • Log session transitions clearly
  • After all sessions complete, merge archived transcripts with the final session transcript when new_session was used
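
A rough sketch of the archive-then-merge flow described above. The helper names (archive_transcript, merge_transcripts), the archive file naming, and the assumption that transcripts are line-oriented JSONL files are all illustrative; the actual _archive_transcript() in this PR may differ.

```python
import shutil
from pathlib import Path
from typing import Optional


def archive_transcript(
    transcript: Optional[Path], archive_dir: Path, index: int
) -> Optional[Path]:
    """Copy the current transcript aside before the session is reset.

    Returns the archive path, or None when there is nothing to archive.
    """
    if transcript is None or not transcript.exists():
        return None
    archive_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per archived session.
    target = archive_dir / f"session_{index:02d}.jsonl"
    shutil.copy2(transcript, target)
    return target


def merge_transcripts(archives: list[Path], final: Optional[Path]) -> list[str]:
    """Concatenate archived transcripts, then the final one, in session order."""
    lines: list[str] = []
    for path in [*archives, final]:
        if path is not None and path.exists():
            lines.extend(path.read_text(encoding="utf-8").splitlines())
    return lines
```

Copying before cleanup matters because the reset step may delete or overwrite the live transcript file; merging in archive order keeps the history readable top to bottom for the grader.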

tasks/TASK_TEMPLATE.md

  • Added sessions and new_session YAML examples to the frontmatter template
  • Added "Multi-Session Tasks" documentation section with field table, how-it-works explanation, and usage guidelines
  • Added multi-session checklist items
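
To illustrate how sessions entries might be interpreted once the frontmatter has been parsed, here is a small sketch. It assumes entries are either plain strings or mappings with a prompt key and an optional new_session key; the prompt key name and the normalize_sessions helper are assumptions made for the example, not names confirmed by TASK_TEMPLATE.md.

```python
from typing import Any


def normalize_sessions(frontmatter: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn a parsed `sessions` list into uniform {prompt, new_session} dicts."""
    normalized: list[dict[str, Any]] = []
    for entry in frontmatter.get("sessions") or []:
        if isinstance(entry, str):
            # A bare string is treated as a prompt in the current session.
            normalized.append({"prompt": entry, "new_session": False})
        elif isinstance(entry, dict):
            normalized.append(
                {
                    "prompt": str(entry.get("prompt", "")),
                    "new_session": bool(entry.get("new_session", False)),
                }
            )
        else:
            raise TypeError(f"unsupported session entry: {entry!r}")
    return normalized
```

Defaulting new_session to False keeps tasks that never mention the flag on their existing shared-session path.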

tasks/task_iterative_code_refine.md (new)

  • 3-session task: initial implementation → add error handling → fresh-session review
  • Tests iterative refinement and cross-session file-based context retrieval
  • Automated grading checks both calculator.py and review.txt

tasks/task_session_chain_analysis.md (new)

  • 4-turn task (four sessions entries sharing a single conversation context, no new_session): structured ingest → minimal design → evidence extraction → delivery summary
  • Source files from the OpenClaw agent command pipeline (session.ts, session-store.ts, delivery.ts, agent-command.ts) copied to tasks/assets/session_chain/
  • Tests the agent's ability to maintain context across turns, produce structured JSON, extract precise code evidence, and generate traceable design review artifacts
  • Automated grading checks chain_map stages, evidence_index, function references, JSON validity, and traceability

tasks/assets/session_chain/ (new)

  • session.ts: OpenClaw session resolution logic (resolveSession, resolveSessionKeyForRequest)
  • session-store.ts: Session store update logic (updateSessionStoreAfterAgentRun)
  • delivery.ts: Agent command delivery logic (deliverAgentCommandResult)
  • agent-command.ts: Agent command execution pipeline (prepareAgentCommandExecution, agentCommandInternal)

tasks/manifest.yaml

  • Added task_iterative_code_refine and task_session_chain_analysis to the task list

tests/test_multi_session.py (new)

  • TestMultiSessionFrontmatterParsing: sessions list, new_session flag, string entries
  • TestNewSessionHandling: archive-before-cleanup behavior verification
  • TestArchiveTranscript: transcript archiving with/without transcript path
  • TestMultiSessionTaskLoading: integration test verifying task_second_brain and task_iterative_code_refine load correctly
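
As an illustration of what an archive-before-cleanup check can look like, the sketch below records call order with unittest.mock. The reset_with_archive function is a hypothetical stand-in for the reset step; it is not the code under test in test_multi_session.py.

```python
from unittest.mock import Mock, call


def reset_with_archive(archive, cleanup):
    """Hypothetical stand-in for the reset step: archive first, then clean up."""
    archive()
    cleanup()


def test_archive_happens_before_cleanup():
    # Child mocks of one parent share a single ordered call log.
    parent = Mock()
    reset_with_archive(parent.archive, parent.cleanup)
    assert parent.mock_calls == [call.archive(), call.cleanup()]
```

Using one parent Mock means the assertion fails if the two steps are ever swapped, which is the regression this kind of test guards against.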

Motivation

Real-world AI coding agents are used in multi-turn conversations. Users:

  • Ask agents to create something, then refine it iteratively
  • Return to agents in new sessions expecting them to pick up where they left off using files/context
  • Need agents that can work with both conversation history AND file-based persistence

PinchBench should test all of these scenarios. The existing task_second_brain.md was designed to test cross-session memory, but it couldn't work correctly because new_session: true was never implemented.

Testing

# Run the new tests
cd /path/to/skill
uv run pytest tests/test_multi_session.py -v

# Lint check
ruff check scripts/lib_agent.py

Backward Compatibility

  • Single-session tasks: No change in behavior
  • Multi-session tasks without new_session: No change in behavior (all prompts share one session, same as before)
  • Multi-session tasks with new_session: true: Now correctly isolates sessions. Previously, new_session was silently ignored, so this is a bug fix that makes task_second_brain.md work as originally intended.

…isolation

- Implement new_session: true in lib_agent.py: when a session entry has
  new_session: true, archive the current transcript, clean up the agent's
  session state, and generate a new session_id so the agent starts with
  no conversation history (workspace files are preserved)
- Add _archive_transcript() helper for per-session transcript preservation
  before session resets
- Merge all archived session transcripts after all sessions complete so
  the grading engine can inspect the full conversation history
- Document sessions and new_session fields in TASK_TEMPLATE.md with a
  dedicated Multi-Session Tasks section and updated author checklist
- Add task_iterative_code_refine: a 3-session iterative refinement task
  demonstrating multi-turn conversation and new_session isolation
- Add test_multi_session.py with 11 tests covering frontmatter parsing,
  new_session flag detection, transcript archiving, and integration
@HaotianChen616
Author

#329

@kilo-code-bot
Contributor

kilo-code-bot bot commented Apr 15, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

This PR is well-structured and correctly implements new_session: true session isolation. The transcript archiving/merging logic handles multiple new_session boundaries correctly: each archive captures the cumulative conversation up to that boundary, and the merge loop reassembles the full history in order. Error handling at each I/O step is present and appropriate.

Files Reviewed (6 files)
  • scripts/lib_agent.py: core multi-session logic
  • tasks/TASK_TEMPLATE.md: documentation additions
  • tasks/task_iterative_code_refine.md: new task
  • tasks/task_session_chain_analysis.md: new task
  • tasks/assets/session_chain/agent-command.ts: task asset
  • tests/test_multi_session.py: new tests

Reviewed by claude-4.6-sonnet-20260217 · 153,273 tokens
