feat: implement new_session:true support for multi-turn task session … by HaotianChen616 · Pull Request #330 · pinchbench/skill

HaotianChen616 · 2026-04-15T08:10:41Z

feat: implement full multi-turn user prompt support with session isolation

Summary

PinchBench currently has partial multi-session support — the sessions field in task frontmatter is processed to send sequential prompts, but the new_session: true flag (already defined in task_second_brain.md) is completely ignored by the Python codebase. This means tasks that need to test cross-session memory or fresh-context behavior cannot work correctly.

This PR implements full multi-turn user prompt support by:

1. Implementing new_session: true session isolation in lib_agent.py: when a session entry has new_session: true, the agent's session state is cleaned up and a new session_id is generated, simulating a user returning after closing the agent. The workspace (and any files created) is preserved across sessions.
2. Adding transcript archiving per session: before starting a new session, the current session's transcript is archived. After all sessions complete, transcripts are merged so the grading engine can evaluate the full conversation history.
3. Documenting multi-session tasks in TASK_TEMPLATE.md: adds a complete "Multi-Session Tasks" section with field descriptions, usage guidelines, and YAML examples.
4. Adding two new multi-turn tasks:
   - task_iterative_code_refine.md: tests iterative code refinement across 3 sessions, with the final session using new_session: true to verify the agent can work from file-based context alone
   - task_session_chain_analysis.md: tests 4-turn structured code analysis (single-session multi-turn) where the agent reads TypeScript source files, produces JSON chain maps, designs minimal changes, extracts code evidence, and writes a delivery summary
5. Adding tests: test_multi_session.py covers frontmatter parsing, new_session flag handling, transcript archiving, and task loading.
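
The session-isolation rule described above can be sketched in a few lines. This is an illustration only, not the code in lib_agent.py: the helper name `plan_session_ids`, the `prompt` key, and the use of `uuid` for session IDs are all assumptions made for the example, and the real loop also cleans up agent state and archives transcripts at each boundary.

```python
import uuid
from typing import Any


def plan_session_ids(sessions: list[dict[str, Any]]) -> list[str]:
    """Return the session ID each prompt would run under (hypothetical helper).

    A fresh ID is minted whenever an entry sets new_session: true;
    otherwise the prompt reuses the current ID and shares its history.
    """
    current_session_id = uuid.uuid4().hex
    plan: list[str] = []
    for index, entry in enumerate(sessions):
        # Assumption: the first entry already starts in a fresh session,
        # so the flag only has an effect from the second entry onward.
        if index > 0 and entry.get("new_session", False):
            current_session_id = uuid.uuid4().hex
        plan.append(current_session_id)
    return plan


# Three entries shaped like the iterative-refinement task: the last one
# asks for a fresh session, so it gets a different ID.
sessions = [
    {"prompt": "Implement calculator.py"},
    {"prompt": "Add error handling"},
    {"prompt": "Review the code and write review.txt", "new_session": True},
]
ids = plan_session_ids(sessions)
```

The first two prompts share one ID (same conversation history), while the third runs under a new ID and can rely only on files left in the workspace.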

Changes

`scripts/lib_agent.py`

- Added _archive_transcript() helper to save per-session transcripts before session resets
- Modified execute_openclaw_task() multi-session loop to:
  - Track current_session_id separately (changes when new_session: true)
  - Call cleanup_agent_sessions() and generate a new session ID when new_session: true
  - Log session transitions clearly
- After all sessions complete, merge archived transcripts with the final session transcript when new_session was used
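
A minimal sketch of the archive-then-merge flow listed above, under stated assumptions: transcripts are treated as line-oriented (JSONL-style) files, each archive is assumed to hold only the turns recorded since the previous reset, and the function names, signatures, and `session_<n>.jsonl` naming are hypothetical rather than taken from lib_agent.py.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def archive_transcript(
    transcript: Optional[Path], archive_dir: Path, index: int
) -> Optional[Path]:
    """Copy the current transcript aside; return None if there is nothing to archive."""
    if transcript is None or not transcript.exists():
        return None
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"session_{index}.jsonl"  # hypothetical naming scheme
    shutil.copy2(transcript, target)
    return target


def merge_transcripts(archives: list[Path], final: Path) -> list[str]:
    """Concatenate archived transcripts and the final one, oldest first."""
    lines: list[str] = []
    for path in [*archives, final]:
        lines.extend(path.read_text(encoding="utf-8").splitlines())
    return lines


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    live = root / "transcript.jsonl"
    live.write_text('{"turn": 1}\n{"turn": 2}\n', encoding="utf-8")

    # Archive first, then reset: the order matters because the reset
    # discards the live transcript.
    archived = archive_transcript(live, root / "archive", 0)
    live.unlink()  # stands in for the session cleanup step
    live.write_text('{"turn": 3}\n', encoding="utf-8")

    merged = merge_transcripts([archived], live)
```

Archiving before the reset is what lets the grader still see turns 1 and 2 after the fresh session has replaced the live transcript.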

`tasks/TASK_TEMPLATE.md`

- Added sessions and new_session YAML examples to the frontmatter template
- Added "Multi-Session Tasks" documentation section with field table, how-it-works explanation, and usage guidelines
- Added multi-session checklist items
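
To make the frontmatter fields concrete, here is a hedged sketch of how an already-parsed `sessions` list might be normalized. It assumes entries are either plain strings or mappings, that a mapping carries its text under a `prompt` key (an assumption, not confirmed by this PR), and that `new_session` defaults to false. `normalize_sessions` is a hypothetical helper, not part of the codebase.

```python
from typing import Any, Union

SessionEntry = Union[str, dict[str, Any]]


def normalize_sessions(raw: list[SessionEntry]) -> list[dict[str, Any]]:
    """Turn mixed string/mapping entries into uniform dicts (hypothetical helper)."""
    normalized: list[dict[str, Any]] = []
    for entry in raw:
        if isinstance(entry, str):
            # A bare string is just a prompt that continues the current session.
            normalized.append({"prompt": entry, "new_session": False})
        elif isinstance(entry, dict):
            normalized.append(
                {
                    "prompt": str(entry.get("prompt", "")),  # assumed key name
                    "new_session": bool(entry.get("new_session", False)),
                }
            )
        else:
            raise TypeError(f"unsupported session entry: {entry!r}")
    return normalized


# What a YAML loader might hand back for a two-entry sessions list.
parsed = normalize_sessions(
    [
        "Create notes.md with today's decisions",
        {"prompt": "What did we decide earlier?", "new_session": True},
    ]
)
```

Normalizing up front keeps the execution loop free of type checks, so it only has to read `prompt` and `new_session` from each entry.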

`tasks/task_iterative_code_refine.md` (new)

- 3-session task: initial implementation → add error handling → fresh-session review
- Tests iterative refinement and cross-session file-based context retrieval
- Automated grading checks both calculator.py and review.txt
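
As an illustration of what a check over both files could look like, the sketch below scores a workspace on two criteria. The function name, its signature, and the score keys are assumptions for this example; the task's actual grading function may differ in shape and in what it inspects.

```python
import ast
import tempfile
from pathlib import Path


def grade_workspace(workspace: Path) -> dict[str, float]:
    """Score a workspace on the two artifacts the task expects (hypothetical)."""
    scores = {"calculator_valid": 0.0, "review_written": 0.0}

    calculator = workspace / "calculator.py"
    if calculator.is_file():
        try:
            # Parsing catches syntax errors without executing agent-written code.
            ast.parse(calculator.read_text(encoding="utf-8"))
            scores["calculator_valid"] = 1.0
        except SyntaxError:
            pass

    review = workspace / "review.txt"
    if review.is_file() and review.read_text(encoding="utf-8").strip():
        scores["review_written"] = 1.0

    return scores


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "calculator.py").write_text(
        "def add(a, b):\n    return a + b\n", encoding="utf-8"
    )
    (ws / "review.txt").write_text(
        "Division by zero is handled.\n", encoding="utf-8"
    )
    result = grade_workspace(ws)
```

Because review.txt is written in the fresh session, a non-empty review is indirect evidence that the agent recovered context from the workspace files rather than from conversation history.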

`tasks/task_session_chain_analysis.md` (new)

- 4-turn task in a single session (shared conversation context, no new_session): structured ingest → minimal design → evidence extraction → delivery summary
- Source files from the OpenClaw agent command pipeline (session.ts, session-store.ts, delivery.ts, agent-command.ts) copied to tasks/assets/session_chain/
- Tests the agent's ability to maintain context across turns, produce structured JSON, extract precise code evidence, and generate traceable design review artifacts
- Automated grading checks chain_map stages, evidence_index, function references, JSON validity, and traceability

`tasks/assets/session_chain/` (new)

- session.ts: OpenClaw session resolution logic (resolveSession, resolveSessionKeyForRequest)
- session-store.ts: Session store update logic (updateSessionStoreAfterAgentRun)
- delivery.ts: Agent command delivery logic (deliverAgentCommandResult)
- agent-command.ts: Agent command execution pipeline (prepareAgentCommandExecution, agentCommandInternal)

`tasks/manifest.yaml`

Added task_iterative_code_refine and task_session_chain_analysis to the task list

`tests/test_multi_session.py` (new)

- TestMultiSessionFrontmatterParsing: sessions list, new_session flag, string entries
- TestNewSessionHandling: archive-before-cleanup behavior verification
- TestArchiveTranscript: transcript archiving with/without transcript path
- TestMultiSessionTaskLoading: integration test verifying task_second_brain and task_iterative_code_refine load correctly
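
The archive-before-cleanup property can be verified with a simple call log. The sketch below shows the idea and is not the test code from test_multi_session.py: `start_new_session` is a hypothetical stand-in for the transition step, with the archive and cleanup actions injected as callables.

```python
from typing import Callable


def start_new_session(
    archive: Callable[[], None], cleanup: Callable[[], None]
) -> None:
    """Hypothetical transition step: archive the transcript, then reset state."""
    archive()
    cleanup()


def test_archive_runs_before_cleanup() -> None:
    calls: list[str] = []
    start_new_session(
        archive=lambda: calls.append("archive"),
        cleanup=lambda: calls.append("cleanup"),
    )
    # If cleanup ran first, the transcript would be gone before it was saved.
    assert calls == ["archive", "cleanup"]
```

Injecting the two actions keeps the test independent of the agent runtime, so it runs without starting any real session.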

Motivation

Real-world AI coding agents are used in multi-turn conversations. Users:

- Ask agents to create something, then refine it iteratively
- Return to agents in new sessions expecting them to pick up where they left off using files/context
- Need agents that can work with both conversation history AND file-based persistence

PinchBench should test all of these scenarios. The existing task_second_brain.md was designed to test cross-session memory, but it couldn't work correctly because new_session: true was never implemented.

Testing

# Run the new tests
cd /path/to/skill
uv run pytest tests/test_multi_session.py -v

# Lint check
ruff check scripts/lib_agent.py

Backward Compatibility

- Single-session tasks: No change in behavior
- Multi-session tasks without new_session: No change in behavior (all prompts share one session, same as before)
- Multi-session tasks with new_session: true: Now correctly isolates sessions. Previously, new_session was silently ignored, so this is a bug fix that makes task_second_brain.md work as originally intended.

…isolation

- Implement new_session: true in lib_agent.py: when a session entry has new_session: true, archive the current transcript, clean up the agent's session state, and generate a new session_id so the agent starts with no conversation history (workspace files are preserved)
- Add _archive_transcript() helper for per-session transcript preservation before session resets
- Merge all archived session transcripts after all sessions complete so the grading engine can inspect the full conversation history
- Document sessions and new_session fields in TASK_TEMPLATE.md with a dedicated Multi-Session Tasks section and updated author checklist
- Add task_iterative_code_refine: a 3-session iterative refinement task demonstrating multi-turn conversation and new_session isolation
- Add test_multi_session.py with 11 tests covering frontmatter parsing, new_session flag detection, transcript archiving, and integration

HaotianChen616 · 2026-04-15T08:12:31Z

#329

kilo-code-bot · 2026-04-15T08:12:54Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

This PR is well-structured and correctly implements new_session: true session isolation. The transcript archiving/merging logic handles multiple new_session boundaries correctly — each archive captures the cumulative conversation up to that boundary, and the merge loop reassembles the full history in order. Error handling at each I/O step is present and appropriate.

Files Reviewed (6 files)

- scripts/lib_agent.py: core multi-session logic
- tasks/TASK_TEMPLATE.md: documentation additions
- tasks/task_iterative_code_refine.md: new task
- tasks/task_session_chain_analysis.md: new task
- tasks/assets/session_chain/agent-command.ts: task asset
- tests/test_multi_session.py: new tests

Reviewed by claude-4.6-sonnet-20260217 · 153,273 tokens

HaotianChen616 wants to merge 1 commit into pinchbench:main from HaotianChen616:feat/multi-session-new-session-support
