wait_for_text: two-call state/capture race within a single poll tick #50

@tony

Type: architecture · Tier: deferred · Tool: wait_for_text

What's happening

Each poll in wait_for_text runs two tmux subprocess calls in sequence:

  1. _read_pane_state issues display-message to read history_size, cursor_y, pane_height, pane_pid, pane_dead.
  2. pane.capture_pane(start=start_line, end=None, join_wrapped=True) issues capture-pane, where start_line = baseline_abs - state.history_size + 1.

Between (1) and (2), tmux can scroll more lines into history. tmux's capture-pane computes top = gd->hsize + n against the live hsize at capture time (cmd-capture-pane.c#L158), not the hsize we sampled in step 1. So when N new rows scroll between the two calls:

  • We pass n = baseline_abs - hsize_at_step1 + 1
  • tmux computes top = hsize_at_step2 + n = baseline_abs + 1 + (hsize_at_step2 - hsize_at_step1)
  • The captured window starts N rows past the row we wanted; those N rows are invisible to the wait this tick.
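The arithmetic above can be checked with a few made-up numbers. This is only an illustration of the two formulas quoted in this issue; the helper names and the sample values are hypothetical and do not come from the codebase.

```python
def requested_start(baseline_abs: int, hsize_sampled: int) -> int:
    """The relative start we pass to capture-pane, using the sampled hsize."""
    return baseline_abs - hsize_sampled + 1


def effective_top(hsize_live: int, n: int) -> int:
    """Absolute row where tmux starts capturing: top = hsize + n (live hsize)."""
    return hsize_live + n


baseline_abs = 120   # last absolute row already inspected (made-up value)
hsize_step1 = 100    # history_size as read by display-message
hsize_step2 = 103    # history_size when capture-pane runs: 3 rows scrolled in

n = requested_start(baseline_abs, hsize_step1)
top = effective_top(hsize_step2, n)
wanted = baseline_abs + 1
skipped = top - wanted

print(n, top, wanted, skipped)  # 21 124 121 3
```

With these values the capture starts at absolute row 124 instead of 121, so rows 121 through 123 are not seen on this tick.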

When it matters

Single-tick latency under bursty output. The next poll usually picks the missed rows back up — unless the missed rows have already scrolled past the visible region and been collected by grid_collect_history, at which point the rollover guard fires and the wait raises. So the bug surface is:

  • One polling interval of extra latency on transient bursts (default 50 ms; bounded).
  • Permanent miss only at the moment of history rollover — but rollover now raises.

In other words: the race exists but its impact is bounded by interval and capped at "raise" rather than "silently wrong" thanks to the rollover guard.

Options under consideration

1. Re-read after capture, retry on drift

state_pre = await asyncio.to_thread(_read_pane_state, pane)
start_line = baseline_abs - state_pre.history_size + 1
lines = await asyncio.to_thread(pane.capture_pane, start=start_line, ..., join_wrapped=True)
state_post = await asyncio.to_thread(_read_pane_state, pane)
delta = state_post.history_size - state_pre.history_size
if delta > 0:
    # capture started `delta` rows too late; re-issue with adjusted start
    ...

Raises per-tick subprocess cost: the sketch above makes 3 tmux calls per tick instead of 2, plus a further capture-pane whenever drift is detected (double the current cost). Complicates the _PaneState invariant set: now we track two state reads per tick. Test matrix grows.
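A minimal, self-contained sketch of how the retry could look, run against a simulated pane rather than tmux. FakePane, read_history_size, capture and capture_with_drift_retry are all hypothetical names invented for this illustration; the simulation only models the top = hsize + n behavior described above. Note that this variant also re-reads state after the re-issued capture to confirm it was stable, so it spends more calls than the bare minimum.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakePane:
    """Hypothetical stand-in for a pane; rows are indexed by absolute row."""
    rows: list[str]
    history_size: int
    pending: list[str] = field(default_factory=list)  # arrives mid-tick
    calls: int = 0

    def read_history_size(self) -> int:
        self.calls += 1
        return self.history_size

    def capture(self, start: int) -> list[str]:
        self.calls += 1
        # Simulate output scrolling in between the state read and the capture.
        if self.pending:
            self.rows.extend(self.pending)
            self.history_size += len(self.pending)
            self.pending = []
        top = self.history_size + start  # resolved against the live hsize
        return self.rows[top:]


async def capture_with_drift_retry(pane, baseline_abs: int, max_retries: int = 3):
    """Capture rows after baseline_abs, re-issuing while history_size drifts."""
    pre = await asyncio.to_thread(pane.read_history_size)
    lines: list[str] = []
    for _ in range(max_retries + 1):
        start = baseline_abs - pre + 1
        lines = await asyncio.to_thread(pane.capture, start)
        post = await asyncio.to_thread(pane.read_history_size)
        if post == pre:
            return lines  # hsize was stable around the capture
        pre = post        # drift detected: reuse the newer reading and retry
    return lines          # still drifting; leave recovery to the next tick


pane = FakePane(rows=[f"r{i}" for i in range(10)], history_size=4,
                pending=["r10", "r11"])
result = asyncio.run(capture_with_drift_retry(pane, baseline_abs=7))
print(result, pane.calls)  # ['r8', 'r9', 'r10', 'r11'] 5
```

In the demo the first capture returns only r10 and r11 because two rows scrolled in mid-tick; the retry recovers r8 and r9 at the cost of five simulated calls in total.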

2. Chain in a single tmux command

Build one pane.cmd(...) invocation that issues display-message ; capture-pane with tmux's \; chaining. One stdout stream needs to be split by the caller. Drops out of libtmux's typed API. Tightly couples to tmux's chaining quirks.
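A rough sketch of what the caller-side split might involve, assuming the state is printed as a single space-separated line ahead of the captured rows. The format string, the PaneState tuple, and both helper names are assumptions made for this illustration, not the project's code. Because the start offset cannot be derived from a state read in the same invocation, the sketch captures from the start of history (-S -), which is one possible choice with its own cost.

```python
from typing import NamedTuple

# Assumed one-line layout for the state read (hypothetical choice).
STATE_FORMAT = "#{history_size} #{cursor_y} #{pane_height} #{pane_pid} #{pane_dead}"


class PaneState(NamedTuple):
    history_size: int
    cursor_y: int
    pane_height: int
    pane_pid: int
    pane_dead: bool


def build_chained_argv(pane_id: str) -> list[str]:
    """argv for one tmux invocation; a bare ';' argument separates commands."""
    return [
        "tmux",
        "display-message", "-p", "-t", pane_id, STATE_FORMAT,
        ";",
        "capture-pane", "-p", "-J", "-t", pane_id, "-S", "-",
    ]


def split_chained_output(stdout: str) -> tuple[PaneState, list[str]]:
    """Split combined stdout: first line is state, the rest is the capture."""
    first, _, rest = stdout.partition("\n")
    hsize, cursor_y, height, pid, dead = first.split(" ")
    state = PaneState(int(hsize), int(cursor_y), int(height), int(pid), dead == "1")
    lines = rest.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty tail left by the final newline
    return state, lines


sample = "103 5 24 4242 0\n$ make\nok\n"
state, lines = split_chained_output(sample)
print(state.history_size, lines)  # 103 ['$ make', 'ok']
```

The parsing burden shown here is the coupling cost mentioned above: any change in how tmux orders or terminates the two outputs would have to be absorbed by this splitter.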

3. Document, rely on next-tick recovery (current behavior)

Acceptable because:

  • The miss is bounded by interval (default 50 ms).
  • Permanent misses now raise rather than silently return wrong results, courtesy of the rollover guard.
  • The deterministic alternative for command-completion synchronization is wait_for_channel composed with tmux wait-for -S — zero polling, zero races.
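For comparison, the tmux primitive behind that last point can be sketched as follows. This shows only plain tmux wait-for usage driven from Python; it is not the implementation of wait_for_channel, and the helper names, pane id, and channel name are hypothetical.

```python
import shlex
import subprocess


def send_keys_argv(pane_id: str, command: str, channel: str) -> list[str]:
    """argv that types `command` into the pane, followed by a channel signal."""
    shell_line = f"{command}; tmux wait-for -S {shlex.quote(channel)}"
    return ["tmux", "send-keys", "-t", pane_id, shell_line, "Enter"]


def wait_argv(channel: str) -> list[str]:
    """argv that blocks until the channel is signalled with wait-for -S."""
    return ["tmux", "wait-for", channel]


def run_and_wait(pane_id: str, command: str, channel: str, timeout: float) -> None:
    """Run a command in the pane and block until it signals completion."""
    subprocess.run(send_keys_argv(pane_id, command, channel), check=True)
    subprocess.run(wait_argv(channel), check=True, timeout=timeout)


print(send_keys_argv("%1", "make test", "build-done"))
```

The caller blocks on the channel instead of inspecting pane contents, so no capture window is involved at all.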

Recommendation

Stay on option 3 until real-world telemetry shows flaky single-tick misses. The blast radius is small and the agent-facing escape hatch (wait_for_channel) is already documented in the wait_for_text "When NOT to use this" section. Re-evaluate if a stress-test fixture starts catching missed transitions.
