feat: add Maestro replay compatibility by thymikee · Pull Request #581 · callstackincubator/agent-device

thymikee · 2026-05-21T19:34:39Z

Summary

Adds a focused Maestro replay compatibility path for existing YAML flows, including command mapping, variable/runScript handling, flow-control shims, and replay-only runtime helpers for scroll/tap behavior. Also adds the iOS non-hittable selector tap backdoor needed for hidden RN E2E controls and documents the supported replay subset.

Updates #558.

Touched 34 files. Scope is replay compatibility plus the narrow iOS runner tap fallback required by that workflow. Device utility commands such as permissions, mock location, airplane mode, orientation, and recording are intentionally left unsupported for now.

Validation

Verified with pnpm format, focused replay/help tests, pnpm build, and git diff --check. pnpm check:unit outside the sandbox passed all but one unrelated Android .aab install test that timed out in the full run; rerunning that exact test passed. Earlier validation on this branch also passed pnpm build:xcuitest for iOS and macOS. Manual Bluesky replay comparison produced Agent Device and Maestro recordings during validation; Maestro still fails the selected Bluesky flow earlier than Agent Device because its precondition does not reach the target screen.

github-actions · 2026-05-21T19:36:10Z

PR Preview Action v1.8.1
🚀 View preview at https://callstackincubator.github.io/agent-device/pr-preview/pr-581/
Built to branch `gh-pages` at 2026-05-26 07:07 UTC. Preview will be ready when the GitHub Pages deployment is complete.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6794225256

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

thymikee · 2026-05-22T08:01:03Z

Code review

Overview

This reworks the Maestro replay compatibility layer: it removes the previously-mapped device/utility commands (setPermissions, setLocation, setOrientation, setAirplaneMode, startRecording/stopRecording, killApp, literal assertTrue) and adds a runtime command layer (__maestro* shims dispatched at replay time), scrollUntilVisible, percentage point taps, direction-based swipe, runFlow.when.visible/notVisible, runScript (JS execution + http.post/json/output), launchApp launch arguments + iOS-sim clearState, and an iOS "non-hittable selector tap" backdoor for RN E2E controls. The split into flow-control.ts / points.ts / run-script.ts / runtime-commands.ts / session-replay-maestro-runtime.ts is clean and the runtime tests are substantial.

Findings

Security / decisions to make consciously

S1 — runScript executes arbitrary JS + network requests at parse time, via vm (not a sandbox). (src/compat/maestro/run-script.ts, executeRunScript)

executeRunScript runs the referenced .js through vm.runInNewContext(...) and intentionally exposes json() and http.post. Node's docs are explicit that vm is not a security mechanism, so this should not be described as sandboxed.
It runs during parse/convert (it takes a MaestroParseContext and writes results into context.env synchronously), not when the step is reached during replay. Combined with runFlow: <file> (recursive parse) and repeat, merely parsing a flow can execute JS and fire http.post to arbitrary URLs before any device action — and a runScript under repeat.times: N executes N×.
Recommendation: (a) document honestly that runScript runs untrusted user JS with full Node + network capability and is not sandboxed; (b) consider deferring execution to replay-runtime so a dry parse/validate is side-effect-free; (c) consider an explicit opt-in flag. Path handling itself (absolute or baseDir-relative only) is fine.

S2 — iOS non-hittable selector tap: bounds check is looser than its comment. (RunnerTests+Interaction.swift, hasTappableFrame)
hasTappableFrame accepts when appFrame.isEmpty || appFrame.intersects(frame), then taps at frame.midX/midY. A partially-intersecting element (mostly off-screen) can be tapped at a point outside the visible app frame. It's gated behind the replay-only allowNonHittableSelectorTap flag (set only by tapOn), so blast radius is limited, but consider clamping the tap point into the app frame, or requiring the center (not just any intersection) to be inside it.

Issues to verify

I1 — Android launchArgs are silently dropped. (src/core/interactors/android.ts, open)
convertLaunchApp maps Maestro arguments/launchArguments generically (no platform gate), but the Android interactor's open forwards only options?.activity — launchArgs are ignored. A cross-platform flow using launchApp.arguments on Android appears to succeed while the arguments vanish. Either wire launch args through openAndroidApp or fail loudly / document them as iOS-only (mirroring how clearState is explicitly iOS-sim-only).

I2 — clearState messaging is self-contradictory; verify the iOS data-wipe is correct. (clearIosSimulatorAppState in src/platforms/ios/apps.ts, dispatched from dispatch.ts)
The iOS-sim path resolves simctl get_app_container ... data and fs.rms every entry, then relaunches without reinstall. Worth confirming this reliably produces clean state and removes nothing simctl launch expects. Separately, the help/docs text still says the subset works "without state-reset side effects" while clearState now does reset iOS-sim state — align the wording.

I3 — pressEnter "fetch failed" recovery can report a non-submit as success. (session-replay-maestro-runtime.ts)
On a type ['\n'] failure whose message contains fetch failed, it issues a snapshot and, if that succeeds, returns { ok: true, recovered: true }. Success is inferred from transport recovery, not from any UI state change, and keys off substring-matching the error text (brittle — prefer an error code). Reasonable as a pragmatic shim, but worth a comment on the risk.

I4 — runHttpRequestSync blindly JSON.parse(result.stdout). (run-script.ts)
If the spawned node -e process times out or emits non-JSON, JSON.parse throws an opaque Unexpected token and the child's stderr is discarded. Check exitCode and include stderr in the surfaced error.

I5 — runFlow.when round-trip + fractional trace-step keys. (flow-control.ts sessionActionToBatchStep, runtime batchStepToSessionAction)
when-gated blocks are wrapped into batchSteps and re-expanded via invokeReplayAction. Nested __maestro* runtime commands inside a when block do appear to re-dispatch correctly, but there's no test for a runtime command nested inside when.visible (only tapOn→find). Also the step + index/1000 trace-step scheme will collide if a when block holds ≥1000 steps — astronomically unlikely, but worth a comment.

Nits

invokeMaestroScrollUntilVisible doesn't validate the direction positional at runtime (only the map form is validated at parse). Internal-only, low risk.
extractMaestroVisibleTextQuery returns null (disables fuzzy matching) for any non-pure label/text/id-equal selector — correct, but a one-line comment on why mixed selectors opt out would help.
tapFlags gained maestroOptional but doubleTap/longPress don't, though Maestro allows optional on those too — noting the asymmetry, not in scope.
Removing setPermissions/startRecording etc. breaks any external examples referencing them; make sure the Track Maestro flow compatibility for replay --maestro #558 supported matrix is updated (in-repo replay-e2e.md is).

Test coverage & CI

Coverage of the new mapping/runtime surface is good: replay-flow.test.ts covers tapOn→__maestroTapOn (+ the allowNonHittableSelectorTap flag only on selector taps), percentage point taps, scrollUntilVisible, direction swipe, openLink appId prefixing, launchApp args + clearState, and removed-command rejections. session-replay-vars.test.ts covers scrollUntilVisible probe/scroll/timeout, fuzzy tapOn, optional native-label fallback, percentage point resolution, tapOn retry, pressEnter transport recovery, and runFlow.when.visible run/skip.

Gaps worth filling:

runScript: only one happy-path mapping test (output → type, exercising json() + ${output.x}). No test for http.post, none for a throwing/failing script, none asserting the parse-time/non-sandboxed behavior.
Android launchArgs drop (I1) — untested.
clearIosSimulatorAppState (I2) — untested.
iOS non-hittable fallback has no Swift-level test; hasTappableFrame partial-intersection (S2) is untested (only the TS handler forwarding allowNonHittableTap is asserted).

CI: not failing. All completed checks pass; the unstable/pending state at review time was due to two still-in-progress checks (a Smoke Tests job and "Analyze (swift)" CodeQL) plus the umbrella CodeQL run being neutral (informational), not a red check.

Summary

Solid, well-factored compatibility work with strong runtime tests. The item warranting an explicit maintainer decision is runScript (S1) — it runs arbitrary user JS with network access at parse time and shouldn't be called sandboxed; consider deferring to runtime and/or an opt-in flag, and document the capability honestly. The most concrete functional bug is Android launchArgs being silently dropped (I1), and the "no state-reset side effects" wording is now inconsistent with clearState (I2). None of these block the approach; they're about hardening behavior and aligning the documented contract.

_{Reviewed with Claude Code.}

Generated by Claude Code

thymikee

Took a closer look at the Maestro replay compatibility surface. Overall the structure looks reasonable — the parse-time → runtime-command split for tapOn/scrollUntilVisible/runFlow.when/pressEnter is the right factoring, and the test coverage on replay-flow.test.ts and session-replay-vars.test.ts is substantial.

Spot-checked the two Codex P1s from earlier:

scrollUntilVisible direction inversion: fixed by introducing the separate readScrollUntilVisibleDirection (left a note suggesting a docblock so the asymmetry with readScrollPositionalsFromDirectionSwipe is obvious to future readers).
runFlow.when swallowing errors: narrowed, but the heuristic in isMaestroWhenConditionMiss still classifies most COMMAND_FAILED outcomes from is as "condition missed" → silently skip subflow. Worth a small refactor on the is handler side to make pass/fail/error distinct.

Other observations I'd like to see addressed before merge:

vm.runInNewContext is not a security boundary and its timeout doesn't bound async work. Both should be documented in code + replay-e2e.md so the trust model is explicit (this is internal-flows-only, right?).
pressEnter fetch-failed recovery matches by substring and treats a successful follow-up snapshot as "Enter worked". Worth tightening both the match condition and the response shape so upstream is predicates can't read the recovered state as success.
MAX_REPEAT_EXPANSIONS = 100 is fine to ship but parse-time expansion will be painful to back out of if real flows want larger repeats. A runtime-command form (like runFlowWhen) would scale better.
http.post spawns a Node child per call — fine for one-off auth setup, will be slow if anyone actually loops on it.

Nothing in here looks like a hard block — most of the concerns are around tightening trust boundaries and getting the is-handler error taxonomy right so runFlow.when doesn't drift back toward "errors look like misses".

Generated by Claude Code

thymikee · 2026-05-26T06:24:24Z

+      timeout: RUN_SCRIPT_TIMEOUT_MS,
+    });
+  } catch (error) {
+    throw new AppError(


A few notes on vm.runInNewContext here that I think deserve being called out in code and/or in replay-e2e.md:

vm is not a sandbox. The Node docs are explicit: "The node:vm module is not a security mechanism. Do not use it to run untrusted code." Script code can reach back into the host process via prototype chains, getter side effects, etc. Given that Maestro replay flows usually come from files in the repo, this is probably an acceptable trust model — but it should be documented, because someone might reasonably assume "we sandbox runScript". A one-line comment near the call site and a sentence in replay-e2e.md would prevent future surprise.

The timeout option only bounds synchronous execution. As soon as the script does anything async (Promise, http.post returns, callback timers, etc.), the 30 s timeout doesn't help. In practice this means a long-running fetch in user code won't be aborted by the vm.runInNewContext timeout — only by the child-process timeout inside runHttpRequestSync. Worth noting in a comment so readers don't assume "everything is bounded by 30 s".

Output keys with dots end up as env keys with dots (context.env[\output.${key}`] = ...). If a script does output["bar.baz"] = 1, the resulting key output.bar.bazis ambiguous w.r.t. Maestro's nested variable resolution if it ever lands. Probably fine for now, but consider rejecting.` in output keys to keep the namespace clean.

Generated by Claude Code

thymikee · 2026-05-26T06:24:35Z

+      ok: true,
+      data: { skipped: true, condition: condition.mode, selector: condition.selector },
+    };
+  }


This heuristic is the narrowed version of the Codex bot's earlier P1 (runFlow.when swallowed runtime errors as condition misses). It's better than the original — the android_foreground_surface carve-out is good — but the underlying conflation is still there: any COMMAND_FAILED from the is handler (e.g. snapshot transport failure, runner disconnect, selector parse error, timeout) is interpreted as "the condition did not match" and the subflow is silently skipped.

The cleanest fix would be for the is handler to use distinct error codes (or include details.predicate = 'pass' | 'fail' | 'error') so this check can pivot on code === 'PREDICATE_FAILED' (or details.predicate === 'fail') instead of code === 'COMMAND_FAILED'. That makes the behavior intentional and exhaustive instead of "everything looks like a miss unless we know otherwise".

Short of that refactor, I'd at least make the carve-out an allowlist of "things that mean the condition was evaluated and was false" rather than a denylist of "things that don't count as a miss" — denylists drift over time as new failure modes get added to the is path.

Generated by Claude Code

thymikee · 2026-05-26T06:24:48Z

+    predicate: mode === 'visible' ? 'visible' : 'hidden',
+    selector,
+  };
+}


Two concerns with the "Enter caused transport reset, snapshot succeeded → treat as recovered" path:

message.includes('fetch failed') matches anywhere in the string, including substrings of unrelated errors that happen to mention fetch failure. If you also start surfacing other error messages that contain "fetch failed" (e.g. an HTTP plugin or a future replay backend), this will accidentally extend the recovery path to them. Matching on error.code plus a specific transport reason string would be more durable.

A successful follow-up snapshot doesn't prove that the original type "\n" actually submitted the form — only that the runner is alive again. The warning in the response acknowledges this, but the caller sees { ok: true, recovered: true, ... } and will probably proceed as if Enter worked. For replay specifically that's often fine (the next step will fail loudly if it didn't), but for runFlow.when evaluation upstream, an ok: true here is indistinguishable from a real success. Consider whether the recovered case should be propagated as a distinct outcome (e.g. { ok: true, data: { recovered: true } } should never feed back into is predicates as "pass").

Generated by Claude Code

thymikee · 2026-05-26T06:24:57Z

@@ -136,10 +181,39 @@ export function convertSwipe(value: unknown): SessionAction {
  );
 }



Note for future readers: readScrollUntilVisibleDirection passes the direction through unchanged (up → up), while readScrollPositionalsFromDirectionSwipe above inverts it (up → down). Both behaviors are correct given Maestro semantics — swipe.direction describes finger motion, scrollUntilVisible.direction describes scroll/content motion — but the asymmetry inside one file is genuinely confusing.

A short docblock on each function ("Maestro swipe direction describes finger motion, so we invert to get scroll direction" / "Maestro scrollUntilVisible.direction already describes scroll direction") would save the next reader a few minutes of "wait, why are these different?".

This also looks like the fix for the Codex bot's resolved P1 about scroll direction inversion in scrollUntilVisible — the previous code reused readScrollPositionalsFromDirectionSwipe here, which inverted incorrectly. The new function is the right shape.

Generated by Claude Code

thymikee · 2026-05-26T06:25:09Z

+const MAX_REPEAT_EXPANSIONS = 100;
+
+type ConvertCommandList = (
+  commands: MaestroCommand[],


MAX_REPEAT_EXPANSIONS = 100 is a fairly aggressive cap given Maestro's typical use of repeat (e.g. scrolling N times to find an item, or polling). 100 is enough for most cases but I wouldn't be surprised to see real-world flows hit this — particularly repeat: { times: ${ITEM_COUNT}, commands: [tap, swipe up] } patterns for long lists.

A couple of options:

Bump to 1000 (still bounds the eager expansion to a sensible memory ceiling).

Or, instead of eagerly expanding times copies of the action list at parse time, emit a single runtime command (runtime: repeat) and have the runtime loop, the same way runFlow.when already does. That removes the cap entirely.

Not blocking, but worth a thought before this lands — bumping the cap later is fine, but the parse-time-expansion pattern can be painful to back out of once flows depend on it.

Generated by Claude Code

thymikee · 2026-05-26T06:25:21Z

+    }),
+    timeoutMs: RUN_SCRIPT_TIMEOUT_MS,
+    allowFailure: true,
+  });


Spawning a fresh Node child process for every http.post is heavy — startup cost is ~50–150 ms per call even on a fast machine, before the request is even sent. For replays that do auth/setup HTTP calls in a script, this is OK; for anything that polls or batches it'll dominate runtime.

Also: this passes the URL and headers/body to a child via JSON stdin. That's fine, but the HTTP_REQUEST_SCRIPT text is hardcoded and uses global fetch, so it implicitly requires Node ≥ 18. A defensive engines constraint or explicit error if typeof fetch === 'undefined' would help users on older Node versions.

Lower priority — fine for v1 of the Maestro replay shim, but if http.post ends up used heavily, an in-process synchronous wrapper around fetch (e.g. via Atomics.wait on a worker, or child_process.execFileSync with the script written to disk once and reused) would pay back quickly.

Generated by Claude Code

thymikee force-pushed the codex/maestro-replay-compat branch from 6794225 to baaee00 Compare May 21, 2026 19:36

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread src/compat/maestro/interactions.ts Outdated

Comment thread src/daemon/handlers/session-replay-maestro-runtime.ts Outdated

thymikee force-pushed the codex/maestro-replay-compat branch from baaee00 to 1ea51f9 Compare May 22, 2026 05:53

thymikee mentioned this pull request May 22, 2026

Track Maestro flow compatibility for replay --maestro #558

Open

thymikee added 3 commits May 25, 2026 22:29

feat: add Maestro replay compatibility

93e61ba

fix: harden Maestro replay compatibility

d1875b3

fix: address Maestro replay review feedback

78be55c

thymikee commented May 26, 2026

View reviewed changes

fix: finalize Maestro replay compatibility

ed3e970

thymikee force-pushed the codex/maestro-replay-compat branch from b4f7170 to ed3e970 Compare May 26, 2026 07:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add Maestro replay compatibility#581

feat: add Maestro replay compatibility#581
thymikee wants to merge 4 commits into
mainfrom
codex/maestro-replay-compat

thymikee commented May 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 21, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-05-26 07:07 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

thymikee commented May 22, 2026

Uh oh!

thymikee left a comment

Uh oh!

thymikee May 26, 2026

Uh oh!

thymikee May 26, 2026

Uh oh!

thymikee May 26, 2026

Uh oh!

thymikee May 26, 2026

Uh oh!

thymikee May 26, 2026

Uh oh!

thymikee May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@@ -136,10 +181,39 @@ export function convertSwipe(value: unknown): SessionAction {
		);
		}

Conversation

thymikee commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Uh oh!

github-actions Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Built to branch gh-pages at 2026-05-26 07:07 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

thymikee commented May 22, 2026

Code review

Overview

Findings

Security / decisions to make consciously

Issues to verify

Nits

Test coverage & CI

Summary

Uh oh!

thymikee left a comment

Choose a reason for hiding this comment

Uh oh!

thymikee May 26, 2026

Choose a reason for hiding this comment

Uh oh!

thymikee May 26, 2026

Choose a reason for hiding this comment

Uh oh!

thymikee May 26, 2026

Choose a reason for hiding this comment

Uh oh!

thymikee May 26, 2026

Choose a reason for hiding this comment

Uh oh!

thymikee May 26, 2026

Choose a reason for hiding this comment

Uh oh!

thymikee May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

thymikee commented May 21, 2026 •

edited

Loading

github-actions Bot commented May 21, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-05-26 07:07 UTC.
Preview will be ready when the GitHub Pages deployment is complete.