
feat(gastown): add debug replay-events endpoint for reconciler phase 5 #1373

Merged
jrf0110 merged 8 commits into main from
convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head
Mar 24, 2026

Conversation

@jrf0110
Contributor

@jrf0110 jrf0110 commented Mar 21, 2026

Summary

Adds a POST /debug/towns/:townId/replay-events endpoint that replays town events from a given time range for debugging purposes. The endpoint:

  • Accepts from/to ISO timestamps, queries all town_events in that range (regardless of processed_at)
  • Applies each event via reconciler.applyEvent() to reconstruct state transitions
  • Runs reconciler.reconcile() against the resulting state to compute what actions would be emitted
  • Captures agent and non-terminal bead snapshots
  • Rolls back all mutations via SQLite SAVEPOINT so the endpoint is fully side-effect-free
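
The flow in the list above can be sketched roughly as follows. This is an illustrative in-memory model, not the code in Town.do.ts: the `TownEvent` and `TownState` shapes, the bodies of `applyEvent` and `reconcile`, the `restart_agent` action name, and the inclusive range bounds are all assumptions, and a plain object copy stands in for the SQLite SAVEPOINT.

```typescript
// Hypothetical shapes for illustration; the PR's TownEventRecord and Action differ.
type TownEvent = { createdAt: string; agentId: string; status: string };
type Action = { type: string; agentId: string };
type TownState = Record<string, string>; // agentId -> last known status

// Stand-in for reconciler.applyEvent(): keep the latest status per agent.
function applyEvent(state: TownState, ev: TownEvent): void {
  state[ev.agentId] = ev.status;
}

// Stand-in for reconciler.reconcile(): propose a restart for each stopped agent.
function reconcile(state: TownState): Action[] {
  return Object.keys(state)
    .filter((agentId) => state[agentId] === "stopped")
    .map((agentId) => ({ type: "restart_agent", agentId }));
}

function replayEvents(live: TownState, events: TownEvent[], from: string, to: string) {
  const fromMs = Date.parse(from);
  const toMs = Date.parse(to);
  // Work on a copy so live state is untouched; the real endpoint instead
  // mutates SQLite inside a SAVEPOINT and rolls it back.
  const scratch: TownState = { ...live };
  const inRange = events
    .filter((ev) => {
      const t = Date.parse(ev.createdAt);
      return t >= fromMs && t <= toMs; // inclusive bounds are an assumption
    })
    .sort((a, b) => Date.parse(a.createdAt) - Date.parse(b.createdAt));
  for (const ev of inRange) applyEvent(scratch, ev);
  return {
    eventsReplayed: inRange.length,
    actions: reconcile(scratch),
    stateSnapshot: scratch,
  };
}
```

The returned object mirrors the `{ eventsReplayed, actions, stateSnapshot }` response named in the commit message; everything inside it is simplified.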

Also includes the preceding commits on this convoy branch: dry-run reconciler endpoint, debug dry-run with event draining, and a fix for skipping container_status events.

Verification

  • Code review: all patterns match existing debugDryRun endpoint conventions (SAVEPOINT/ROLLBACK, parameterized queries, Zod validation, eslint-disable comments)
  • Imports verified: town_events, TownEventRecord, reconciler, query, Action all correctly imported
  • SQL injection safe: user inputs passed as parameterized ? placeholders
  • Input validation: missing fields (400), invalid dates (400), reversed range (400)
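
The three 400 cases listed above could be checked along these lines. The PR itself uses Zod; this sketch uses plain TypeScript so it stays dependency-free, and the `validateReplayBody` name, the error strings, and the decision to allow `from` equal to `to` are assumptions.

```typescript
type ReplayRange = { from: string; to: string };
type ValidationResult =
  | { ok: true; value: ReplayRange }
  | { ok: false; status: 400; error: string };

function validateReplayBody(body: unknown): ValidationResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, status: 400, error: "body must be a JSON object" };
  }
  const { from, to } = body as Record<string, unknown>;
  // Missing or non-string fields.
  if (typeof from !== "string" || typeof to !== "string") {
    return { ok: false, status: 400, error: "from and to are required" };
  }
  // Unparseable timestamps. Date.parse is lenient about format, so a stricter
  // ISO check may be preferable in real code.
  const fromMs = Date.parse(from);
  const toMs = Date.parse(to);
  if (isNaN(fromMs) || isNaN(toMs)) {
    return { ok: false, status: 400, error: "from and to must be valid dates" };
  }
  // Reversed range.
  if (fromMs > toMs) {
    return { ok: false, status: 400, error: "from must not be after to" };
  }
  return { ok: true, value: { from, to } };
}
```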

Visual Changes

N/A

Reviewer Notes

  • This is an unauthenticated /debug/ route, consistent with existing debug endpoints marked for removal after debugging
  • Unlike debugDryRun, this endpoint does NOT call events.markProcessed() — this is intentional since it replays historical (already-processed) events rather than draining pending ones
  • The SAVEPOINT pattern (SAVEPOINT → try/finally → ROLLBACK TO → RELEASE) is identical to the existing debugDryRun method
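
For readers unfamiliar with the savepoint pattern mentioned in the last note, a minimal sketch follows. The `SqlExec` type and the `withRollback` helper are hypothetical and only record the order of statements; the real method runs against the Durable Object's SQLite storage.

```typescript
// Hypothetical minimal executor: takes one SQL statement and runs it.
type SqlExec = (statement: string) => void;

function withRollback<T>(exec: SqlExec, name: string, body: () => T): T {
  exec(`SAVEPOINT ${name}`);
  try {
    return body();
  } finally {
    // Undo every write made since the savepoint, then discard the savepoint.
    // Runs whether body() returned or threw.
    exec(`ROLLBACK TO ${name}`);
    exec(`RELEASE ${name}`);
  }
}
```

Putting the rollback in `finally` is what keeps the endpoint side-effect-free even when the replay or reconcile step throws.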

jrf0110 and others added 4 commits March 21, 2026 11:15
Filter out 'running' status in the alarm pre-phase before calling
upsertContainerStatus(). Running is the steady-state for healthy agents
and a no-op in applyEvent(), so recording it just bloats the event table
(~720 events/hour/agent). Non-running statuses (stopped, error, unknown)
still get inserted for reconciler detection.
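
A minimal sketch of the pre-phase filter this commit message describes, assuming a hypothetical `Observation` shape and a 5-second alarm interval (the interval is taken from the review comment later in this thread):

```typescript
type ContainerStatus = "running" | "stopped" | "error" | "unknown";
type Observation = { agentId: string; status: ContainerStatus };

// Keep only the statuses the reconciler needs to see; "running" is a no-op
// in applyEvent(), so recording it only grows the event table.
function statusesToRecord(observed: Observation[]): Observation[] {
  return observed.filter((o) => o.status !== "running");
}

// One event per agent per 5 s tick works out to 3600 / 5 = 720 per hour,
// which matches the figure quoted in the commit message.
const eventsPerHourPerAgent = 3600 / 5;
```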
Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount
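
The response fields listed above could be assembled roughly like this. The `Action` shape, the action type names, and the `summarizeDryRun` helper are assumptions made for illustration; only the four field names come from the commit message.

```typescript
type Action = { type: string };
type DryRunResult = {
  actions: Action[];
  actionsEmitted: number;
  actionsByType: Record<string, number>;
  pendingEventCount: number;
};

function summarizeDryRun(actions: Action[], pendingEventCount: number): DryRunResult {
  // Count how many actions of each type the reconciler proposed.
  const actionsByType: Record<string, number> = {};
  for (const action of actions) {
    actionsByType[action.type] = (actionsByType[action.type] ?? 0) + 1;
  }
  return { actions, actionsEmitted: actions.length, actionsByType, pendingEventCount };
}
```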
* feat(claw): evaluate button-vs-card feature flag for PostHog experiment tracking

* fix(claw): move button-vs-card flag eval to CreateInstanceCard

Moves useFeatureFlagVariantKey('button-vs-card') from ClawDashboard
(which renders for all users including those with existing instances)
to CreateInstanceCard (which only renders for users who haven't
provisioned yet). This scopes the experiment exposure to users who
can actually see the create CTA, avoiding population dilution.

* feat(gastown): add POST /debug/reconcile-dry-run endpoint

Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount

* fix(gastown): drain pending events in debugDryRun() before reconciling

Wrap debugDryRun() in a SQLite savepoint so it can drain and apply
pending town_events (Phase 0) before running reconcile (Phase 1),
matching the real alarm loop behavior. The savepoint is rolled back
in a finally block so the endpoint remains fully side-effect-free.

Adds eventsDrained to the returned metrics.

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Pedro Heyerdahl <pedro@kilocode.ai>
Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>
…y debugging

Adds debugReplayEvents(from, to) method to Town.do.ts that queries all
town_events in a time range (regardless of processed_at), applies them
to reconstruct state transitions, runs the reconciler, and returns the
computed actions and a state snapshot. Uses a SQLite SAVEPOINT that is
rolled back so the endpoint remains fully side-effect-free.

Route: POST /debug/towns/:townId/replay-events
Body: { from: ISO, to: ISO }
Response: { eventsReplayed, actions, stateSnapshot }
Comment thread cloudflare-gastown/src/dos/Town.do.ts
@kilo-code-bot
Contributor

kilo-code-bot bot commented Mar 21, 2026

Code Review Summary

Status: 2 Issues Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 2 |
| SUGGESTION | 0 |


WARNING

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/src/dos/Town.do.ts | 3049 | Reporting every invariant violation to Sentry on each 5s alarm tick can flood duplicate events and exhaust quota. |

Other Observations (not in diff)

Issues found in unchanged code that cannot receive inline comments:

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/src/dos/Town.do.ts | 3831 | debugReplayEvents() still re-applies historical events on top of live state, so non-idempotent handlers can return approximate actions and snapshots instead of a faithful historical reconstruction. |

Files Reviewed (1 file)
  • cloudflare-gastown/src/dos/Town.do.ts - 2 issues

Reviewed by gpt-5.4-20260305 · 455,888 tokens

…afana dashboard panels (#1372)

- Extend writeEvent() to support double3-double10 fields for reconciler metrics
- Emit reconciler_tick event after each alarm tick with all 9 metrics
- Add Reconciler row to Grafana dashboard with 6 panels:
  1. Events drained per tick (timeseries)
  2. Actions emitted per tick by type (stacked bar)
  3. Side effects attempted/succeeded/failed (timeseries)
  4. Invariant violations (stat with >0 alert threshold)
  5. Reconciler wall clock time (timeseries with >500ms threshold)
  6. Pending event queue depth (gauge with >50 threshold)
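
The three panel thresholds listed above can be expressed as a simple check. This is only a sketch: the `TickMetrics` field names and the `breachedThresholds` helper are hypothetical, and the dashboard evaluates these thresholds in Grafana rather than in application code.

```typescript
type TickMetrics = {
  invariantViolations: number;
  wallClockMs: number;
  pendingEventCount: number;
};

// Returns the names of the panels whose threshold a single tick would trip.
function breachedThresholds(m: TickMetrics): string[] {
  const breached: string[] = [];
  if (m.invariantViolations > 0) breached.push("invariant_violations");
  if (m.wallClockMs > 500) breached.push("wall_clock_time");
  if (m.pendingEventCount > 50) breached.push("pending_queue_depth");
  return breached;
}
```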
Comment thread cloudflare-gastown/gastown-grafana-dash-1.json Outdated
jrf0110 and others added 2 commits March 23, 2026 10:54
…query

Add a caveat comment and response field to debugReplayEvents explaining
that events are re-applied on top of live state, not from a pre-window
snapshot — results are approximate, useful for debugging event flow but
not faithful historical reconstruction.
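
A toy model of why replaying on top of live state is only approximate. The event kinds and the `AgentState` shape are invented for illustration; the point is that a non-idempotent handler (here, a counter increment) is applied a second time when the live state already reflects the event.

```typescript
type ReplayEvent = { kind: "set_status"; status: string } | { kind: "retry_incremented" };
type AgentState = { status: string; retries: number };

function applyReplayEvent(state: AgentState, ev: ReplayEvent): AgentState {
  if (ev.kind === "set_status") {
    return { ...state, status: ev.status }; // idempotent: re-applying changes nothing
  }
  return { ...state, retries: state.retries + 1 }; // not idempotent: counts twice on replay
}

const history: ReplayEvent[] = [
  { kind: "retry_incremented" },
  { kind: "set_status", status: "working" },
  { kind: "retry_incremented" },
];

const preWindow: AgentState = { status: "idle", retries: 0 };
// Live state already includes the historical events.
const liveState = history.reduce(applyReplayEvent, preWindow);
// Faithful reconstruction would start from the pre-window snapshot.
const faithful = history.reduce(applyReplayEvent, preWindow);
// The endpoint instead starts from live state, so increments are doubled.
const replayedOnLive = history.reduce(applyReplayEvent, liveState);
```

In this model `faithful.retries` is 2 while `replayedOnLive.retries` is 4; the idempotent `status` field agrees in both.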

Fix the Grafana 'Pending Event Queue Depth' gauge to show the latest
row's double8 value instead of averaging across the time window.
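
To illustrate why the gauge fix matters, the sketch below compares the two aggregations over a hypothetical window of tick rows. The `TickRow` shape is an assumption; only the `double8` field name comes from the commit message.

```typescript
type TickRow = { timestamp: number; double8: number };

// Old behaviour: mean queue depth across the whole window.
function averageDepth(rows: TickRow[]): number | undefined {
  if (rows.length === 0) return undefined;
  return rows.reduce((sum, r) => sum + r.double8, 0) / rows.length;
}

// Fixed behaviour: depth reported by the most recent row.
function latestDepth(rows: TickRow[]): number | undefined {
  if (rows.length === 0) return undefined;
  return rows.reduce((best, r) => (r.timestamp > best.timestamp ? r : best)).double8;
}
```

For a queue that has drained from 40 to 10 to 0 within the window, the average still reports a backlog of about 16.7 while the latest value correctly reports 0.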
Each invariant violation now triggers Sentry.captureMessage with structured
context (invariant number, message, townId) as both extra data and tags.
Existing analytics event emission is preserved. Added TODO for future
auto-recovery of invariant #7 (working agent with no hook).
});

for (const violation of violations) {
Sentry.captureMessage(

WARNING: Persistent invariants will flood Sentry on every alarm tick

alarm() runs on the 5s active interval, so a long-lived invariant now calls Sentry.captureMessage() every tick until the town goes idle. One stuck town can easily generate ~17k duplicate Sentry events per day, which will drown the real signal and burn quota. Please dedupe or rate-limit these reports (for example, only emit when the invariant set changes) before sending them to Sentry.
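
One way to implement the suggested dedupe is sketched below: report only when the set of violated invariants differs from the previous tick. The `Violation` shape and the `createViolationReporter` helper are hypothetical, and the `report` callback stands in for the Sentry call. State is kept in memory, so it would reset whenever the object is evicted; this is not what the PR merged.

```typescript
type Violation = { invariant: number; message: string };

function createViolationReporter(report: (v: Violation) => void) {
  let lastKey = ""; // fingerprint of the invariant set seen on the previous tick
  return function onTick(violations: Violation[]): void {
    const key = violations
      .map((v) => v.invariant)
      .sort((a, b) => a - b)
      .join(",");
    if (key === lastKey) return; // same set as last tick: stay quiet
    lastKey = key;
    for (const v of violations) report(v);
  };
}

// Without dedupe, one stuck invariant reports once per 5 s tick:
// 86,400 s per day / 5 s = 17,280 events, the "~17k" quoted above.
const ticksPerDay = (24 * 60 * 60) / 5;
```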

@jrf0110 jrf0110 merged commit 62c7fe2 into main Mar 24, 2026
19 checks passed
@jrf0110 jrf0110 deleted the convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head branch March 24, 2026 16:58
@jrf0110
Contributor Author

jrf0110 commented Apr 6, 2026

Refinery code review passed. All quality gates pass.
