feat(gastown): emit reconciler metrics to Analytics Engine and add Grafana dashboard panels #1372
Merged
jrf0110 merged 1 commit into convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head on Mar 23, 2026
…afana dashboard panels

- Extend writeEvent() to support double3-double10 fields for reconciler metrics
- Emit reconciler_tick event after each alarm tick with all 9 metrics
- Add Reconciler row to Grafana dashboard with 6 panels:
  1. Events drained per tick (timeseries)
  2. Actions emitted per tick by type (stacked bar)
  3. Side effects attempted/succeeded/failed (timeseries)
  4. Invariant violations (stat with >0 alert threshold)
  5. Reconciler wall clock time (timeseries with >500ms threshold)
  6. Pending event queue depth (gauge with >50 threshold)
"interval": "",
"intervalFactor": 1,
"nullifySparse": false,
"query": "SELECT SUM(double8 * _sample_interval) / SUM(_sample_interval) AS pending_events FROM gastown_events WHERE $timeFilter AND blob1 = 'reconciler_tick' ORDER BY timestamp DESC LIMIT 1",
Contributor
WARNING: This gauge is showing a time-window average, not the current queue depth
SUM(double8 * _sample_interval) / SUM(_sample_interval) collapses every reconciler_tick in the selected range into a single weighted average, so the panel will not show the latest backlog value. For a queue-depth gauge we need the most recent double8 sample instead, and rawSql should be updated to match.
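To make the difference concrete, here is a minimal sketch that contrasts the two computations on in-memory rows. The `TickRow` shape, the function names, and the sample values are hypothetical and only stand in for what a `reconciler_tick` query might return; this is not the dashboard's actual query code.

```typescript
// Hypothetical row shape: one reconciler_tick sample as a query might return it.
interface TickRow {
  timestamp: number;      // epoch milliseconds
  double8: number;        // pending event count recorded at this tick
  sampleInterval: number; // stands in for _sample_interval
}

// What the flagged query computes: a single weighted average over the window.
function weightedAverage(rows: TickRow[]): number {
  let weighted = 0;
  let weight = 0;
  for (const r of rows) {
    weighted += r.double8 * r.sampleInterval;
    weight += r.sampleInterval;
  }
  return weight === 0 ? 0 : weighted / weight;
}

// What a queue-depth gauge needs: double8 from the most recent row.
function latestValue(rows: TickRow[]): number | undefined {
  if (rows.length === 0) return undefined;
  return rows.reduce((a, b) => (b.timestamp > a.timestamp ? b : a)).double8;
}

// A backlog that drains from 40 to 0 over three ticks.
const rows: TickRow[] = [
  { timestamp: 1000, double8: 40, sampleInterval: 1 },
  { timestamp: 2000, double8: 20, sampleInterval: 1 },
  { timestamp: 3000, double8: 0, sampleInterval: 1 },
];

console.log(weightedAverage(rows)); // 20
console.log(latestValue(rows));     // 0
```

With these sample rows the queue is already empty at the latest tick, yet the averaged figure still reports 20, which is the kind of stale reading the comment warns about.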
Contributor
Code Review Summary
Status: 1 Issues Found | Recommendation: Address before merge
Issue Details: WARNING
Other Observations (not in diff): None.
Files Reviewed: 3 files
Reviewed by gpt-5.4-20260305 · 550,264 tokens
c4c6ccb into convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head
2 checks passed
jrf0110 added a commit that referenced this pull request on Mar 24, 2026
#1373)

* fix: skip container_status events for running containers (#1368)

  Filter out 'running' status in the alarm pre-phase before calling upsertContainerStatus(). Running is the steady-state for healthy agents and a no-op in applyEvent(), so recording it just bloats the event table (~720 events/hour/agent). Non-running statuses (stopped, error, unknown) still get inserted for reconciler detection.

* feat(gastown): add POST /debug/reconcile-dry-run endpoint (#1367)

  Add a debug endpoint that runs the reconciler against current live state and returns the actions it would emit without applying them. This enables inspecting what the reconciler thinks should happen at any given moment.

  - Add debugDryRun() method to TownDO that calls reconciler.reconcile() and returns actions + metrics without calling applyAction()
  - Add POST /debug/towns/:townId/reconcile-dry-run route following the same unauthenticated debug pattern as GET /debug/towns/:townId/status
  - Response includes actions array, actionsEmitted count, actionsByType breakdown, and pendingEventCount

* feat(gastown): add debug dry-run endpoint with event draining (#1370)

* feat(claw): evaluate button-vs-card feature flag for PostHog experiment tracking

* fix(claw): move button-vs-card flag eval to CreateInstanceCard

  Moves useFeatureFlagVariantKey('button-vs-card') from ClawDashboard (which renders for all users including those with existing instances) to CreateInstanceCard (which only renders for users who haven't provisioned yet). This scopes the experiment exposure to users who can actually see the create CTA, avoiding population dilution.

* feat(gastown): add POST /debug/reconcile-dry-run endpoint

  Add a debug endpoint that runs the reconciler against current live state and returns the actions it would emit without applying them. This enables inspecting what the reconciler thinks should happen at any given moment.

  - Add debugDryRun() method to TownDO that calls reconciler.reconcile() and returns actions + metrics without calling applyAction()
  - Add POST /debug/towns/:townId/reconcile-dry-run route following the same unauthenticated debug pattern as GET /debug/towns/:townId/status
  - Response includes actions array, actionsEmitted count, actionsByType breakdown, and pendingEventCount

* fix(gastown): drain pending events in debugDryRun() before reconciling

  Wrap debugDryRun() in a SQLite savepoint so it can drain and apply pending town_events (Phase 0) before running reconcile (Phase 1), matching the real alarm loop behavior. The savepoint is rolled back in a finally block so the endpoint remains fully side-effect-free. Adds eventsDrained to the returned metrics.

  ---------

  Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
  Co-authored-by: Pedro Heyerdahl <pedro@kilocode.ai>
  Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>

* feat(gastown): add POST /debug/replay-events endpoint for event replay debugging

  Adds debugReplayEvents(from, to) method to Town.do.ts that queries all town_events in a time range (regardless of processed_at), applies them to reconstruct state transitions, runs the reconciler, and returns the computed actions and a state snapshot. Uses a SQLite SAVEPOINT that is rolled back so the endpoint remains fully side-effect-free.

  Route: POST /debug/towns/:townId/replay-events
  Body: { from: ISO, to: ISO }
  Response: { eventsReplayed, actions, stateSnapshot }

* feat(gastown): emit reconciler metrics to Analytics Engine and add Grafana dashboard panels (#1372)

  - Extend writeEvent() to support double3-double10 fields for reconciler metrics
  - Emit reconciler_tick event after each alarm tick with all 9 metrics
  - Add Reconciler row to Grafana dashboard with 6 panels:
    1. Events drained per tick (timeseries)
    2. Actions emitted per tick by type (stacked bar)
    3. Side effects attempted/succeeded/failed (timeseries)
    4. Invariant violations (stat with >0 alert threshold)
    5. Reconciler wall clock time (timeseries with >500ms threshold)
    6. Pending event queue depth (gauge with >50 threshold)

* fix(gastown): add replay caveat and fix Grafana pending-events gauge query

  Add a caveat comment and response field to debugReplayEvents explaining that events are re-applied on top of live state, not from a pre-window snapshot; results are approximate, useful for debugging event flow but not faithful historical reconstruction.

  Fix the Grafana 'Pending Event Queue Depth' gauge to show the latest row's double8 value instead of averaging across the time window.

* feat(gastown): add Sentry capture for reconciler invariant violations

  Each invariant violation now triggers Sentry.captureMessage with structured context (invariant number, message, townId) as both extra data and tags. Existing analytics event emission is preserved. Added TODO for future auto-recovery of invariant #7 (working agent with no hook).

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Pedro Heyerdahl <pedro@kilocode.ai>
Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>
Contributor
Author
Refinery code review passed. All quality gates pass.
Summary
- Extend writeEvent() in analytics.util.ts to support double3–double10 fields, enabling reconciler metrics emission without breaking existing callers (new fields default to 0).
- Add reconciler_tick event emission after each alarm tick in Town.do.ts, carrying all 9 ReconcilerMetrics fields (wallClockMs, eventsDrained, actionsEmitted, sideEffectsAttempted/Succeeded/Failed, invariantViolations, pendingEventCount) plus actionsByType as JSON in blob10.

Verification
- pnpm typecheck: passes (all workspace projects clean)

Visual Changes
N/A

Reviewer Notes
- The writeEvent() doubles array grew from 2 to 10 entries. Analytics Engine supports up to 20, so this is well within limits.
- actionsByType is stored as a JSON string in blob10 (via the label field) and parsed in Grafana using JSONExtractKeysAndValues(blob10, 'Float64').
- Dashboard queries use SUM(doubleN * _sample_interval) / SUM(_sample_interval) for weighted averages and SUM(doubleN * _sample_interval) for counts, which is appropriate aggregation for sampled AE data.
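
The backward-compatibility point in the notes above (new double fields default to 0, with actionsByType carried as a JSON string) can be sketched roughly as follows. This is an illustration under assumptions, not the code in analytics.util.ts: `padDoubles`, `encodeActionsByType`, and the action type names are hypothetical, and the real writeEvent() signature is not shown on this page.

```typescript
// Number of double slots after this PR (double1..double10), per the notes above.
const DOUBLE_SLOTS = 10;

// Hypothetical helper: pad a shorter list of doubles to a fixed width so that
// callers which only pass the original two values still produce a full-width
// row, with the missing slots set to 0.
function padDoubles(values: number[], width: number = DOUBLE_SLOTS): number[] {
  if (values.length > width) {
    throw new RangeError(`expected at most ${width} doubles, got ${values.length}`);
  }
  return [...values, ...new Array<number>(width - values.length).fill(0)];
}

// Hypothetical helper: serialize a per-type action count map into the string
// that would be stored in a single blob field.
function encodeActionsByType(actionsByType: Record<string, number>): string {
  return JSON.stringify(actionsByType);
}

// An older caller that only supplies two doubles.
const legacy = padDoubles([12, 1]);

// The action type names here are made up for illustration.
const label = encodeActionsByType({ dispatch_agent: 2, close_bead: 1 });

console.log(legacy.length); // 10
console.log(label);         // {"dispatch_agent":2,"close_bead":1}
```

Padding at the write boundary keeps every row the same width, so a query that reads a later slot such as double8 sees 0 rather than a missing value for events that never set it.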