
feat: system backups and alerts pages#101

Draft
andre8244 wants to merge 25 commits into main from alerts-and-backup-ui

Conversation

@andre8244
Collaborator

@andre8244 andre8244 commented May 12, 2026

📋 Description

  • Refactor System backups page
  • Refactor Alerts page

Related Issue: #83

🚀 Testing Environment

To trigger a fresh deployment of all services in the PR preview environment, comment:

update deploy

Automatic PR environments:

✅ Merge Checklist

Code Quality:

  • Backend Tests
  • Collect Tests
  • Sync Tests
  • Frontend Tests

Builds:

  • Backend Build
  • Collect Build
  • Sync Build
  • Frontend Build

@edospadoni edospadoni deployed to alerts-and-backup-ui - my-backend-qa PR #101 May 12, 2026 15:42 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-collect-qa PR #101 May 12, 2026 15:42 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-frontend-qa PR #101 May 12, 2026 15:42 — with Render Active
@github-actions
Contributor

🔗 Redirect URIs Added to Logto

The following redirect URIs have been automatically added to the Logto application configuration:

Redirect URIs:

  • https://my-proxy-qa-pr-101.onrender.com/login-redirect

Post-logout redirect URIs:

  • https://my-proxy-qa-pr-101.onrender.com/login

These will be automatically removed when the PR is closed or merged.

@andre8244 andre8244 changed the base branch from main to feat/alerts-config-refactor May 12, 2026 15:44
@andre8244 andre8244 force-pushed the alerts-and-backup-ui branch from 80211a6 to f7d17d2 Compare May 12, 2026 15:45
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-frontend-qa PR #101 May 12, 2026 15:45 — with Render Active
@andre8244 andre8244 force-pushed the alerts-and-backup-ui branch from f7d17d2 to a9c986f Compare May 13, 2026 11:52
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-frontend-qa PR #101 May 13, 2026 11:52 — with Render Active
@edospadoni edospadoni force-pushed the feat/alerts-config-refactor branch from 74168a3 to 08a07dd Compare May 13, 2026 13:42
@andre8244 andre8244 force-pushed the alerts-and-backup-ui branch from a9c986f to eff1f04 Compare May 15, 2026 16:03
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-frontend-qa PR #101 May 15, 2026 16:03 — with Render Active
@andre8244 andre8244 changed the title refactor: system backups page feat: system backups and alerts pages May 15, 2026
@andre8244 andre8244 self-assigned this May 15, 2026
@edospadoni edospadoni force-pushed the feat/alerts-config-refactor branch from feea17d to c4c7f2b Compare May 20, 2026 08:04
Base automatically changed from feat/alerts-config-refactor to main May 20, 2026 08:05
@edospadoni edospadoni force-pushed the alerts-and-backup-ui branch from eff1f04 to 9d111a7 Compare May 20, 2026 09:16
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-frontend-qa PR #101 May 20, 2026 09:16 — with Render Active
@edospadoni
Member

update deploy

@github-actions
Contributor

🚀 Build triggers updated!

All .render-build-trigger files have been automatically updated to ensure fresh deployments of all services in the PR preview environment.

@edospadoni edospadoni deployed to alerts-and-backup-ui - my-frontend-qa PR #101 May 20, 2026 09:20 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-backend-qa PR #101 May 20, 2026 09:20 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-collect-qa PR #101 May 20, 2026 09:20 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-mimir-qa PR #101 May 20, 2026 09:20 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-backend-qa PR #101 May 20, 2026 10:14 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-backend-qa PR #101 May 20, 2026 15:37 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-backend-qa PR #101 May 21, 2026 12:31 — with Render Active
@github-actions
Contributor

github-actions Bot commented May 21, 2026

🚨 Breaking My API change detected

Preview documentation

Structural change details

Modified (5)

  • GET /filters/alerts
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • [Breaking] Properties removed: systems, severities, organizations
            • Removing a resource is always breaking unless it was deprecated before [Breaking]
    • [Breaking] Query parameters removed: organization_id, include
      • Removing a resource is always breaking unless it was deprecated before [Breaking]
  • GET /filters/applications
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • [Breaking] Properties removed: systems, organizations
            • Removing a resource is always breaking unless it was deprecated before [Breaking]
  • GET /filters/systems
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • [Breaking] Property removed: organizations
            • Removing a resource is always breaking unless it was deprecated before [Breaking]
  • GET /filters/users
    • Response modified: 200
      • Content type modified: application/json
        • Property modified: data
          • [Breaking] Property removed: organizations
            • Removing a resource is always breaking unless it was deprecated before [Breaking]
  • GET /organizations
    • Query parameters added: page, page_size, search, name, description, type, created_by
Powered by Bump.sh

@edospadoni edospadoni deployed to alerts-and-backup-ui - my-frontend-qa PR #101 May 21, 2026 15:19 — with Render Active
andre8244 and others added 24 commits May 26, 2026 18:05
Auto-updated .render-build-trigger files to ensure all services
are deployed in PR preview environments.

🤖 Generated by GitHub Actions

Builds the operational alerts surface on top of Mimir Alertmanager: a
single paginated list endpoint plus per-system silence management,
resolved-alert history, and aggregations the UI uses to render the
overview page.

Endpoints:
  - GET /alerts (cross-hierarchy / single-tenant / sub-tree scoping,
    multi-value label filters, sorting on starts_at/severity/alertname,
    pagination with stable fingerprint tiebreaker)
  - GET /alerts/history (paginated alert_history rows with date range)
  - GET /alerts/totals / /trend / /stats (severity buckets, time-series
    deltas, top-N alertname/system_key, MTTR/MTBF)
  - GET /alerts/{fingerprint}/activity (silence/unsilence audit timeline,
    populated transparently by the silence endpoints)
  - GET /systems/{id}/alerts and friends scoped to a single system

Each alert in the list is enriched with a local-DB system object
(id/name/type) so the frontend doesn't need a per-row round-trip.
Per-tenant fan-out failures are surfaced as warnings rather than
failing the whole request.

Gated on the existing read:systems / manage:systems permissions:
read for the list endpoints, manage for silence create/update/delete.
…GET /alerts

Stamp system_type at ingest (collect) alongside the other system_* labels
and drop the per-request DB lookup that enriched each alert with a separate
system object. Saves a SELECT on every GET /alerts and removes a redundant
field the frontend never read.

POST /alerts/config used to 500 on invalid emails (slices lacked
binding `dive`). Now binding, semantic, and entity-layer failures all
go through response.ValidationFailed with JSON-path keys
(email_recipients.2.address) and stable codes. Same envelope on the
four silence endpoints. Fix reflect.Value.String() leaking
"<int Value>" in response.ParseValidationErrors.

OpenAPI: reusable SilenceValidationFailed response + inline examples.
The status-gated loop in db-migrate/db-migrate-qa called run_migration.sh
status, which only checks for a schema_migrations row and ignores the
recorded checksum. Drift on an already-applied migration was silently
skipped. Call apply directly so report_checksum_drift fires, and exit
the loop on the first non-zero status.

GET /api/filters/alerts ignored alerts firing in Mimir, so systems and
organizations with active-but-unresolved alerts never reached the
dropdowns. Fan-out to Mimir alongside alert_history, dedupe by
system_key / logto_id, resolve org names in a single
unified_organizations lookup. Cache per-scope with singleflight (TTL
15s) mirroring /alerts/totals, and surface per-tenant Mimir failures
in a warnings[] field.
The scope-aware aggregation of systems/severities/organizations from
alert_history — and the per-tenant Mimir fan-out it grew in 4a37666 —
could take 17 s on large hierarchies (and returned thousands of systems
anyway, defeating the dropdown). Frontend will populate the systems and
organizations dropdowns from /api/systems and /api/organizations;
severities are a fixed enum the UI hardcodes.

…nizations pagination

/filters/{systems,applications,users} no longer return systems or
organizations; those dropdowns are populated by /api/systems and
/api/organizations, which support search and pagination and scale past
the embedded DISTINCT lists. /filters/systems keeps products,
created_by and versions (small, bounded). Dead helpers removed.

/organizations was returning broken pages: GetAllOrganizationsPaginated
fetched up to pageSize*10 from each org table, then paginated in memory,
so on tenants past a few thousand orgs total_count was wrong and pages
past the first were truncated. Now: single SQL UNION ALL across the
three org tables with RBAC scope, filters, search, ORDER and
LIMIT/OFFSET pushed to the database; COUNT(*) for true totals.

OpenAPI updated: /filters/{systems,applications,users} response shapes,
plus /organizations query parameters (page, page_size, search, name,
description, type, created_by) that the handler already accepted.
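
The shape of the /organizations fix can be sketched roughly as follows. The table names, column names and the default page size below are placeholders invented for the illustration (the commit only says "three org tables"). The point is that ordering, LIMIT and OFFSET are applied once to the unioned set, with the total coming from a separate COUNT(*).

```go
package main

import "fmt"

// orgPageQuery is a hypothetical query shape: table and column names are
// placeholders, and RBAC scope, filters and search would add WHERE clauses
// inside each branch of the UNION ALL.
const orgPageQuery = `
SELECT id, name, type FROM (
    SELECT id, name, 'distributor' AS type FROM distributors
    UNION ALL
    SELECT id, name, 'reseller' AS type FROM resellers
    UNION ALL
    SELECT id, name, 'customer' AS type FROM customers
) AS orgs
ORDER BY name, id
LIMIT $1 OFFSET $2`

// limitOffset converts a 1-based page and a page size into LIMIT/OFFSET
// values, clamping invalid input to the first page and an assumed default.
func limitOffset(page, pageSize int) (limit, offset int) {
	if page < 1 {
		page = 1
	}
	if pageSize < 1 {
		pageSize = 20 // assumed default, not taken from the source
	}
	return pageSize, (page - 1) * pageSize
}

// totalPages derives the page count from the COUNT(*) result.
func totalPages(totalCount, pageSize int) int {
	if pageSize < 1 || totalCount <= 0 {
		return 0
	}
	return (totalCount + pageSize - 1) / pageSize
}

func main() {
	limit, offset := limitOffset(3, 50)
	fmt.Println(orgPageQuery)
	fmt.Printf("LIMIT=%d OFFSET=%d pages=%d\n", limit, offset, totalPages(4321, 50))
}
```

Because the database applies the window after the union, page 3 is computed from the full ordered set rather than from a capped per-table prefix.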
…ir command

Refuse to replay a migration when later siblings are already recorded
(was failing on CREATE OR REPLACE VIEW after a later migration converted
the object to a MATERIALIZED VIEW). The error now lists the missing
numbers and the exact repair command. Add 'repair <num...>' subcommand
to run_migration.sh and a matching 'make db-repair MIGRATIONS=...' target.
…fig_layers

Migrations 023 (alert_activity) and 024 (alert_config_layers) had been
applied but never reflected in schema.sql, leaving the file out of sync
with the cumulative DB state. Append the equivalent CREATE TABLE /
INDEX / COMMENT statements so a fresh-from-schema install matches a
fully-migrated DB.

The /api/alerts/totals endpoint used to fan out to Mimir once per tenant
on every request, with bounded concurrency and a 10s timeout. On owner
dashboards covering hundreds of tenants this was 21s in QA.

Move the fan-out off the user request path:

- New alerts_totals_by_org table (migration 025) carries per-org counts
  by severity and muted state.
- New AlertsTotalsRefresher cron in collect refreshes the table every
  60s with one Mimir call per tenant (concurrency 50, 30s timeout).
  Per-tenant failures are aggregated into a single warn line per cycle
  to avoid log spam when Mimir is down.
- GetAlertsTotals now answers with a single SUM scoped by the same
  resolveOrgScope that gated the old code path. RBAC and hierarchy
  semantics are unchanged. The history COUNT keeps the bare-COUNT
  shortcut for the owner-all path. When the freshest row in scope is
  older than 5 min the response carries a stale-data warning.

The fan-out constants (timeout, concurrency) are renamed to
mimirFanout* since they're shared with the silences fan-out and were
no longer totals-specific. The Redis-less in-process cache and
singleflight infrastructure are removed: the new path is fast enough
that they add no value.

Tested end-to-end against a local Mimir + 3 systems across 3 customers:
counts match Mimir's view exactly for owner, reseller subtrees,
customer-pinned scopes, and include=descendants drill-downs, before
and after resolve/silence events. Endpoint latency ~5ms.

(Schema.sql also adds the new alerts_totals_by_org table next to the
023/024 entries.)
…system

POST /api/systems/register is public (the secret IS the credential)
but until now testers had to curl it by hand after every create-system,
otherwise collect would reject the appliance with 401 invalid system
credentials.

- New `register-system <system_secret>` subcommand completes the
  handshake against /systems/register and prints the canonical
  system_key. Skips the OIDC login since the endpoint is unauthed.
- New `--register` flag on create-system chains the registration to
  the create step so a single command yields a system that's
  immediately usable for pushing alerts through collect.

…lated the DB

On a fresh dev environment the boot path is dev-up -> run -> db-migrate.
The backend's database.Init() applies schema.sql when it sees no
tables, so by the time db-migrate runs the schema is already at head
but schema_migrations is empty. Running the migrations from 001 then
trips on non-idempotent statements: 010's CREATE OR REPLACE VIEW
unified_organizations errors with "is not a view" because 012's
MATERIALIZED VIEW already sits on that name.

When apply_migration sees an empty schema_migrations alongside an
already-populated public schema, treat the DB as freshly baselined
from schema.sql and INSERT a row for every on-disk migration without
running it. The very first apply call in a make db-migrate run does
this; the rest see "already applied" and no-op. Existing DBs and
truly empty DBs are unaffected.

This relies on the policy that schema.sql is always kept in sync with
migrations/ — a new migration whose effect is NOT folded into
schema.sql would be silently marked applied on a fresh init. The
runner and the migrations README spell out the invariant; the
schema_sync project memory enforces it on the workflow side.

Also fixes a stdin-leak in the baseline loop: podman run -i consumed
the while-read pipe, so only the first migration was inserted. Added
</dev/null on the inner call.
The /alerts/totals endpoint's history figure is a dashboard counter
('Alerts in history'), and exact accuracy is invisible to the user.
On QA the table holds 352k rows and an unconditional COUNT(*) takes
~4.8s — an index-only scan over a stale visibility map (100k+ heap
fetches). Combined with Render's network latency and post-deploy cold
pool warmup, the endpoint hit the 30s gateway timeout on the first few
calls even after we'd already moved the active counts off Mimir.

Replace the owner-all-scope branch with pg_class.reltuples — the
planner's row estimate maintained by autovacuum. Sub-millisecond
(0.06ms vs 4810ms in QA), lags reality by at most the autovacuum
interval (measured ~9% on QA between checkpoints), and is exactly what
this counter is for. Scoped queries (single tenant / IN-list) keep
their exact COUNT(*) — those are bounded by the index on
organization_id and don't suffer the same pathology. If reltuples is
negative (table never analyzed), fall back to the exact COUNT(*) so
fresh installs still render a sensible number on the first hit.
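
The estimate-or-exact choice can be sketched as below. The two SQL strings are assumptions about what such queries might look like for the alert_history table, and `historyCount` takes the reltuples value and an exact-count callback as parameters so the sketch stays free of a database connection.

```go
package main

import "fmt"

// Hypothetical queries for the two paths; the real statements may differ.
const (
	estimateQuery = `SELECT reltuples FROM pg_class WHERE oid = 'alert_history'::regclass`
	exactQuery    = `SELECT COUNT(*) FROM alert_history`
)

// historyCount returns the planner estimate for the owner-all scope and the
// exact count otherwise. A negative reltuples means the table was never
// analyzed, so that case also falls back to the exact count.
func historyCount(ownerAllScope bool, reltuples float64, exact func() (int64, error)) (int64, error) {
	if ownerAllScope && reltuples >= 0 {
		return int64(reltuples), nil
	}
	return exact()
}

func main() {
	exact := func() (int64, error) { return 352000, nil }

	n, _ := historyCount(true, 340123, exact)
	fmt.Println("owner-all, analyzed table:", n)

	n, _ = historyCount(true, -1, exact)
	fmt.Println("owner-all, never analyzed:", n)

	n, _ = historyCount(false, 340123, exact)
	fmt.Println("scoped query:", n)
}
```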
@andre8244 andre8244 force-pushed the alerts-and-backup-ui branch from 661c1ea to f895e38 Compare May 26, 2026 16:05
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-backend-qa PR #101 May 26, 2026 16:05 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-collect-qa PR #101 May 26, 2026 16:05 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-mimir-qa PR #101 May 26, 2026 16:05 — with Render Active
@edospadoni edospadoni deployed to alerts-and-backup-ui - my-frontend-qa PR #101 May 26, 2026 16:05 — with Render Active

Labels

None yet

Projects

None yet


2 participants