feat: system backups and alerts pages#101
Draft
andre8244 wants to merge 25 commits into
Contributor
🔗 Redirect URIs Added to Logto
The following redirect URIs have been automatically added to the Logto application configuration:
Redirect URIs:
Post-logout redirect URIs:
These will be automatically removed when the PR is closed or merged.
Member
update deploy
Contributor
🚀 Build triggers updated! All
Contributor
🚨 Breaking My API change detected
Structural change details
Modified (5)
Powered by Bump.sh
Auto-updated .render-build-trigger files to ensure all services are deployed in PR preview environments. 🤖 Generated by GitHub Actions
Builds the operational alerts surface on top of Mimir Alertmanager: a
single paginated list endpoint plus per-system silence management,
resolved-alert history, and aggregations the UI uses to render the
overview page.
Endpoints:
- GET /alerts (cross-hierarchy / single-tenant / sub-tree scoping,
multi-value label filters, sorting on starts_at/severity/alertname,
pagination with stable fingerprint tiebreaker)
- GET /alerts/history (paginated alert_history rows with date range)
- GET /alerts/totals / /trend / /stats (severity buckets, time-series
deltas, top-N alertname/system_key, MTTR/MTBF)
- GET /alerts/{fingerprint}/activity (silence/unsilence audit timeline,
populated transparently by the silence endpoints)
- GET /systems/{id}/alerts and friends scoped to a single system
Each alert in the list is enriched with a local-DB system object
(id/name/type) so the frontend doesn't need a per-row round-trip.
Per-tenant fan-out failures are surfaced as warnings rather than
failing the whole request.
Gated on the existing read:systems / manage:systems permissions:
read for the list endpoints, manage for silence create/update/delete.
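The "stable fingerprint tiebreaker" mentioned for GET /alerts can be illustrated with a minimal sketch. The `Alert` type, its fields, and both function names are hypothetical and only serve this example; the handler's real types are not shown in this PR text. The point is that when two alerts share the same starts_at, falling back to the fingerprint keeps page boundaries identical across requests.

```go
package main

import "sort"

// Alert is a hypothetical, trimmed-down alert shape used only for this sketch.
type Alert struct {
	Fingerprint string
	Severity    string
	StartsAt    int64 // Unix seconds
}

// sortByStartsAt orders alerts newest first and breaks ties on the
// fingerprint, so alerts with equal timestamps always come out in the
// same order regardless of how the input was ordered.
func sortByStartsAt(alerts []Alert) {
	sort.Slice(alerts, func(i, j int) bool {
		if alerts[i].StartsAt != alerts[j].StartsAt {
			return alerts[i].StartsAt > alerts[j].StartsAt
		}
		return alerts[i].Fingerprint < alerts[j].Fingerprint
	})
}

// page returns the 1-based page of the given size, or an empty slice
// when the arguments are invalid or the page is past the end.
func page(alerts []Alert, pageNum, pageSize int) []Alert {
	if pageNum < 1 || pageSize < 1 {
		return []Alert{}
	}
	start := (pageNum - 1) * pageSize
	if start >= len(alerts) {
		return []Alert{}
	}
	end := start + pageSize
	if end > len(alerts) {
		end = len(alerts)
	}
	return alerts[start:end]
}
```

Without the tiebreaker, `sort.Slice` gives no ordering guarantee for equal keys, so the same alert could appear on two pages or on none.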
…GET /alerts
Stamp system_type at ingest (collect) alongside the other system_* labels and drop the per-request DB lookup that enriched each alert with a separate system object. Saves a SELECT on every GET /alerts and removes a redundant field the frontend never read.
…UNT shortcut, in-process cache
POST /alerts/config used to 500 on invalid emails (slices lacked binding `dive`). Now binding, semantic, and entity-layer failures all go through response.ValidationFailed with JSON-path keys (email_recipients.2.address) and stable codes. Same envelope on the four silence endpoints. Fix reflect.Value.String() leaking "<int Value>" in response.ParseValidationErrors. OpenAPI: reusable SilenceValidationFailed response + inline examples.
The status-gated loop in db-migrate/db-migrate-qa called run_migration.sh status, which only checks for a schema_migrations row and ignores the recorded checksum. Drift on an already-applied migration was silently skipped. Call apply directly so report_checksum_drift fires, and exit the loop on the first non-zero status.
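The difference between a presence check and a checksum check can be sketched as follows. The real runner is a shell script (run_migration.sh); this Go version is only an illustration, and the SHA-256 hash, the `recorded` map, and all names here are assumptions rather than the script's actual mechanics.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// migrationState is a hypothetical classification used for this sketch.
type migrationState int

const (
	statePending migrationState = iota // no row recorded for the migration
	stateApplied                       // row recorded and checksum matches
	stateDrifted                       // row recorded but the file changed since
)

// checksum hashes the migration body. SHA-256 is an assumption; the
// message above does not say which algorithm the runner uses.
func checksum(sql string) string {
	sum := sha256.Sum256([]byte(sql))
	return hex.EncodeToString(sum[:])
}

// classify compares the on-disk SQL against the recorded checksum. A
// check that only looks for the row would report stateApplied in the
// drifted case, which is the bug described above.
func classify(recorded map[string]string, number, sql string) migrationState {
	want, ok := recorded[number]
	if !ok {
		return statePending
	}
	if want != checksum(sql) {
		return stateDrifted
	}
	return stateApplied
}
```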
GET /api/filters/alerts ignored alerts firing in Mimir, so systems and organizations with active-but-unresolved alerts never reached the dropdowns. Fan-out to Mimir alongside alert_history, dedupe by system_key / logto_id, resolve org names in a single unified_organizations lookup. Cache per-scope with singleflight (TTL 15s) mirroring /alerts/totals, and surface per-tenant Mimir failures in a warnings[] field.
The scope-aware aggregation of systems/severities/organizations from alert_history — and the per-tenant Mimir fan-out it grew in 4a37666 — could take 17 s on large hierarchies (and returned thousands of systems anyway, defeating the dropdown). Frontend will populate the systems and organizations dropdowns from /api/systems and /api/organizations; severities are a fixed enum the UI hardcodes.
…nizations pagination
/filters/{systems,applications,users} no longer return systems or
organizations; those dropdowns are populated by /api/systems and
/api/organizations, which support search and pagination and scale past
the embedded DISTINCT lists. /filters/systems keeps products,
created_by and versions (small, bounded). Dead helpers removed.
/organizations was returning broken pages: GetAllOrganizationsPaginated
fetched up to pageSize*10 from each org table, then paginated in memory,
so on tenants past a few thousand orgs total_count was wrong and pages
past the first were truncated. Now: single SQL UNION ALL across the
three org tables with RBAC scope, filters, search, ORDER and
LIMIT/OFFSET pushed to the database; COUNT(*) for true totals.
OpenAPI updated: /filters/{systems,applications,users} response shapes,
plus /organizations query parameters (page, page_size, search, name,
description, type, created_by) that the handler already accepted.
…ir command
Refuse to replay a migration when later siblings are already recorded (was failing on CREATE OR REPLACE VIEW after a later migration converted the object to a MATERIALIZED VIEW). The error now lists the missing numbers and the exact repair command. Add 'repair <num...>' subcommand to run_migration.sh and a matching 'make db-repair MIGRATIONS=...' target.
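As an illustration of the "later siblings already recorded" check, the sketch below computes which migration numbers would be reported as missing. It is written in Go for consistency with the backend, while the actual check lives in run_migration.sh; `missingEarlier` and its parameters are hypothetical.

```go
package main

// missingEarlier returns the on-disk migration numbers that are not
// recorded as applied but are lower than the highest recorded number.
// A non-empty result means replaying them could collide with objects
// that a later, already-applied migration has since changed.
func missingEarlier(onDisk, applied []int) []int {
	recorded := make(map[int]bool, len(applied))
	highest := -1
	for _, n := range applied {
		recorded[n] = true
		if n > highest {
			highest = n
		}
	}
	var missing []int
	for _, n := range onDisk {
		if n < highest && !recorded[n] {
			missing = append(missing, n)
		}
	}
	return missing
}
```

Migrations numbered above the highest recorded one are ordinary pending work and are deliberately not reported.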
…fig_layers
Migrations 023 (alert_activity) and 024 (alert_config_layers) had been applied but never reflected in schema.sql, leaving the file out of sync with the cumulative DB state. Append the equivalent CREATE TABLE / INDEX / COMMENT statements so a fresh-from-schema install matches a fully-migrated DB.
The /api/alerts/totals endpoint used to fan out to Mimir once per tenant on every request, with bounded concurrency and a 10s timeout. On owner dashboards covering hundreds of tenants this was 21s in QA.
Move the fan-out off the user request path:
- New alerts_totals_by_org table (migration 025) carries per-org counts by severity and muted state.
- New AlertsTotalsRefresher cron in collect refreshes the table every 60s with one Mimir call per tenant (concurrency 50, 30s timeout). Per-tenant failures are aggregated into a single warn line per cycle to avoid log spam when Mimir is down.
- GetAlertsTotals now answers with a single SUM scoped by the same resolveOrgScope that gated the old code path. RBAC and hierarchy semantics are unchanged.
The history COUNT keeps the bare-COUNT shortcut for the owner-all path. When the freshest row in scope is older than 5 min the response carries a stale-data warning.
The fan-out constants (timeout, concurrency) are renamed to mimirFanout* since they're shared with the silences fan-out and were no longer totals-specific.
The Redis-less in-process cache and singleflight infrastructure are removed: the new path is fast enough that they add no value.
Tested end-to-end against a local Mimir + 3 systems across 3 customers: counts match Mimir's view exactly for owner, reseller subtrees, customer-pinned scopes, and include=descendants drill-downs, before and after resolve/silence events. Endpoint latency ~5ms.
(Schema.sql also adds the new alerts_totals_by_org table next to the 023/024 entries.)
…system
POST /api/systems/register is public (the secret IS the credential) but until now testers had to curl it by hand after every create-system, otherwise collect would reject the appliance with 401 invalid system credentials.
- New `register-system <system_secret>` subcommand completes the handshake against /systems/register and prints the canonical system_key. Skips the OIDC login since the endpoint is unauthed.
- New `--register` flag on create-system chains the registration to the create step so a single command yields a system that's immediately usable for pushing alerts through collect.
…lated the DB
On a fresh dev environment the boot path is dev-up -> run -> db-migrate. The backend's database.Init() applies schema.sql when it sees no tables, so by the time db-migrate runs the schema is already at head but schema_migrations is empty. Running the migrations from 001 then trips on non-idempotent statements: 010's CREATE OR REPLACE VIEW unified_organizations errors with "is not a view" because 012's MATERIALIZED VIEW already sits on that name.
When apply_migration sees an empty schema_migrations alongside an already-populated public schema, treat the DB as freshly baselined from schema.sql and INSERT a row for every on-disk migration without running it. The very first apply call in a make db-migrate run does this; the rest see "already applied" and no-op. Existing DBs and truly empty DBs are unaffected.
This relies on the policy that schema.sql is always kept in sync with migrations/: a new migration whose effect is NOT folded into schema.sql would be silently marked applied on a fresh init. The runner and the migrations README spell out the invariant; the schema_sync project memory enforces it on the workflow side.
Also fixes a stdin-leak in the baseline loop: podman run -i consumed the while-read pipe, so only the first migration was inserted. Added </dev/null on the inner call.
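The baseline decision described above depends on two observations: how many rows schema_migrations holds and whether the public schema already has tables. A minimal sketch of that decision table follows, in Go rather than the runner's shell; the `action` type and `decide` function are hypothetical names.

```go
package main

// action is a hypothetical result type for this sketch.
type action int

const (
	// runMigrations is the normal path: apply whatever is pending.
	runMigrations action = iota
	// baselineFromSchema records every on-disk migration as applied
	// without executing it, because schema.sql already built the schema.
	baselineFromSchema
)

// decide picks the path from the number of recorded migrations and the
// number of tables found in the public schema.
func decide(recordedMigrations, publicTables int) action {
	if recordedMigrations == 0 && publicTables > 0 {
		return baselineFromSchema
	}
	// Covers both existing DBs (rows already recorded) and truly
	// empty DBs (nothing recorded, nothing created yet).
	return runMigrations
}
```

The baseline branch is only safe under the invariant stated above, that schema.sql already contains the effect of every migration on disk.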
The /alerts/totals endpoint's history figure is a dashboard counter
('Alerts in history'), and exact accuracy is invisible to the user.
On QA the table holds 352k rows and an unconditional COUNT(*) takes
~4.8s — an index-only scan over a stale visibility map (100k+ heap
fetches). Combined with Render's network latency and post-deploy cold
pool warmup, the endpoint hit the 30s gateway timeout on the first few
calls even after we'd already moved the active counts off Mimir.
Replace the owner-all-scope branch with pg_class.reltuples — the
planner's row estimate maintained by autovacuum. Sub-millisecond
(0.06ms vs 4810ms in QA), lags reality by at most the autovacuum
interval (measured ~9% on QA between checkpoints), and is exactly what
this counter is for. Scoped queries (single tenant / IN-list) keep
their exact COUNT(*) — those are bounded by the index on
organization_id and don't suffer the same pathology. If reltuples is
negative (table never analyzed), fall back to the exact COUNT(*) so
fresh installs still render a sensible number on the first hit.
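The estimate-or-exact choice for the owner-all scope can be sketched as below. This is an illustration under assumptions: `historyCount` and its callback are hypothetical, and the reltuples value is taken as already read from pg_class, where a negative value means the table has never been analyzed.

```go
package main

// historyCount returns the figure to show for the owner-all scope and
// whether it is an estimate. reltuples is the planner's row estimate
// for the table; exactCount is a hypothetical callback that runs the
// exact COUNT(*) and is only invoked when no estimate is available.
func historyCount(reltuples float64, exactCount func() (int64, error)) (int64, bool, error) {
	if reltuples < 0 {
		// Table never analyzed: fall back to the exact count so a
		// fresh install still shows a sensible number.
		n, err := exactCount()
		return n, false, err
	}
	return int64(reltuples), true, nil
}
```

Passing the exact count as a callback keeps the slow query off the hot path unless the fallback is actually needed.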
📋 Description
Related Issue: #83
🚀 Testing Environment
To trigger a fresh deployment of all services in the PR preview environment, comment:
Automatic PR environments:
✅ Merge Checklist
Code Quality:
Builds: