[reliability] Daily Reliability Review - 2026-05-22 #34137

### Executive Summary

For the **last 24 hours** in the `gh-aw` Sentry project (org `github`), telemetry shows a small, focused set of operational failures and several observability gaps that are higher-signal than the failures themselves.

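- **Total spans:** 19,985 (4,016 `gen_ai` + 7,776 `http.server` + 5,834 `http.client` + 2,380 `default`). The per-op counts sum to 20,006, which is 21 more than the reported total, so treat the total as approximate.
- **Conclusion spans (have `gh-aw.run.status`):** 1,987, of which 1,954 `success`, 30 `failure`, 0 `timeout`, 0 `cancelled`. These four values account for 1,984 spans; the status of the remaining 3 was not broken out.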
- **`errors` dataset:** 0 events. **`logs` dataset:** 0 events. (Reported explicitly — these datasets are empty, not skipped.)
- **OTLP self-reported export failures (`gh-aw.otlp.export_errors:>0`):** 0.
- **Confirmed operational failures:** all 30 are safe-output validation rejections (item-count and body-length rules), concentrated in **Smoke Copilot (20)**, **LintMonster (4)**, **Daily CLI Tools Exploratory Tester (4)**, **Deployment Incident Monitor (2)**.
- **Instrumentation gaps (high signal):** `span.status`, `gen_ai.response.finish_reasons`, `release`, and `service.version` are null across **100% of spans** in the 24h window, including agent conclusion spans where the emitter is supposed to populate them.

Overall health: **operationally green** (1.5% failure rate on conclusion spans, all from a known validation pattern), but **observability is degraded**. Three signals are unreadable in the Sentry query layer: runtime outcome on the OTLP `status.code` channel, agent stop reason on `gen_ai.response.finish_reasons`, and release correlation. As a result, traces can look healthier than they are, and length truncation cannot be detected.
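
The 1.5% figure can be reproduced from the conclusion-span counts in the summary. The following is a minimal sketch of that arithmetic; the object shape and function names are illustrative and not part of any gh-aw tooling.

```javascript
// Counts copied from the Executive Summary; the object shape is illustrative.
const conclusion = { total: 1987, success: 1954, failure: 30, timeout: 0, cancelled: 0 };

// Failure rate over all conclusion spans, as a percentage.
function failureRatePct(c) {
  return (c.failure / c.total) * 100;
}

// Conclusion spans not covered by the four reported status values.
function unaccounted(c) {
  return c.total - (c.success + c.failure + c.timeout + c.cancelled);
}

console.log(failureRatePct(conclusion).toFixed(1)); // "1.5"
console.log(unaccounted(conclusion)); // 3
```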

### Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | Smoke Copilot | 20 conclusion spans with `gh-aw.run.status:failure` — safe-output validation rejects (`create_discussion 'body' is too short (minimum 64 characters)`, `Too many items of type 'add_comment'. Maximum allowed: 2.`) | `gh-aw.run.status:failure gh-aw.workflow.name:"Smoke Copilot"` returned 20 spans; trace `b6215698c10f141728d30cc4ef48fe93` and trace `ced2ef961311a5be0aafbfab945933c0` both repeat the same body-length and item-count rejection | Fix Copilot smoke prompt to emit `create_discussion` bodies ≥ 64 chars and respect `add_comment` cap of 2; the same payload was retried at least 4 times across the window |
| P1 | Daily CLI Tools Exploratory Tester | 4 failures — `Too many items of type 'create_issue'. Maximum allowed: 1.` | Trace `42c90277cf86f99e7ffd6d26f00f65cf` (2026-05-22T06:29:58Z) | Tighten exploratory tester prompt or raise the `create_issue` safe-output limit if 1 is too aggressive for this workflow |
| P2 | LintMonster | 4 failures | Trace `56d87d5d7b58fc8386917605b5b35b53` (2026-05-22T03:44–03:50Z, four spans on the same trace) | Inspect that one run for repeated errors — pattern looks like one bad run, not a recurring class |
| P2 | Deployment Incident Monitor | 2 failures | `gh-aw.run.status:failure gh-aw.workflow.name:"Deployment Incident Monitor"` count=2 | Confirm whether expected for this monitor's contract; low volume |
| P1 (observability) | _all workflows_ | `span.status` is **null on 1,987/1,987 conclusion spans** | aggregate by `span.status` with `has:gh-aw.run.status` → single bucket `{null: 1987}` | OTLP `status.code` is set in `actions/setup/js/send_otlp_span.cjs:305,1837` — investigate whether Sentry's spans dataset surfaces OTLP status under a different field (e.g. `span.status_code`) or whether the OTLP exporter is stripping it; until then dashboards must rely on `gh-aw.run.status` |
| P1 (observability) | _agent conclusion spans_ | `gen_ai.response.finish_reasons` is **null on all 1,987 conclusion spans and all 4,016 `span.op:gen_ai` spans** | aggregate on `gen_ai.response.finish_reasons` for both `span.op:gen_ai` (4016 null) and `has:gh-aw.run.status` (1987 null) | Emitter at `actions/setup/js/send_otlp_span.cjs:1899-1900` claims the array is always emitted on `jobName === "agent"` conclusion spans (with `unknown`/`timeout` sentinel). Either (a) `jobName === "agent"` is never matched in the live emit path, or (b) `buildArrayAttr` array values are being dropped by the exporter / not indexed by Sentry. Length-truncation is currently undetectable — fix this before chasing further runtime issues |
| P2 (observability) | _all spans_ | `release` and `service.version` null on **19,985/19,985 spans** | aggregate by `release` and `service.version` → single null bucket each | Resource attr emitted only when `scopeVersion && scopeVersion !== "unknown"` (`send_otlp_span.cjs:321-323`); set a real version (gh-aw release tag) on the setup action's scope so release correlation works in Sentry |
| P3 | _multiple long-running agents_ | Several `gen_ai` spans run 15–22 min — Copilot Agent Prompt Clustering Analysis max 1,317,663 ms; Copilot Session Insights max 1,300,291 ms; `[aw] Failure Investigator (6h)` 40 spans avg 74 s, max 1,140,848 ms | sort by `-max(span.duration)` on `span.op:gen_ai`; example trace `9a9796b0ea3622fee6d4f9b26f8930c2` (Failure Investigator, 19 min) | Likely normal for daily agents; flagged only because `finish_reasons` is missing, so we cannot distinguish a deliberate long run from a length-truncated run |
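
The observability rows in the table rest on one pattern: aggregate spans by a single attribute and check whether everything lands in one null bucket. The sketch below reproduces that check client-side on an in-memory array. The span shape and both function names are assumptions for illustration; this is not the Sentry query API.

```javascript
// Hypothetical span shape: { op: string, attributes: { [key]: value } }.
// Groups spans by the value of one attribute. Missing, undefined, and null
// values all land in the "null" bucket, mirroring the `{null: 1987}` result.
function bucketByAttribute(spans, key) {
  const buckets = {};
  for (const span of spans) {
    const value = span.attributes?.[key];
    const bucket = value === undefined || value === null ? "null" : String(value);
    buckets[bucket] = (buckets[bucket] ?? 0) + 1;
  }
  return buckets;
}

// True when every span is missing the attribute, which points at an
// instrumentation gap rather than a runtime signal.
function isFullyNull(buckets) {
  const keys = Object.keys(buckets);
  return keys.length === 1 && keys[0] === "null";
}

// Made-up sample spans, not taken from the report.
const sampleSpans = [
  { op: "gen_ai", attributes: { "gh-aw.run.status": "success" } },
  { op: "gen_ai", attributes: { "gh-aw.run.status": "failure" } },
];
console.log(JSON.stringify(bucketByAttribute(sampleSpans, "span.status")));
console.log(isFullyNull(bucketByAttribute(sampleSpans, "gh-aw.run.status")));
```

A dashboard guard built this way could refuse to report "0 failures" on an attribute whose only bucket is null.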

### Representative Traces

<details>
<summary>View representative traces</summary>

- **Smoke Copilot — repeating safe-output validation failure** (most common failure class)
  - Trace `b6215698c10f141728d30cc4ef48fe93` — https://github.sentry.io/explore/traces/trace/b6215698c10f141728d30cc4ef48fe93 — `gh-aw.error.messages: Line 2: create_discussion 'body' is too short (minimum 64 characters)`
  - Trace `ced2ef961311a5be0aafbfab945933c0` — https://github.sentry.io/explore/traces/trace/ced2ef961311a5be0aafbfab945933c0 — `gh-aw.error.messages: Line 4: create_discussion 'body' is too short (minimum 64 characters) | Line 10: Too many items of type 'add_comment'. Maximum allowed: 2.`
  - Trace `20f72a27241e4d2f3fd6df073a612f9c` — https://github.sentry.io/explore/traces/trace/20f72a27241e4d2f3fd6df073a612f9c — `Too many items of type 'add_comment'. Maximum allowed: 2.`
- **Daily CLI Tools Exploratory Tester — `create_issue` cap exceeded**
  - Trace `42c90277cf86f99e7ffd6d26f00f65cf` — https://github.sentry.io/explore/traces/trace/42c90277cf86f99e7ffd6d26f00f65cf — `gh-aw.error.messages: Line 2: Too many items of type 'create_issue'. Maximum allowed: 1.`
- **LintMonster — clustered failures on a single trace**
  - Trace `56d87d5d7b58fc8386917605b5b35b53` — https://github.sentry.io/explore/traces/trace/56d87d5d7b58fc8386917605b5b35b53
- **Longest gen_ai span — runtime outcome unknown** (`finish_reasons` null so cannot tell if completed or length-truncated)
  - Trace `9a9796b0ea3622fee6d4f9b26f8930c2` — `[aw] Failure Investigator (6h)`, single gen_ai span 1,140,848 ms (~19 min)
  - Trace from Copilot Agent Prompt Clustering Analysis — max 1,317,663 ms (~22 min)

</details>

### Recommendations

1. **Fix the Smoke Copilot safe-output prompt contract first** (smallest fix, largest noise reduction). Twenty of the 30 failures are the same two validation messages — either lengthen the discussion body template in the workflow to ≥ 64 chars or relax the rule, and constrain the agent to ≤ 2 `add_comment` items. This alone drops the daily failure count from 30 to ~10.
2. **Restore `gen_ai.response.finish_reasons` on agent conclusion spans.** The emitter contract at `actions/setup/js/send_otlp_span.cjs:1899-1900` says these are always emitted, but Sentry sees null for 100% of conclusion spans. Add a unit-level assertion that `attributes` contains the key when `jobName === "agent"`, and verify the OTLP exporter is not dropping `buildArrayAttr` payloads. Without this we cannot detect length-truncation or distinguish timeouts from clean exits.
3. **Populate `release` / `service.version` on the resource scope.** `send_otlp_span.cjs:321-323` emits the attribute only when `scopeVersion` is set and is not `"unknown"`, so an empty or `"unknown"` version is silently skipped. Pass the gh-aw action version (or commit SHA) into setup so release correlation works; otherwise we cannot bisect a regression by deploy.
4. **Decide whether to keep relying on `gh-aw.run.status` for failure queries.** Sentry's `span.status` is null for all conclusion spans even though OTLP `status.code` is set in the emit payload. Either confirm that Sentry's spans dataset uses a different field (e.g. `span.status_code`), or document `gh-aw.run.status` as the canonical failure attribute and stop documenting OTLP `status.code` as queryable.

### Notes

<details>
<summary>View notes</summary>

- The Sentry MCP build used here exposes `list_events` but not `search_events` or `get_trace_details`; trace continuity was verified by filtering `list_events` on `trace:<id>` and ordering by timestamp.
- The `errors` and `logs` datasets returning zero events is an **explicit observability finding**, not a skipped check — either no SDK is configured to send errors/logs for this project, or no error/log events were emitted in the window.
- `gh-aw.run.status:timeout` and `gh-aw.run.status:cancelled` both returned 0 events. The emitter at `send_otlp_span.cjs:1820-1830` distinguishes these states; the absence is meaningful (no runs timed out or were cancelled in the window), not a missing attribute.
- `gen_ai.response.finish_reasons:length` could not be used to query for truncation because the attribute is null everywhere. The result is an **inconclusive runtime outcome plus a confirmed instrumentation gap**, not a confirmed clean run.
- All latency outliers cited include count, max, and trace ID per the evidence-first contract; one-off long runs are flagged but not promoted to P1 because they cluster on long-by-design daily agents.

</details>
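
The trace-continuity check described in the notes (filter on one trace ID, then order by timestamp) can be sketched as below. The event shape and function names are assumptions for illustration, and the sample data is made up; the sketch operates on a plain array and does not call the Sentry MCP.

```javascript
// Hypothetical event shape: { trace: string, timestamp: ISO 8601 string }.
// Returns the events of one trace, oldest first, without mutating the input.
function eventsForTrace(events, traceId) {
  return events
    .filter((e) => e.trace === traceId)
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
}

// Largest gap in milliseconds between consecutive events of an ordered trace;
// 0 when there are fewer than two events.
function maxGapMs(ordered) {
  let max = 0;
  for (let i = 1; i < ordered.length; i++) {
    const gap = Date.parse(ordered[i].timestamp) - Date.parse(ordered[i - 1].timestamp);
    if (gap > max) max = gap;
  }
  return max;
}

// Made-up sample data, not taken from the report.
const sampleEvents = [
  { trace: "trace-a", timestamp: "2026-01-01T00:05:00Z" },
  { trace: "trace-b", timestamp: "2026-01-01T00:01:00Z" },
  { trace: "trace-a", timestamp: "2026-01-01T00:00:00Z" },
  { trace: "trace-a", timestamp: "2026-01-01T00:02:00Z" },
];
const ordered = eventsForTrace(sampleEvents, "trace-a");
console.log(ordered.map((e) => e.timestamp));
console.log(maxGapMs(ordered)); // 180000
```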

**References:**
- [§26316416282](https://github.com/github/gh-aw/actions/runs/26316416282)

> Generated by [🚨 Daily Reliability Review](https://github.com/github/gh-aw/actions/runs/26316416282) · ● 9.2M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-reliability-review%22&type=issues)
> - [x] expires on May 24, 2026, 11:21 PM UTC
