[aw-failures] [aw] Failure Report 2026-05-11 (6h): aw-portfolio-yield workflow defect blocks compile + own runtime + recurrences [Content truncated due to length] #31455

@github-actions

Executive summary

New P0 today: PR #31363 (merged 2026-05-11 ~05:35 UTC) introduced .github/workflows/aw-portfolio-yield.md whose imported shared/otel-observability.md carries a placeholder npm package and an empty OTel endpoint. Two distinct downstream failures result:

  • Agentic Maintenance compile-workflows now fails on every scheduled run because the npx package '@your-org/otel-query-mcp' is not found on the npm registry. This took the maintenance workflow from green (last success §25653034240 at 05:58 UTC) to two consecutive failures.
  • Agentic Workflow Portfolio Yield itself can't start: MCP Gateway v0.3.6 rejects the config with gateway/opentelemetry/endpoint length must be ≥ 1 (the OTLP_ENDPOINT secret is empty/unset in this repo).

One P1 recurrence — Step Name Alignment hit max_turns=30 again with the same /tmp/gh-aw/cache-memory/* Bash denial signature as #31178.

One P2 mixed — Smoke Claude PR run saw safe_outputs fail on resolve_pull_request_review_thread (Resource not accessible by integration) and also reported missing_tool: playwright-cli.

Failure clusters

| # | Workflow | Run | Engine | Conclusion | Cluster | Existing tracking |
|---|----------|-----|--------|------------|---------|-------------------|
| 1 | Agentic Maintenance | §25657460521 | n/a (compile) | failure (compile-workflows, 1.4m) | aw-portfolio-yield placeholder npm package | none (new) |
| 2 | Agentic Workflow Portfolio Yield | §25654663141 | copilot | failure (mcp-gateway startup, 40s) | aw-portfolio-yield empty OTel endpoint | #31439 (auto) |
| 3 | Step Name Alignment | §25651479635 | claude | failure (error_max_turns, 3.5m, 31 turns, $1.34) | Bash permission denials on /tmp/gh-aw/cache-memory/* | #31427 (auto), parent #31178 |
| 4 | Smoke Claude | §25649467832 | claude | failure (safe_outputs, 27s) | resolve_pull_request_review_thread perms + missing_tool: playwright-cli | #31410 (auto) |

Evidence

Cluster 1 — Agentic Maintenance compile failure (aw-portfolio-yield npm package)

audit of §25657460521: 4 of 11 jobs ran, compile-workflows failed at the Compile workflows step:

✓ Successfully compiled 218 out of 219 workflow files
✗ Compiled 219 workflow(s): 1 error(s), 52 warning(s)
✗ Failed workflows:
  ✗ aw-portfolio-yield.md

.github/workflows/aw-portfolio-yield.md:1:1: error: runtime package validation failed:
  Validation failed for field 'runtime.packages'
  Value: 1 package validation errors
  Reason: runtime package validation failed
  Suggestion: Fix the following package issues:
    Validation failed for field 'npx.packages'
    Value: 1 packages not found
    Reason: npx packages not found on npm registry
    Suggestion: Fix package names or verify they exist on npm:
      npx package '@your-org/otel-query-mcp' not found on npm registry: npm error code E404
##[error]Process completed with exit code 1.

The placeholder package name comes from .github/workflows/shared/otel-observability.md:

mcp-servers:
  otel:
    command: npx
    args: ["@your-org/otel-query-mcp"]
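
A name-based lint could catch this kind of template residue before merge. The sketch below is an illustration only: `PLACEHOLDER_SCOPES` and `find_placeholder_packages` are hypothetical names, and the real compiler validates packages against the npm registry rather than by name pattern.

```python
# Hypothetical list of scopes that usually indicate an unreplaced template value.
# This list is an assumption for illustration; it is not taken from gh-aw.
PLACEHOLDER_SCOPES = ("@your-org/", "@example/", "@my-org/")


def find_placeholder_packages(args):
    """Return the npx arguments that look like unreplaced template package names."""
    return [arg for arg in args if arg.startswith(PLACEHOLDER_SCOPES)]


# The args list from shared/otel-observability.md is flagged.
print(find_placeholder_packages(["@your-org/otel-query-mcp"]))
```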

audit-diff vs the most recent successful Agentic Maintenance run §25653034240 (2026-05-11 05:58 UTC, 5 minutes before #31363 was pushed):

  • Firewall: 0 new domains, 0 status changes, 0 anomalies.
  • The compile step is a deterministic Go binary; the regression is entirely on the workflow source side.

Cluster 2 — Agentic Workflow Portfolio Yield MCP Gateway startup failure (empty OTel endpoint)

audit of §25654663141: activation succeeded (23s), agent job failed at MCP Gateway init (40s):

[error] ERROR: Gateway process (PID: 4114) exited during initialization
config:validation_schema Schema validation failed:
  jsonschema: '/gateway/opentelemetry/endpoint' does not validate with
  https://docs.github.com/gh-aw/schemas/mcp-gateway-config.schema.json#/.../endpoint/minLength:
  length must be >= 1, but got 0
failed to load config: Configuration validation error (MCP Gateway version: v0.3.6):
  Error: length must be >= 1, but got 0
  Error: does not match pattern '^((redacted)+|\${[A-Za-z_][A-Za-z0-9_]*})$'
##[error]Process completed with exit code 1.

The ${{ secrets.OTLP_ENDPOINT }} interpolation in shared/otel-observability.md resolves to an empty string when the secret is unset — and the MCP gateway schema enforces minLength: 1 on gateway/opentelemetry/endpoint, so the gateway refuses to start.
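
The failure mode can be reproduced in isolation. The sketch below mimics only the `minLength: 1` rule quoted in the log; the function name is hypothetical, and the gateway's pattern rule is not reproduced because part of it is redacted in the log.

```python
def endpoint_errors(endpoint: str) -> list[str]:
    """Apply only the minLength: 1 rule reported for /gateway/opentelemetry/endpoint."""
    errors = []
    if len(endpoint) < 1:
        errors.append(f"length must be >= 1, but got {len(endpoint)}")
    return errors


# An unset secret interpolates to the empty string, which trips the rule.
print(endpoint_errors(""))
```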

Cluster 3 — Step Name Alignment max-turns (recurrence of #31178)

audit of §25651479635: 31 turns / 1.8M tokens / $1.34, terminated error_max_turns. Permission denial loop from agent-stdio.log:

ls in '/tmp/gh-aw/cache-memory' was blocked. For security, Claude Code may only list
  files in the allowed working directories for this session: '/home/runner/work/gh-aw/gh-aw'.
find in '/tmp/gh-aw/cache-memory' was blocked. ...
cat in '/tmp/gh-aw/cache-memory/step-name-alignment.json' was blocked. ...
mkdir in '/tmp/gh-aw/agent' was blocked. ...

The workflow's --allowed-tools list includes Bash(ls), Bash(cat), Bash(cat /tmp/gh-aw/cache-memory/), Bash(mkdir -p /tmp/gh-aw/cache-memory/), etc., but Claude Code's working-directory restriction (/home/runner/work/gh-aw/gh-aw only) overrides those prefix allows for paths under /tmp. This is exactly the same signature as the 3 prior occurrences tracked in #31178.
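
The containment rule implied by the denial messages can be sketched as follows. This is an assumption about the shape of the check based on the log text, not Claude Code's actual implementation; `ALLOWED_DIRS` and `inside_allowed_dirs` are hypothetical names.

```python
from pathlib import PurePosixPath

# The only allowed working directory named in the denial message.
ALLOWED_DIRS = [PurePosixPath("/home/runner/work/gh-aw/gh-aw")]


def inside_allowed_dirs(path: str) -> bool:
    """True if path is an allowed directory or lies somewhere beneath one."""
    p = PurePosixPath(path)
    return any(p == d or d in p.parents for d in ALLOWED_DIRS)


# The cache-memory path is outside the working directory, so a tool-level
# allow such as Bash(cat /tmp/gh-aw/cache-memory/) never gets a chance to apply.
print(inside_allowed_dirs("/tmp/gh-aw/cache-memory/step-name-alignment.json"))
```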

audit-diff vs baseline §25620382907 (success on 2026-05-10): turns went from 0 → 31, classification changed, reason_code turns_increase.

Cluster 4 — Smoke Claude safe_outputs failure

audit of §25649467832: agent succeeded (7.3m), safe_outputs job failed (27s). Two distinct errors in 3_safe_outputs.txt:

##[error]Failed to resolve review thread: Request failed due to following response errors:
 - Resource not accessible by integration
##[error]✗ Message 7 (resolve_pull_request_review_thread) failed:
 - Resource not accessible by integration

✓ Recorded missing tool: playwright-cli
   Reason: playwright-cli is not mounted on PATH and not present in
   /home/runner/work/_temp/gh-aw/mcp-cli/manifest.json;
   cannot run browser_navigate/browser_snapshot
   Alternatives: Add playwright MCP server to the workflow's mcp-cli mounts or remove the test step
##[error]1 safe output(s) failed

This was triggered from PR #31398. The GH_AW token used by safe_outputs lacks GraphQL resolveReviewThread permission on the PR. missing_tool is reported as a failure because the workflow sets missing-tool-report-as-failure: true.

Excluded: daily-fact.lock.yml push-triggered CI noise

The runs list shows ~25 failures on the workflow registered as .github/workflows/daily-fact.lock.yml (workflow id 210263564). All failures are event=push on copilot/investigate-failing-agent-step and copilot/fix-agent-job-failure branches — these are noise from Copilot agents iterating in their working branches. The workflow's on: only has schedule + workflow_dispatch, so these runs do not represent operational failures of the deployed workflow on main. Not tracked.
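
The exclusion rule amounts to keeping only failures whose triggering event is one the workflow actually declares. A minimal sketch, assuming run records shaped like the `event`, `conclusion`, and `head_branch` fields of GitHub's workflow-run objects; the sample records are invented for illustration.

```python
# Triggers declared in the workflow's on: block, per the paragraph above.
DECLARED_TRIGGERS = {"schedule", "workflow_dispatch"}


def operational_failures(runs):
    """Keep failed runs that were started by a declared trigger."""
    return [
        run for run in runs
        if run["conclusion"] == "failure" and run["event"] in DECLARED_TRIGGERS
    ]


# Invented sample records, not taken from the actual runs list.
sample = [
    {"event": "push", "conclusion": "failure", "head_branch": "copilot/fix-agent-job-failure"},
    {"event": "schedule", "conclusion": "success", "head_branch": "main"},
]
print(operational_failures(sample))
```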

Existing issue correlation

| Existing issue | State | Today's evidence | Recommendation |
|----------------|-------|------------------|----------------|
| #31439 Agentic Workflow Portfolio Yield failed (auto) | open, expires 2026-05-11 18:46 UTC | Cluster 2 evidence | Link as sub-issue here; root cause sub-issue created |
| #31427 Step Name Alignment failed (auto) | open | Cluster 3: same signature as #31178 | Link as sub-issue here; ultimate fix belongs in #31178 |
| #31410 Smoke Claude failed (auto) | open | Cluster 4 evidence | Link as sub-issue here |
| #31452 Auto-Triage Issues failed (auto) | open | Run §25657073409 reports success; same "no safe outputs" false-positive class as #31309 | No action; auto-expires |
| #31402 Daily Firewall Logs failed (auto) | open | Run §25648585256 reports success; same false-positive class as #31287 | No action; auto-expires |
| #31400 AI Moderator failed (auto) | open | Run §25645846660 reports success; false-positive class | No action; auto-expires |
| #31392 Daily Observability Report failed (auto) | open | Run §25643399750 reports success; false-positive class (outside 6h window) | No action; auto-expires |
| #31178 Claude max-turns / Bash denials | open | Cluster 3 is a fresh 4th recurrence | Keep open; affected workflows list should include Step Name Alignment |
| #31221 workflows out of sync | open | Not observed today | No action |
| #31314 prior parent report (2026-05-10) | open, expires 2026-05-17 | Yesterday's clusters not seen today | No action; superseded by this report |

Proposed fix roadmap

P0 (new, this window)

  • Fix shared/otel-observability.md so that it no longer references the placeholder npm package and tolerates an unset OTel endpoint. Replace @your-org/otel-query-mcp with the real MCP server package name (or remove the otel MCP server entirely until the package is published). Then either gate the gateway.opentelemetry.endpoint config block on OTLP_ENDPOINT being non-empty, or change the schema so that an empty endpoint means disabled. Success criteria: gh aw compile .github/workflows/aw-portfolio-yield.md returns 0 errors, and a fresh Agentic Workflow Portfolio Yield workflow_dispatch run reaches the agent step without an MCP Gateway startup failure.
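
The gating option could look roughly like the sketch below. It only illustrates the idea of omitting the block when the endpoint is empty: the key names mirror the JSON pointer in the gateway log (/gateway/opentelemetry/endpoint), while `gateway_config` and the rest of the structure are hypothetical and not the gh-aw config generator.

```python
def gateway_config(otlp_endpoint):
    """Build a gateway config dict, omitting the OTel block when no endpoint is set."""
    config = {"gateway": {}}
    endpoint = (otlp_endpoint or "").strip()
    if endpoint:
        # Emit the block only when there is something to validate.
        config["gateway"]["opentelemetry"] = {"endpoint": endpoint}
    return config


# With the secret unset, the block is absent and minLength: 1 is never evaluated.
print(gateway_config(""))
```

Omitting the block keeps the schema strict for real endpoints, whereas relaxing minLength would let an accidental empty value through silently.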

P1 (recurrence)

P2

  • Smoke Claude ([aw] Smoke Claude failed #31410): drop resolve_pull_request_review_thread from the smoke test, or grant the GH_AW safe_outputs token pull-requests:write so the GraphQL resolveReviewThread mutation is accessible. Mount the playwright MCP server (or drop the browser_navigate/browser_snapshot test step) to clear the missing_tool warning that is being treated as a failure.

Sub-issues linked

References

  • §25657460521 — Agentic Maintenance (failed, compile-workflows)
  • §25654663141 — Agentic Workflow Portfolio Yield (failed, MCP gateway startup)
  • §25651479635 — Step Name Alignment (failed, max-turns)
  • §25649467832 — Smoke Claude (safe_outputs failed)
  • §25653034240 — Agentic Maintenance (last green, used as audit-diff baseline)

Related to #30961

Generated by [aw] Failure Investigator (6h) · ● 39.5M ·

  • expires on May 18, 2026, 8:17 AM UTC
