perf(ci): build.yml optimization plan — parallelism, caching, and redundancy removal #201

@scottschreckengaust

Problem

The CI build (build.yml) averages ~14.5 minutes wall time. All steps run sequentially in a single job despite many being independent. No artifact caching is used beyond mise tool binaries.

Measured step timings (5 recent successful runs)

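| Step | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg |
|---|---|---|---|---|---|---|
| Free Disk Space | 66s | 79s | 23s | 35s | 35s | 48s |
| Install mise | 3s | 13s | 8s | 3s | 3s | 6s |
| Setup Node.js | 4s | 4s | 5s | 4s | 4s | 4s |
| Install dependencies | 77s | 78s | 66s | 76s | 76s | 75s |
| build | 635s | 761s | 751s | 738s | 738s | 725s |
| Upload artifact | 15s | 15s | 15s | 14s | 14s | 15s |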

The build step (avg 725s / 12min) is the overwhelming bottleneck.

What runs inside mise run build (all sequential)

1. //agent:quality          (~30-40s: ruff lint, ruff format, ty typecheck, pytest)
2. //cdk:build              (~500-600s total)
   ├── :compile             (~30-40s: tsc --build)
   ├── :test                (~90-370s: jest, 99 suites, 1789 tests)
   ├── :eslint              (~30-60s: eslint --fix)
   └── :synth:quiet         (~120-180s: cdk synth with esbuild bundling per Lambda)
3. //cli:build              (~20-30s: compile + test + eslint)
4. //docs:build             (~30-60s: sync-starlight + astro build)
5. //docs:sync              (~5s: DUPLICATE — already runs as dep of //docs:build)

Dependency graph (what actually depends on what)

              ┌── agent:quality ──────────────┐
              ├── cdk:eslint ─────────────────┤
  install ────┼── cdk:test ───────────────────┼── (all pass) ── upload
              ├── cli:build ──────────────────┤
              ├── docs:build ─────────────────┤
              └── cdk:compile → cdk:synth ────┘

Only cdk:synth depends on cdk:compile. Everything else is independent.

Optimization priorities (ranked by impact)

P0: Cache node_modules and .venv (~60-70s saved)

- uses: actions/cache@v4
  with:
    path: |
      node_modules
      agent/.venv
    key: deps-${{ runner.os }}-${{ hashFiles('yarn.lock', 'agent/uv.lock') }}
    restore-keys: deps-${{ runner.os }}-

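On an exact key hit this turns the 75s cold install into a ~5s cache restore. The first run after a lockfile change misses the exact key; restore-keys then restores the most recent deps cache as a starting point, and the install still has to run to bring it up to date.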
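
The restore only saves time if the install step is skipped on an exact hit. A minimal sketch of that wiring, assuming the cache step is given the id `deps-cache` (a name chosen here for illustration) and that the install commands are `yarn install --frozen-lockfile` and `uv sync` (placeholders; substitute whatever the existing "Install dependencies" step runs):

```yaml
steps:
  - uses: actions/cache@v4
    id: deps-cache                     # hypothetical step id
    with:
      path: |
        node_modules
        agent/.venv
      key: deps-${{ runner.os }}-${{ hashFiles('yarn.lock', 'agent/uv.lock') }}
      restore-keys: deps-${{ runner.os }}-

  # cache-hit is 'true' only when the exact key matched, so a fallback
  # restore through restore-keys still triggers the install.
  - name: Install dependencies
    if: steps.deps-cache.outputs.cache-hit != 'true'
    run: |
      yarn install --frozen-lockfile   # placeholder for the real install command
      (cd agent && uv sync)            # placeholder for the real install command
```

Comparing against `'true'` rather than checking for a non-empty value matters because the output is `'false'` on a restore-keys fallback and empty on a full miss.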

P1: Parallelize independent jobs (~5-7min wall time saved)

Split the single build job into parallel GHA jobs:

jobs:
  install:
    # checkout + install + cache
  
  agent-quality:
    needs: install
    # ruff, ty, pytest
  
  cdk-compile-synth:
    needs: install
    # tsc → cdk synth → upload artifact
  
  cdk-test:
    needs: install
    # jest (the heaviest step)
  
  cdk-eslint:
    needs: install
    # eslint
  
  cli-build:
    needs: install
    # compile + test + eslint
  
  docs-build:
    needs: install
    # sync + astro build

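Critical path drops from ~14.5min to: install (5s cached) → cdk:compile (40s) → cdk:synth (150s) → upload (15s) = ~3.5min. This figure excludes per-job setup: on GitHub-hosted runners each job starts on a fresh machine, and the cdk-compile-synth job still needs the Free Disk Space step (avg 48s), which is why the projected wall time is ~4-5min rather than 3.5min. It also assumes cdk:test stays near ~90s; at the upper end of its measured range (~370s) the test job would become the critical path instead.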
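
Because jobs on GitHub-hosted runners do not share a filesystem, `needs: install` only orders the jobs. Each downstream job still has to check out the repo and restore the dependency cache itself. A rough sketch for one job, assuming the P0 cache key, `ubuntu-latest` runners, `jdx/mise-action` for the "Install mise" step, and a `//cdk:test` task path (all assumptions about the existing workflow):

```yaml
jobs:
  cdk-test:
    needs: install
    runs-on: ubuntu-latest             # assumption: same runner image as today
    steps:
      # Checkout must come first so hashFiles() can see the lockfiles.
      - uses: actions/checkout@v4
      - uses: jdx/mise-action@v2       # assumption: how "Install mise" is done
      - uses: actions/cache/restore@v4
        with:
          path: |
            node_modules
            agent/.venv
          key: deps-${{ runner.os }}-${{ hashFiles('yarn.lock', 'agent/uv.lock') }}
          # The install job guarantees this key exists, so a miss is an error.
          fail-on-cache-miss: true
      - run: mise run //cdk:test       # assumption: task path under //cdk:build
```

Using the restore-only action in downstream jobs means only the install job ever saves the dependency cache.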

P2: Cache Jest transform output (~30-60s saved on test step)

Add cacheDirectory to jest config:

"cacheDirectory": "<rootDir>/.jest-cache"

Then in CI:

- uses: actions/cache@v4
  with:
    path: cdk/.jest-cache
    key: jest-${{ runner.os }}-${{ hashFiles('cdk/yarn.lock') }}-${{ github.sha }}
    restore-keys: |
      jest-${{ runner.os }}-${{ hashFiles('cdk/yarn.lock') }}-
      jest-${{ runner.os }}-

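Jest keys its transform cache by file content hash, so unchanged files hit the cache regardless of which commit saved it. GitHub Actions restricts cache access by branch, though: a run can restore caches saved on its own branch, its base branch, or the default branch, but not caches saved on unrelated branches. Cross-branch reuse therefore depends on the default branch keeping a warm cache.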

P3: Cache TypeScript incremental build (~10-20s saved)

- uses: actions/cache@v4
  with:
    path: |
      cdk/tsconfig.tsbuildinfo
      cli/tsconfig.tsbuildinfo
    key: tsc-${{ runner.os }}-${{ hashFiles('cdk/src/**', 'cli/src/**') }}
    restore-keys: tsc-${{ runner.os }}-

P4: Remove duplicate //docs:sync (~5s saved, trivial)

mise.toml root tasks.build calls //docs:sync explicitly after //docs:build, but //docs:build already depends on :sync. Remove the duplicate.

P5 (future): Jest sharding for test parallelism

Only needed if tests remain the critical path after P0-P2. With our beforeAll optimization (PR #195) tests are already ~90s. Sharding would bring that to ~30-40s but adds matrix complexity.
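
If sharding does become necessary, the matrix could look roughly like the sketch below. The shard count of 3 and the `cdk` working directory are assumptions, and the `--shard` flag requires Jest 28 or newer:

```yaml
jobs:
  cdk-test:
    needs: install
    runs-on: ubuntu-latest             # assumption: same runner image as today
    strategy:
      fail-fast: false                 # let every shard report its failures
      matrix:
        shard: [1, 2, 3]               # assumption: three shards
    steps:
      - uses: actions/checkout@v4
      # ... restore the dependency cache here ...
      - name: Run Jest shard ${{ matrix.shard }} of 3
        working-directory: cdk
        run: npx jest --shard=${{ matrix.shard }}/3
```

Each shard is billed as its own job, so this trades billed minutes for wall time in the same way P1 does.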

Projected improvement

| Scenario | Current | Cache only (P0+P2+P3) | Full parallel + cache (P0-P4) |
|---|---|---|---|
| CI wall time | ~14.5min | ~10-11min | ~4-5min |
| Critical path | 14.5min (serial) | 10-11min (serial) | compile→synth→upload (~3.5min) |
| Billed minutes | ~14.5 | ~10-11 | ~20 (more jobs but each shorter) |

Acceptance criteria

  • P0: node_modules + .venv cached via actions/cache
  • P1: Build split into parallel jobs with proper needs: graph
  • P2: Jest cacheDirectory set + cached in CI
  • P3: TSC .tsbuildinfo cached
  • P4: Duplicate //docs:sync removed
  • Verify: mutation detection still works (self_mutation check)
  • Verify: cdk.out/ artifact upload still produces correct output
  • Total CI wall time < 6 minutes for cache-warm runs

Notes

  • P1 (parallelism) requires careful handling of the self_mutation / patch detection. It currently runs git diff --staged at the end of the single job; with multiple jobs, each job could mutate files independently (e.g., eslint --fix, docs sync). A final "check mutations" job is needed that either re-runs the check or collects patches from each job.
  • The compute_type matrix currently only has [agentcore]. If more types are added later, each gets its own parallel run — this architecture scales well.
  • Free Disk Space step (avg 48s) is required because CDK synth + Docker image bundling can exhaust the default runner disk. With parallelism, only the cdk-compile-synth job needs this step.
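
One way the "collect patches" variant of the mutation check could be wired up is sketched below. The artifact naming scheme (`mutation-patch-*`), the `check-mutations` job name, and the runner image are hypothetical, and the sketch assumes build outputs and caches are gitignored so that `git add -A` only picks up genuine source mutations:

```yaml
jobs:
  cdk-eslint:
    needs: install
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... restore dependencies, run eslint --fix ...
      - name: Capture any file mutations as a patch
        run: |
          git add -A
          # Only write a patch file when something actually changed.
          if ! git diff --staged --quiet; then
            git diff --staged --binary > "$RUNNER_TEMP/cdk-eslint.patch"
          fi
      - uses: actions/upload-artifact@v4
        with:
          name: mutation-patch-cdk-eslint     # hypothetical naming scheme
          path: ${{ runner.temp }}/cdk-eslint.patch
          if-no-files-found: ignore           # a clean job uploads nothing

  check-mutations:
    needs: [agent-quality, cdk-compile-synth, cdk-test, cdk-eslint, cli-build, docs-build]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: mutation-patch-*
          path: patches
          merge-multiple: true
      - name: Fail if any job produced a patch
        run: |
          mkdir -p patches                    # no artifacts means no directory
          dirty=$(find patches -type f)
          if [ -n "$dirty" ]; then
            echo "Uncommitted changes produced by:"
            echo "$dirty"
            exit 1
          fi
```

The same capture and upload steps would be repeated in every job that can rewrite tracked files, with the patch name changed per job.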
