Skip to content

Add pre-publish H5 validation and improve pipeline observability#679

Open
baogorek wants to merge 2 commits intomainfrom
harden/pre-publish-h5-validation
Open

Add pre-publish H5 validation and improve pipeline observability#679
baogorek wants to merge 2 commits intomainfrom
harden/pre-publish-h5-validation

Conversation

@baogorek
Copy link
Copy Markdown
Collaborator

@baogorek baogorek commented Apr 1, 2026

Summary

Addresses three systemic gaps identified in #677 after the dimension-mismatch incident (#675) and the fail-closed fix in #676.

  • Pre-publish H5 validation — New validate_h5.py utility checks that every variable's array length matches its entity's ID array, verifies household_weight exists and is non-zero, and confirms a reasonable household count. Integrated into worker_script.py so malformed H5 files are rejected before being marked as completed — if validation fails, the existing except block records the failure and the file is never uploaded.
  • Pipeline run ID in GH Actionspipeline.yaml now prints the Modal fc.object_id via ::notice annotations, making it trivial to correlate a GH Actions run to the Modal dashboard without guessing timestamps.
  • Versioning workflow resilienceversioning.yaml now uses continue-on-error on the checkout step and a follow-up step that emits a clear ::error annotation linking to Harden artifact compatibility checks between policyengine-us-data and policyengine-us #677 when the POLICYENGINE_GITHUB PAT is expired or missing, replacing the previous cryptic git-auth failure.

Test plan

  • ruff format --check . && ruff check . passes
  • pytest policyengine_us_data/tests/test_validate_h5.py — 6 new tests pass (dimensions match, person variable wrong length, missing household_weight, all-zero weights, plus validate_h5_or_raise pass/raise variants)
  • Review pipeline.yaml diff: ::notice annotation syntax is correct
  • Review versioning.yaml diff: continue-on-error + error step logic is sound
  • Optionally run python -m policyengine_us_data.utils.validate_h5 <local_h5> to verify CLI mode

Closes #677

🤖 Generated with Claude Code

baogorek and others added 2 commits April 1, 2026 16:34
Closes #677 (partial — items 2, 3, 5 from the issue).

- New validate_h5.py utility that checks entity dimension consistency,
  household_weight existence, and zero-weight sanity before upload.
- Integrate validation into worker_script.py so malformed H5 files are
  caught before being marked as completed.
- Surface Modal call ID in pipeline.yaml via ::notice annotations for
  GH Actions → Modal correlation.
- Add continue-on-error + clear ::error annotation to versioning.yaml
  so a broken PAT produces a human-readable failure instead of a
  cryptic git-auth error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline-built files (build_h5) use variable/period nesting while
storage files use flat top-level datasets. Auto-detect layout in
_read_array(). Tests now cover both formats.

Verified against real files: SC.h5 (nested) and extended_cps_2024.h5 (flat).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@baogorek baogorek requested review from MaxGhenis and hua7450 April 1, 2026 21:38
@baogorek
Copy link
Copy Markdown
Collaborator Author

baogorek commented Apr 1, 2026

CI note: The Test / test failure is pre-existing and unrelated to this PR. It's the same version-skew bug class that #675/#676 addressed:

ValueError: Cannot downsample cps_2021: found 9 dataset variables missing from the current
country package (count_under_18, count_under_6, hourly_wage, is_union_member_or_covered,
other_type_retirement_account_distributions, ...).
This usually means policyengine-us-data and policyengine-us are out of sync.

The fail-closed guard from #676 is working as designed — it's catching that the CI environment's policyengine-us doesn't know about variables present in the stored cps_2021.h5/cps_2024.h5 datasets. The puf.py build calls CPS_2021().downsample() which now rejects the skew instead of silently passing.

This same Test / test failure also appears on #676's branch (run 23861700691). It needs to be fixed on main, not here.

@baogorek
Copy link
Copy Markdown
Collaborator Author

baogorek commented Apr 1, 2026

Keeping this open — complementary to #678.

#678 (Max's PR) adds upload-gate contract validation, version tracking, and Microsimulation smoke tests. This PR adds:

  • Early-gate validation in worker_script.py (catches issues at build time, not just upload time)
  • Pipeline observability in pipeline.yaml (Modal call ID in GH Actions annotations)
  • Versioning resilience in versioning.yaml (clear error on PAT expiry)

No file overlap between the two PRs. Plan: let #678 merge first, then rebase this on top. The validate_h5.py utility here also handles the nested variable/period H5 layout that #678's _top_level_dataset_lengths() currently misses — we may want to keep that or port the fix depending on how #678 evolves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Harden artifact compatibility checks between policyengine-us-data and policyengine-us

1 participant