Add pre-publish H5 validation and improve pipeline observability #679
Closes #677 (partial: items 2, 3, 5 from the issue).

- New `validate_h5.py` utility that checks entity dimension consistency, `household_weight` existence, and zero-weight sanity before upload.
- Integrate validation into `worker_script.py` so malformed H5 files are caught before being marked as completed.
- Surface Modal call ID in `pipeline.yaml` via `::notice` annotations for GH Actions → Modal correlation.
- Add `continue-on-error` + clear `::error` annotation to `versioning.yaml` so a broken PAT produces a human-readable failure instead of a cryptic git-auth error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
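To make the three checks in this commit concrete, here is a minimal sketch of what such a validation could look like. It is not the code in `validate_h5.py`: it works on plain in-memory sequences rather than an open H5 file, and the names `check_arrays`, `check_arrays_or_raise`, `variable_entity`, and `entity_id_name` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Mapping, Sequence


def check_arrays(
    arrays: Mapping[str, Sequence[float]],
    variable_entity: Mapping[str, str],
    entity_id_name: Mapping[str, str],
) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    errors: List[str] = []

    # Check 1: each known variable must be as long as its entity's ID array.
    for variable, values in arrays.items():
        entity = variable_entity.get(variable)
        if entity is None:
            continue  # assumption: variables with no known entity are skipped
        id_name = entity_id_name[entity]
        if id_name not in arrays:
            errors.append(f"{id_name} is missing, cannot check {variable}")
            continue
        expected = len(arrays[id_name])
        if len(values) != expected:
            errors.append(
                f"{variable}: length {len(values)} != {expected} ({entity})"
            )

    # Checks 2 and 3: household_weight must exist and must not be all zeros.
    weights = arrays.get("household_weight")
    if weights is None:
        errors.append("household_weight is missing")
    elif not any(w != 0 for w in weights):
        errors.append("household_weight is empty or all zeros")

    return errors


def check_arrays_or_raise(
    arrays: Mapping[str, Sequence[float]],
    variable_entity: Mapping[str, str],
    entity_id_name: Mapping[str, str],
) -> None:
    """Raise ValueError so a caller's existing except block can record it."""
    errors = check_arrays(arrays, variable_entity, entity_id_name)
    if errors:
        raise ValueError("H5 validation failed: " + "; ".join(errors))


# Hypothetical example data, not taken from any real file.
ENTITY_IDS: Dict[str, str] = {"person": "person_id", "household": "household_id"}
VARIABLES: Dict[str, str] = {"age": "person", "household_weight": "household"}
```

Returning a list of messages and wrapping it in a raising variant keeps the checks easy to unit test while still letting a worker fail the file through its normal exception path.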
Pipeline-built files (`build_h5`) use variable/period nesting while storage files use flat top-level datasets. Auto-detect layout in `_read_array()`. Tests now cover both formats. Verified against real files: `SC.h5` (nested) and `extended_cps_2024.h5` (flat).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
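The layout auto-detection described in this commit can be sketched roughly as follows. This is an illustration, not the actual `_read_array()`: it uses nested dictionaries as a stand-in for an H5 file, and the function name `read_array` and its `period` parameter are assumptions. The check is duck-typed on `keys()`; group-like objects in an H5 library would probably satisfy it too, but that is not verified here.

```python
from typing import Any, Mapping, Optional


def read_array(
    store: Mapping[str, Any], variable: str, period: Optional[str] = None
) -> Any:
    """Read one variable from either a flat or a nested layout.

    Flat layout:   store[variable] -> values
    Nested layout: store[variable][period] -> values
    """
    node = store[variable]

    # A node that exposes keys() is treated as a group, i.e. the nested layout.
    if hasattr(node, "keys"):
        periods = list(node.keys())
        if period is None:
            # Without an explicit period, only an unambiguous group is accepted.
            if len(periods) != 1:
                raise ValueError(
                    f"{variable} has periods {periods}; specify one explicitly"
                )
            period = periods[0]
        return node[str(period)]

    # Otherwise the node is the dataset itself, i.e. the flat layout.
    return node


# Hypothetical stand-ins for the two kinds of file.
flat_store = {"household_weight": [1.0, 2.0]}
nested_store = {"household_weight": {"2024": [1.0, 2.0]}}

assert read_array(flat_store, "household_weight") == read_array(
    nested_store, "household_weight"
)
```

Detecting the layout per variable, rather than once per file, means a reader does not need to know in advance which producer wrote the file.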
CI note: The The fail-closed guard from #676 is working as designed — it's catching that the CI environment's This same |
Keeping this open — complementary to #678. #678 (Max's PR) adds upload-gate contract validation, version tracking, and Microsimulation smoke tests. This PR adds:
No file overlap between the two PRs. Plan: let #678 merge first, then rebase this on top. The |
Summary
Addresses three systemic gaps identified in #677 after the dimension-mismatch incident (#675) and the fail-closed fix in #676.
- `validate_h5.py` utility checks that every variable's array length matches its entity's ID array, verifies `household_weight` exists and is non-zero, and confirms a reasonable household count. Integrated into `worker_script.py` so malformed H5 files are rejected before being marked as completed: if validation fails, the existing `except` block records the failure and the file is never uploaded.
- `pipeline.yaml` now prints the Modal `fc.object_id` via `::notice` annotations, making it trivial to correlate a GH Actions run to the Modal dashboard without guessing timestamps.
- `versioning.yaml` now uses `continue-on-error` on the checkout step and a follow-up step that emits a clear `::error` annotation linking to #677 (Harden artifact compatibility checks between policyengine-us-data and policyengine-us) when the `POLICYENGINE_GITHUB` PAT is expired or missing, replacing the previous cryptic git-auth failure.

Test plan
- `ruff format --check . && ruff check .` passes
- `pytest policyengine_us_data/tests/test_validate_h5.py`: 6 new tests pass (dimensions match, person variable wrong length, missing `household_weight`, all-zero weights, plus `validate_h5_or_raise` pass/raise variants)
- `pipeline.yaml` diff: `::notice` annotation syntax is correct
- `versioning.yaml` diff: `continue-on-error` + error step logic is sound
- `python -m policyengine_us_data.utils.validate_h5 <local_h5>` to verify CLI mode

Closes #677
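For reviewers checking the `::notice` and `::error` syntax in the two workflow diffs, the sketch below shows the general shape of a GitHub Actions workflow command as a small Python formatter. It is not code from this PR: the helper name `annotation`, the title text, and the example call ID are all hypothetical. The escaping rules follow the ones used by the GitHub Actions toolkit as I understand them.

```python
from typing import Optional


def _escape_data(value: str) -> str:
    # The message part must stay on one line, so control characters are encoded.
    return value.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(value: str) -> str:
    # Property values additionally encode the separators ':' and ','.
    return _escape_data(value).replace(":", "%3A").replace(",", "%2C")


def annotation(level: str, message: str, title: Optional[str] = None) -> str:
    """Format a workflow command such as '::notice title=...::message'."""
    if level not in ("notice", "warning", "error"):
        raise ValueError(f"unsupported annotation level: {level}")
    props = f" title={_escape_property(title)}" if title else ""
    return f"::{level}{props}::{_escape_data(message)}"


if __name__ == "__main__":
    # Hypothetical values; a real run would print the actual Modal call ID.
    print(annotation("notice", "fc-EXAMPLE123", title="Modal call ID"))
    print(annotation("error", "PAT is expired or missing, see #677"))
```

Printing a line of this form to standard output from any workflow step is what causes the runner to display it as an annotation on the run.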
🤖 Generated with Claude Code