
Harden dataset artifact compatibility checks#678

Open
MaxGhenis wants to merge 6 commits into main from codex/fix-677

Conversation

@MaxGhenis
Contributor

Summary

Closes #677.

This hardens the artifact compatibility contract between policyengine-us-data and policyengine-us in the paths that actually build and publish datasets.

What changed

  • added a reusable dataset contract validator that:
    • checks the installed policyengine-us version matches uv.lock
    • rejects HDF5 variables that are unknown to the active country package
    • rejects per-entity length mismatches across saved arrays
    • runs a file-based Microsimulation(dataset=Dataset.from_file(...)) smoke test
  • integrated that validator into upload_completed_datasets.py
  • made modal_app/data_build.py run dataset validation even when --upload is false, so full-suite Modal builds can catch artifact/schema skew before publication
  • recorded policyengine-us build metadata in the version manifest for promoted artifacts
  • added regression tests for dataset contract validation and updated version-manifest tests
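The contract checks described above can be sketched as a small pure-Python validator. This is an illustrative stand-in, not the actual policyengine-us-data API: `validate_contract`, `known_variables`, and `entity_of` are hypothetical names, and the real helper reads arrays from the HDF5 file rather than taking them as a dict.

```python
from typing import Dict, List, Sequence


def validate_contract(
    arrays: Dict[str, Sequence],
    known_variables: set,
    entity_of: Dict[str, str],
) -> List[str]:
    """Collect contract violations rather than failing on the first one."""
    errors = []
    # 1) every saved variable must exist in the active country package
    for name in arrays:
        if name not in known_variables:
            errors.append(f"unknown variable: {name}")
    # 2) all arrays belonging to one entity must share a single length
    lengths_by_entity: Dict[str, set] = {}
    for name, values in arrays.items():
        entity = entity_of.get(name)
        if entity is not None:
            lengths_by_entity.setdefault(entity, set()).add(len(values))
    for entity, lengths in sorted(lengths_by_entity.items()):
        if len(lengths) > 1:
            errors.append(f"{entity}: mismatched lengths {sorted(lengths)}")
    return errors
```

Returning a list of errors (instead of raising on the first) lets the upload gate report every violation in one pass.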

Validation

  • uv run ruff check policyengine_us_data/utils/policyengine.py policyengine_us_data/utils/dataset_validation.py policyengine_us_data/storage/upload_completed_datasets.py policyengine_us_data/utils/version_manifest.py policyengine_us_data/tests/conftest.py policyengine_us_data/tests/test_version_manifest.py policyengine_us_data/tests/test_dataset_validation.py modal_app/data_build.py
  • uv run pytest policyengine_us_data/tests/test_dataset_validation.py policyengine_us_data/tests/test_version_manifest.py -q
  • uv run python -m py_compile policyengine_us_data/utils/policyengine.py policyengine_us_data/utils/dataset_validation.py policyengine_us_data/storage/upload_completed_datasets.py policyengine_us_data/utils/version_manifest.py modal_app/data_build.py

@MaxGhenis
Contributor Author

Pushed a follow-up commit with additional tests for extra hardening confidence.

Added regression coverage for two more seams:

  • upload_completed_datasets.validate_dataset() now has synthetic-H5 tests that prove malformed datasets fail at the upload validator layer, not just the lower-level contract helper
  • modal_app.data_build now has sequencing tests that prove validation runs before upload, and still runs when upload is disabled
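The sequencing check can be sketched with unittest.mock's parent-mock pattern, which records the relative order of calls on sibling mocks. Here `run_build` is a hypothetical stand-in for the real modal_app.data_build entry point, not its actual signature:

```python
from unittest import mock


def run_build(validate, upload, do_upload=True):
    # stand-in for the build entry point: validation always runs,
    # upload only when enabled
    validate("dataset.h5")
    if do_upload:
        upload("dataset.h5")


# attaching both stubs to one parent records their relative order
parent = mock.Mock()
run_build(parent.validate, parent.upload, do_upload=True)
assert [c[0] for c in parent.mock_calls] == ["validate", "upload"]

parent.reset_mock()
run_build(parent.validate, parent.upload, do_upload=False)
assert [c[0] for c in parent.mock_calls] == ["validate"]
```

Each entry in `parent.mock_calls` unpacks to `(name, args, kwargs)`, so comparing the name list asserts ordering directly.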

Focused verification on the branch is now:

  • uv run pytest policyengine_us_data/tests/test_dataset_validation.py policyengine_us_data/tests/test_version_manifest.py policyengine_us_data/tests/test_upload_completed_datasets.py policyengine_us_data/tests/test_modal_data_build.py -q
  • 46 passed

Collaborator

@baogorek baogorek left a comment


Great work on this, Max — the upload-gate validation, PolicyEngineUSBuildInfo version tracking, and Microsimulation smoke test are exactly what #677 calls for. A few notes:

Bug: _top_level_dataset_lengths() misses pipeline-built H5 files

The core dimension-checking function (dataset_validation.py L46–54) only reads h5py.Dataset objects at the top level:

for name in h5_file.keys():
    obj = h5_file[name]
    if isinstance(obj, h5py.Dataset):
        dataset_lengths[name] = obj.shape[0]  # only top-level datasets are seen

But build_h5() writes a variable/period nested layout — groups at the top level, with datasets underneath keyed by year. For example, employment_income/2024 rather than a flat employment_income dataset. So for any pipeline-built file, this function returns an empty dict, which means:

  • Unknown-variable detection sees zero variables → no error raised
  • Entity-length checks have nothing to compare → silently passes
  • variable_count in the summary is 0

The Microsimulation smoke test downstream provides a safety net, but the explicit contract checks (which are the whole point of this PR) are effectively bypassed for the files we actually build and publish.

For reference, in #679 we handle this with a _read_array() helper that auto-detects both layouts:

def _read_array(f, var_name, period):
    item = f[var_name]
    if isinstance(item, h5py.Dataset):
        return item  # flat layout
    return item[str(period)]  # nested layout

A similar approach here (iterate h5_file.keys(), check if each is a Group, and if so read the period sub-dataset) would fix it.
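Concretely, the dual-layout iteration might look like the sketch below. It is written against plain dicts so it runs without h5py; in the real validator, the "has a shape" branch corresponds to `isinstance(obj, h5py.Dataset)` and the else-branch walks an `h5py.Group` of per-period datasets. The function name and the `name/period` key format are illustrative.

```python
def dataset_lengths(h5_like):
    """Collect array lengths from either a flat or a variable/period layout."""
    lengths = {}
    for name, obj in h5_like.items():
        if hasattr(obj, "shape"):
            # flat layout: top-level dataset, e.g. "age"
            lengths[name] = obj.shape[0]
        else:
            # nested layout: group of per-period datasets,
            # e.g. "employment_income" -> {"2024": array}
            for period, arr in obj.items():
                lengths[f"{name}/{period}"] = arr.shape[0]
    return lengths
```

With this shape, the entity-length comparison downstream sees one entry per variable/period pair instead of an empty dict on pipeline-built files.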

Complementary coverage in #679

Our PR #679 touches completely different files — no overlap. It adds:

  • Worker-level validation in worker_script.py — catches dimension issues right after build_h5(), before a file is marked completed (early gate vs. upload-time gate)
  • Pipeline observability in pipeline.yaml — surfaces Modal call ID via ::notice annotations
  • Versioning resilience in versioning.yaml — clear ::error when POLICYENGINE_GITHUB PAT is expired

These complement rather than conflict with the upload-gate validation here. I'd suggest merging this PR first (with the nested layout fix), then we can rebase #679 on top for the worker-level and workflow pieces.

Collaborator

@baogorek baogorek left a comment


Approving — let's get this in. The nested layout gap in _top_level_dataset_lengths() is real but not a blocker: the Microsimulation smoke test still catches bad files at upload time, and our #679 already handles both flat and nested layouts at the worker level. We'll pick that up when #679 rebases on top.

The unknown-variable check rejected variables that policyengine-us-data
legitimately creates (hourly_wage, count_under_18, etc.) because they
aren't in policyengine-us yet. Downgrade to a warning.

Also fix _dataset_lengths (renamed from _top_level_dataset_lengths) to
handle the variable/period nested layout that build_h5() produces.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@baogorek
Collaborator

baogorek commented Apr 1, 2026

Pushed a fix (1149aa0) for the CI failure. Two issues:

  1. Unknown-variable false positive — the validator rejected variables like hourly_wage, count_under_18 that policyengine-us-data creates but aren't in policyengine-us yet. Downgraded to a warning. This is what caused the 4h21m CI failure.

  2. Nested H5 layout — renamed _top_level_dataset_lengths to _dataset_lengths and added handling for the variable/period group layout that build_h5() produces. Without this, all contract checks silently passed with an empty dict on pipeline-built files.

Tests updated and passing locally.
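The warning downgrade might look like the following sketch; `check_known_variables` and its signature are illustrative stand-ins, not the exact helper in dataset_validation.py:

```python
import warnings


def check_known_variables(saved_names, country_variables):
    """Warn (rather than fail) on variables missing from the country package."""
    unknown = sorted(set(saved_names) - set(country_variables))
    if unknown:
        # variables created by policyengine-us-data (e.g. hourly_wage,
        # count_under_18) may not exist in policyengine-us yet, so this
        # is a warning instead of a hard validation error
        warnings.warn(f"variables not in country package: {unknown}")
    return unknown
```

Returning the unknown list keeps the information available for the validation summary even though it no longer blocks the build.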

# Conflicts:
#	policyengine_us_data/tests/test_dataset_validation.py
#	policyengine_us_data/tests/test_upload_completed_datasets.py
#	policyengine_us_data/utils/dataset_validation.py


Development

Successfully merging this pull request may close these issues.

Harden artifact compatibility checks between policyengine-us-data and policyengine-us
