
Harden dataset artifact compatibility checks#678

Open
MaxGhenis wants to merge 6 commits into main from codex/fix-677

Conversation

@MaxGhenis
Contributor

Summary

Closes #677.

This hardens the artifact compatibility contract between policyengine-us-data and policyengine-us in the paths that actually build and publish datasets.

What changed

  • added a reusable dataset contract validator that:
    • checks the installed policyengine-us version matches uv.lock
    • rejects HDF5 variables that are unknown to the active country package
    • rejects per-entity length mismatches across saved arrays
    • runs a file-based Microsimulation(dataset=Dataset.from_file(...)) smoke test
  • integrated that validator into upload_completed_datasets.py
  • made modal_app/data_build.py run dataset validation even when --upload is false, so full-suite Modal builds can catch artifact/schema skew before publication
  • recorded policyengine-us build metadata in the version manifest for promoted artifacts
  • added regression tests for dataset contract validation and updated version-manifest tests
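The contract checks described above can be sketched as a small pure-Python validator. This is an illustrative stand-in, not the actual policyengine-us-data API: `validate_contract`, `known_variables`, and `entity_of` are hypothetical names, and the real helper reads arrays from the HDF5 file rather than taking them as a dict.

```python
from typing import Dict, List, Sequence


def validate_contract(
    arrays: Dict[str, Sequence],
    known_variables: set,
    entity_of: Dict[str, str],
) -> List[str]:
    """Collect contract violations rather than failing on the first one."""
    errors = []
    # 1) every saved variable must exist in the active country package
    for name in arrays:
        if name not in known_variables:
            errors.append(f"unknown variable: {name}")
    # 2) all arrays belonging to one entity must share a single length
    lengths_by_entity: Dict[str, set] = {}
    for name, values in arrays.items():
        entity = entity_of.get(name)
        if entity is not None:
            lengths_by_entity.setdefault(entity, set()).add(len(values))
    for entity, lengths in sorted(lengths_by_entity.items()):
        if len(lengths) > 1:
            errors.append(f"{entity}: mismatched lengths {sorted(lengths)}")
    return errors
```

Returning a list of errors (instead of raising on the first) lets the upload gate report every violation in one pass.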

Validation

  • uv run ruff check policyengine_us_data/utils/policyengine.py policyengine_us_data/utils/dataset_validation.py policyengine_us_data/storage/upload_completed_datasets.py policyengine_us_data/utils/version_manifest.py policyengine_us_data/tests/conftest.py policyengine_us_data/tests/test_version_manifest.py policyengine_us_data/tests/test_dataset_validation.py modal_app/data_build.py
  • uv run pytest policyengine_us_data/tests/test_dataset_validation.py policyengine_us_data/tests/test_version_manifest.py -q
  • uv run python -m py_compile policyengine_us_data/utils/policyengine.py policyengine_us_data/utils/dataset_validation.py policyengine_us_data/storage/upload_completed_datasets.py policyengine_us_data/utils/version_manifest.py modal_app/data_build.py

@MaxGhenis
Contributor Author

Pushed a follow-up commit with additional tests for extra hardening confidence.

Added regression coverage for two more seams:

  • upload_completed_datasets.validate_dataset() now has synthetic-H5 tests that prove malformed datasets fail at the upload validator layer, not just the lower-level contract helper
  • modal_app.data_build now has sequencing tests that prove validation runs before upload, and still runs when upload is disabled
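The sequencing check can be sketched with unittest.mock's parent-mock pattern, which records the relative order of calls on sibling mocks. Here `run_build` is a hypothetical stand-in for the real modal_app.data_build entry point, not its actual signature:

```python
from unittest import mock


def run_build(validate, upload, do_upload=True):
    # stand-in for the build entry point: validation always runs,
    # upload only when enabled
    validate("dataset.h5")
    if do_upload:
        upload("dataset.h5")


# attaching both stubs to one parent records their relative order
parent = mock.Mock()
run_build(parent.validate, parent.upload, do_upload=True)
assert [c[0] for c in parent.mock_calls] == ["validate", "upload"]

parent.reset_mock()
run_build(parent.validate, parent.upload, do_upload=False)
assert [c[0] for c in parent.mock_calls] == ["validate"]
```

Each entry in `parent.mock_calls` unpacks to `(name, args, kwargs)`, so comparing the name list asserts ordering directly.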

Focused verification on the branch is now:

  • uv run pytest policyengine_us_data/tests/test_dataset_validation.py policyengine_us_data/tests/test_version_manifest.py policyengine_us_data/tests/test_upload_completed_datasets.py policyengine_us_data/tests/test_modal_data_build.py -q
  • 46 passed

Collaborator

@baogorek baogorek left a comment


Great work on this, Max — the upload-gate validation, PolicyEngineUSBuildInfo version tracking, and Microsimulation smoke test are exactly what #677 calls for. A few notes:

Bug: _top_level_dataset_lengths() misses pipeline-built H5 files

The core dimension-checking function (dataset_validation.py L46–54) only reads h5py.Dataset objects at the top level:

for name in h5_file.keys():
    obj = h5_file[name]
    if isinstance(obj, h5py.Dataset):
        dataset_lengths[name] = obj.shape[0]  # only top-level datasets are seen

But build_h5() writes a variable/period nested layout — groups at the top level, with datasets underneath keyed by year. For example, employment_income/2024 rather than a flat employment_income dataset. So for any pipeline-built file, this function returns an empty dict, which means:

  • Unknown-variable detection sees zero variables → no error raised
  • Entity-length checks have nothing to compare → silently passes
  • variable_count in the summary is 0

The Microsimulation smoke test downstream provides a safety net, but the explicit contract checks (which are the whole point of this PR) are effectively bypassed for the files we actually build and publish.

For reference, in #679 we handle this with a _read_array() helper that auto-detects both layouts:

def _read_array(f, var_name, period):
    item = f[var_name]
    if isinstance(item, h5py.Dataset):
        return item  # flat layout
    return item[str(period)]  # nested layout

A similar approach here (iterate h5_file.keys(), check if each is a Group, and if so read the period sub-dataset) would fix it.
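Concretely, the dual-layout iteration might look like the sketch below. It is written against plain dicts so it runs without h5py; in the real validator, the "has a shape" branch corresponds to `isinstance(obj, h5py.Dataset)` and the else-branch walks an `h5py.Group` of per-period datasets. The function name and the `name/period` key format are illustrative.

```python
def dataset_lengths(h5_like):
    """Collect array lengths from either a flat or a variable/period layout."""
    lengths = {}
    for name, obj in h5_like.items():
        if hasattr(obj, "shape"):
            # flat layout: top-level dataset, e.g. "age"
            lengths[name] = obj.shape[0]
        else:
            # nested layout: group of per-period datasets,
            # e.g. "employment_income" -> {"2024": array}
            for period, arr in obj.items():
                lengths[f"{name}/{period}"] = arr.shape[0]
    return lengths
```

With this shape, the entity-length comparison downstream sees one entry per variable/period pair instead of an empty dict on pipeline-built files.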

Complementary coverage in #679

Our PR #679 touches completely different files — no overlap. It adds:

  • Worker-level validation in worker_script.py — catches dimension issues right after build_h5(), before a file is marked completed (early gate vs. upload-time gate)
  • Pipeline observability in pipeline.yaml — surfaces Modal call ID via ::notice annotations
  • Versioning resilience in versioning.yaml — clear ::error when POLICYENGINE_GITHUB PAT is expired

These complement rather than conflict with the upload-gate validation here. I'd suggest merging this PR first (with the nested layout fix), then we can rebase #679 on top for the worker-level and workflow pieces.

Collaborator

@baogorek baogorek left a comment


Approving — let's get this in. The nested layout gap in _top_level_dataset_lengths() is real but not a blocker: the Microsimulation smoke test still catches bad files at upload time, and our #679 already handles both flat and nested layouts at the worker level. We'll pick that up when #679 rebases on top.

The unknown-variable check rejected variables that policyengine-us-data
legitimately creates (hourly_wage, count_under_18, etc.) because they
aren't in policyengine-us yet. Downgrade to a warning.

Also fix _dataset_lengths (renamed from _top_level_dataset_lengths) to
handle the variable/period nested layout that build_h5() produces.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@baogorek
Collaborator

baogorek commented Apr 1, 2026

Pushed a fix (1149aa0) for the CI failure. Two issues:

  1. Unknown-variable false positive — the validator rejected variables like hourly_wage, count_under_18 that policyengine-us-data creates but aren't in policyengine-us yet. Downgraded to a warning. This is what caused the 4h21m CI failure.

  2. Nested H5 layout — renamed _top_level_dataset_lengths to _dataset_lengths and added handling for the variable/period group layout that build_h5() produces. Without this, all contract checks silently passed with an empty dict on pipeline-built files.

Tests updated and passing locally.
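The warning downgrade might look like the following sketch; `check_known_variables` and its signature are illustrative stand-ins, not the exact helper in dataset_validation.py:

```python
import warnings


def check_known_variables(saved_names, country_variables):
    """Warn (rather than fail) on variables missing from the country package."""
    unknown = sorted(set(saved_names) - set(country_variables))
    if unknown:
        # variables created by policyengine-us-data (e.g. hourly_wage,
        # count_under_18) may not exist in policyengine-us yet, so this
        # is a warning instead of a hard validation error
        warnings.warn(f"variables not in country package: {unknown}")
    return unknown
```

Returning the unknown list keeps the information available for the validation summary even though it no longer blocks the build.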

# Conflicts:
#	policyengine_us_data/tests/test_dataset_validation.py
#	policyengine_us_data/tests/test_upload_completed_datasets.py
#	policyengine_us_data/utils/dataset_validation.py


Development

Successfully merging this pull request may close these issues.

Harden artifact compatibility checks between policyengine-us-data and policyengine-us
