Conversation
|
Pushed a follow-up test commit for extra hardening confidence. Added regression coverage for two more seams:
Focused verification on the branch is now:
|
baogorek
left a comment
There was a problem hiding this comment.
Great work on this, Max — the upload-gate validation, PolicyEngineUSBuildInfo version tracking, and Microsimulation smoke test are exactly what #677 calls for. A few notes:
Bug: _top_level_dataset_lengths() misses pipeline-built H5 files
The core dimension-checking function (dataset_validation.py L46–54) only reads h5py.Dataset objects at the top level:
for name in h5_file.keys():
obj = h5_file[name]
if isinstance(obj, h5py.Dataset):
dataset_lengths[name] = ...But build_h5() writes a variable/period nested layout — groups at the top level, with datasets underneath keyed by year. For example, employment_income/2024 rather than a flat employment_income dataset. So for any pipeline-built file, this function returns an empty dict, which means:
- Unknown-variable detection sees zero variables → no error raised
- Entity-length checks have nothing to compare → silently passes
variable_countin the summary is 0
The Microsimulation smoke test downstream provides a safety net, but the explicit contract checks (which are the whole point of this PR) are effectively bypassed for the files we actually build and publish.
For reference, in #679 we handle this with a _read_array() helper that auto-detects both layouts:
def _read_array(f, var_name, period):
item = f[var_name]
if isinstance(item, h5py.Dataset):
return item # flat layout
return item[str(period)] # nested layoutA similar approach here (iterate h5_file.keys(), check if each is a Group, and if so read the period sub-dataset) would fix it.
Complementary coverage in #679
Our PR #679 touches completely different files — no overlap. It adds:
- Worker-level validation in
worker_script.py— catches dimension issues right afterbuild_h5(), before a file is marked completed (early gate vs. upload-time gate) - Pipeline observability in
pipeline.yaml— surfaces Modal call ID via::noticeannotations - Versioning resilience in
versioning.yaml— clear::errorwhenPOLICYENGINE_GITHUBPAT is expired
These complement rather than conflict with the upload-gate validation here. I'd suggest merging this PR first (with the nested layout fix), then we can rebase #679 on top for the worker-level and workflow pieces.
baogorek
left a comment
There was a problem hiding this comment.
Approving — let's get this in. The nested layout gap in _top_level_dataset_lengths() is real but not a blocker: the Microsimulation smoke test still catches bad files at upload time, and our #679 already handles both flat and nested layouts at the worker level. We'll pick that up when #679 rebases on top.
The unknown-variable check rejected variables that policyengine-us-data legitimately creates (hourly_wage, count_under_18, etc.) because they aren't in policyengine-us yet. Downgrade to a warning. Also fix _dataset_lengths (renamed from _top_level_dataset_lengths) to handle the variable/period nested layout that build_h5() produces. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Pushed a fix (1149aa0) for the CI failure. Two issues:
Tests updated and passing locally. |
# Conflicts: # policyengine_us_data/tests/test_dataset_validation.py # policyengine_us_data/tests/test_upload_completed_datasets.py # policyengine_us_data/utils/dataset_validation.py
Summary
Closes #677.
This hardens the artifact compatibility contract between
policyengine-us-dataandpolicyengine-usin the paths that actually build and publish datasets.What changed
policyengine-usversion matchesuv.lockMicrosimulation(dataset=Dataset.from_file(...))smoke testupload_completed_datasets.pymodal_app/data_build.pyrun dataset validation even when--uploadis false, so full-suite Modal builds can catch artifact/schema skew before publicationpolicyengine-usbuild metadata in the version manifest for promoted artifactsValidation
uv run ruff check policyengine_us_data/utils/policyengine.py policyengine_us_data/utils/dataset_validation.py policyengine_us_data/storage/upload_completed_datasets.py policyengine_us_data/utils/version_manifest.py policyengine_us_data/tests/conftest.py policyengine_us_data/tests/test_version_manifest.py policyengine_us_data/tests/test_dataset_validation.py modal_app/data_build.pyuv run pytest policyengine_us_data/tests/test_dataset_validation.py policyengine_us_data/tests/test_version_manifest.py -quv run python -m py_compile policyengine_us_data/utils/policyengine.py policyengine_us_data/utils/dataset_validation.py policyengine_us_data/storage/upload_completed_datasets.py policyengine_us_data/utils/version_manifest.py modal_app/data_build.py