Add dimension guards for ORG hourly_wage imputation#676

Merged
baogorek merged 2 commits into main from fix/hourly-wage-dimension-mismatch
Apr 1, 2026
@hua7450 (Contributor) commented Apr 1, 2026

Summary

  • Adds dimension validation at every stage of the ORG imputation pipeline to prevent hourly_wage (and other ORG variables) from being stored with the wrong number of entries
  • Fixes CPS.downsample() to drop variables unknown to the current policyengine-us instead of silently keeping stale (un-downsampled) arrays

Closes #675

Context

enhanced_cps_2024.h5 on HuggingFace has hourly_wage with 284,250 entries (the raw ORG donor row count) instead of matching the person count in the simulation. This causes a ValueError in Microsimulation() that blocks CI for all open PRs on policyengine-us.

Changes

Dimension guards (4 files)

| Location | Guard |
| --- | --- |
| org.py: predict_org_features() | Assert QRF output rows == receiver rows |
| cps.py: add_org_labor_market_inputs() | Assert receiver frame == CPS person count; assert each stored variable == person count |
| extended_cps.py: _splice_cps_only_predictions() | Assert stage-2 predictions == entity half-length |
| source_impute.py: _impute_org() | Assert predictions == dataset person count |
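
For readers unfamiliar with this kind of guard, here is a minimal sketch of a length check of the sort listed above. It is not the code from this PR: the helper name `check_person_length`, its signature, and the toy data are hypothetical, and it raises `ValueError` rather than using a bare `assert` so that the check still runs when Python is started with `-O`.

```python
from typing import Sequence


def check_person_length(
    name: str, values: Sequence, n_persons: int, stage: str
) -> None:
    """Raise if an imputed variable does not have one entry per person.

    Hypothetical helper for illustration; the PR's actual guards may differ.
    """
    if len(values) != n_persons:
        raise ValueError(
            f"{stage}: {name} has {len(values)} entries, expected {n_persons}"
        )


# Toy data (assumed): 5 receiver persons, but 8 ORG donor rows.
n_persons = 5
good_predictions = [21.5, 0.0, 18.25, 33.0, 0.0]
donor_sized = [0.0] * 8

# Correctly sized predictions pass silently.
check_person_length(
    "hourly_wage", good_predictions, n_persons, "predict_org_features"
)

# Donor-sized output is rejected at the stage that produced it.
try:
    check_person_length(
        "hourly_wage", donor_sized, n_persons, "predict_org_features"
    )
    guard_message = None
except ValueError as err:
    guard_message = str(err)
```

Including the stage name in the message is one way to make a failure point at the pipeline step that produced the wrong-length array, rather than surfacing later inside `Microsimulation()`.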

Downsample fix

CPS.downsample() previously kept un-downsampled arrays for variables not recognized by the current policyengine-us (via continue). Now it drops them, preventing dimension mismatches when new input variables are added.
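
The difference between keeping and dropping unknown variables can be sketched as follows. This is an illustration under assumptions, not the body of `CPS.downsample()`: the function `downsample_arrays`, the plain-dict data layout, and the variable names in the toy data are all hypothetical.

```python
def downsample_arrays(data, known_variables, keep_indices):
    """Subset each known variable to keep_indices and drop unknown ones.

    Hypothetical sketch: `data` maps variable names to per-record lists.
    """
    result = {}
    dropped = []
    for name, values in data.items():
        if name not in known_variables:
            # The old behaviour amounted to `result[name] = values` here,
            # which left a full-length array next to downsampled ones.
            dropped.append(name)
            continue
        result[name] = [values[i] for i in keep_indices]
    return result, dropped


# Toy data (assumed): 4 records, one variable unknown to the model.
data = {
    "age": [30, 41, 52, 63],
    "hourly_wage": [15.0, 22.5, 0.0, 31.0],
    "new_input": [1, 2, 3, 4],
}
known_variables = {"age", "hourly_wage"}
keep_indices = [0, 2]

result, dropped = downsample_arrays(data, known_variables, keep_indices)
```

After the call, every array left in `result` has `len(keep_indices)` entries, and `dropped` records which variables were removed so the caller can log them.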

Test plan

  • All 7 test_org.py tests pass
  • All 22 test_extended_cps.py tests pass
  • CI passes
  • Regenerate enhanced_cps_2024.h5 and verify hourly_wage has correct dimensions

🤖 Generated with Claude Code

Prevent hourly_wage and other ORG variables from being stored with
wrong dimensions (e.g. ORG donor count instead of CPS person count)
by adding validation checks at every stage of the pipeline:

- predict_org_features: assert output matches receiver frame size
- add_org_labor_market_inputs: assert predictions match CPS person count
- _splice_cps_only_predictions: assert predictions match entity half-size
- _impute_org (source_impute): assert predictions match person count
- CPS.downsample: drop unknown variables instead of keeping stale arrays

Closes #675

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MaxGhenis (Contributor) commented

Pushed a follow-up commit that broadens this fix beyond the original CPS-only patch.

What changed:

  • moved the downsample logic into a shared helper
  • made downsample fail closed if the dataset contains variables missing from the active country package
  • added an entity-length validation pass before saving the downsampled artifact
  • applied the same guardrails to SCF.downsample() so this bug class does not survive there
  • added focused regression tests for unknown-variable skew and mismatched entity lengths
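
A rough sketch of what a fail-closed check plus an entity-length validation pass could look like is given below. The function name `validate_downsampled`, the `variable_entity` and `entity_counts` mappings, and the error messages are assumptions made for illustration; the shared helper in the follow-up commit may be structured quite differently.

```python
def validate_downsampled(data, variable_entity, entity_counts):
    """Fail closed on unknown variables, then check per-entity lengths.

    Hypothetical sketch:
      data            maps variable name -> list of values
      variable_entity maps variable name -> entity name (the known variables)
      entity_counts   maps entity name -> expected number of records
    """
    unknown = sorted(name for name in data if name not in variable_entity)
    if unknown:
        raise ValueError(
            f"variables missing from the country package: {unknown}"
        )
    mismatched = {
        name: (len(values), entity_counts[variable_entity[name]])
        for name, values in data.items()
        if len(values) != entity_counts[variable_entity[name]]
    }
    if mismatched:
        raise ValueError(
            f"entity length mismatch (actual, expected): {mismatched}"
        )


def error_for(data):
    """Return the validation error message for data, or None if it passes."""
    try:
        validate_downsampled(data, variable_entity, entity_counts)
    except ValueError as err:
        return str(err)
    return None


# Toy data (assumed): 3 persons in 2 households after downsampling.
entity_counts = {"person": 3, "household": 2}
variable_entity = {
    "age": "person",
    "hourly_wage": "person",
    "household_weight": "household",
}
ok = {
    "age": [30, 41, 52],
    "hourly_wage": [15.0, 0.0, 22.5],
    "household_weight": [1.0, 2.0],
}

ok_error = error_for(ok)
unknown_error = error_for({**ok, "new_input": [1, 2, 3]})
length_error = error_for({**ok, "hourly_wage": [0.0] * 8})
```

Running a check like this immediately before saving means a skewed artifact is rejected at build time instead of being published and failing later for downstream users.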

I also filed #677 for the remaining release-contract work: exact policyengine-us pinning, pre-publish artifact validation, and a fresh-env Microsimulation() smoke test after build/publication.

@baogorek baogorek merged commit f8a54cb into main Apr 1, 2026
6 of 7 checks passed
@baogorek baogorek deleted the fix/hourly-wage-dimension-mismatch branch April 1, 2026 20:14
Development

Successfully merging this pull request may close these issues.

enhanced_cps_2024.h5 has dimension mismatch for hourly_wage — breaks all Microsimulation() calls

3 participants