Complete calibration_internals.ipynb — document remaining pipeline stages #643

@juaristi22

Description

`docs/calibration_internals.ipynb` is the internal implementation reference for the calibration pipeline. It currently documents 3 of 9 major steps and takes too long to run end-to-end because it loads full production datasets. The sections that exist are well-written; this issue tracks what needs to be added.

What the notebook already covers

  • Calibration matrix structure — concept, sparsity analysis, target grouping, how clones map to columns (cells 1–20)
  • Hierarchical uprating — HIF factors, state/district/county target reconciliation (cells 21–29)
  • `build_h5()` API — function signature, parameter semantics, usage examples for national/state/district/city H5s (cells 30–33)

What is missing

The following pipeline steps have no documentation in the notebook. Each entry lists the key files to read.

1. Clone creation and geography assignment

`datasets/cps/extended_cps.py`, `calibration/clone_and_assign.py`

How the base CPS is cloned N times and each clone is assigned a random census block. This is logically the first step of the pipeline and is currently assumed to have already happened by the time the notebook starts.
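A toy sketch could anchor this section (illustrative names and a record-per-dict layout, not the real `clone_and_assign.py` API):

```python
import random

def clone_and_assign(records, n_clones, blocks, seed=0):
    """Replicate each base CPS record n_clones times and assign every
    clone a random census block; the seeded RNG keeps assignments
    reproducible across runs."""
    rng = random.Random(seed)
    clones = []
    for rec in records:
        for clone_idx in range(n_clones):
            # Copy the record, tag it with its clone index and block.
            clones.append(dict(rec, clone_idx=clone_idx,
                               census_block=rng.choice(blocks)))
    return clones

base = [{"household_id": h} for h in range(3)]
clones = clone_and_assign(base, n_clones=2,
                          blocks=["060371234001", "360610456002"])
assert len(clones) == len(base) * 2
```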

2. Source variable imputation (ACS / SIPP / SCF)

`calibration/source_impute.py`

QRF imputation of non-PUF variables:

  • `rent` and `real_estate_taxes` from ACS (with state as a predictor)
  • `tip_income`, `bank_account_assets`, `stock_assets`, and `bond_assets` from SIPP (no state predictor)
  • `net_worth`, `auto_loan_balance`, and `auto_loan_interest` from SCF

Key decision to document: which donor surveys provide state-level detail, which don't, and what that means for geographic accuracy.
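The state-predictor distinction could be shown with a toy example; here a trivial donor-pool draw stands in for QRF (the actual `source_impute.py` model is not reproduced):

```python
import random

def impute(recipients, donors, target, predictors=(), seed=0):
    """Stand-in for QRF: pool donor values that match the recipient on
    the listed predictors and draw one at random. With
    predictors=("state",) the draw is state-specific (the ACS case);
    with predictors=() every recipient shares one national pool (the
    SIPP case), so state-level geographic detail is lost."""
    rng = random.Random(seed)
    return [rng.choice([d[target] for d in donors
                        if all(d[p] == r[p] for p in predictors)])
            for r in recipients]

donors = [{"state": "CA", "rent": 2400},
          {"state": "CA", "rent": 2100},
          {"state": "NY", "rent": 1800}]
# With the state predictor, an NY recipient can only draw NY rents.
assert impute([{"state": "NY"}], donors, "rent",
              predictors=("state",)) == [1800]
```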

3. PUF cloning and tax variable imputation

`calibration/puf_impute.py`

The most novel step. The CPS is cloned 2x: one copy retains CPS values for most variables (with some overridden by QRF), the other gets full PUF imputations with zeroed weights. 70+ tax variables are imputed via stratified QRF. Social Security sub-components (retirement / disability / survivors / dependents) are reconciled from the imputed total via a secondary QRF + age-based heuristic.
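The cloning structure itself is simple enough to sketch (illustrative field names, not the `puf_impute.py` schema):

```python
def clone_for_puf(cps_records):
    """Sketch of the 2x clone: copy A keeps CPS values (some later
    overridden by QRF) and its original weight; copy B is marked for
    full PUF imputation and enters the calibrator with zero weight,
    so optimization decides how much of it to use."""
    cps_copy = [dict(r, source="cps") for r in cps_records]
    puf_copy = [dict(r, source="puf_imputed", weight=0.0)
                for r in cps_records]
    return cps_copy + puf_copy

doubled = clone_for_puf([{"household_id": 0, "weight": 1.5}])
assert [r["weight"] for r in doubled] == [1.5, 0.0]
```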

4. Calibration matrix assembly — per-state simulation

`calibration/unified_matrix_builder.py`

The matrix is built by running PolicyEngine per state, per clone. This section documents how that per-state parallel simulation works, how domain constraints (e.g., `tax_unit_is_filer = 1` for IRS targets) gate which records contribute to which matrix rows, and how county-dependent variables (e.g., ACA-PTC with county-level premium data) are handled.
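The domain-gating idea lends itself to a toy-input cell. A minimal sketch (the real `unified_matrix_builder.py` interfaces are not reproduced):

```python
def matrix_row(records, value_var, domain_var=None):
    """One calibration-matrix row: each record (column) contributes its
    value for the target variable, gated to zero when it falls outside
    the target's domain — e.g. IRS targets only count records with
    tax_unit_is_filer == 1."""
    return [float(r[value_var])
            if domain_var is None or r.get(domain_var) == 1 else 0.0
            for r in records]

records = [{"agi": 100.0, "tax_unit_is_filer": 1},
           {"agi": 50.0, "tax_unit_is_filer": 0}]
# The non-filer contributes nothing to the IRS-domain row.
assert matrix_row(records, "agi", "tax_unit_is_filer") == [100.0, 0.0]
```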

5. L0 weight optimization

`calibration/unified_calibration.py`, `utils/l0.py`

The loss function (calibration loss + L0 penalty via Hard Concrete gates), what the hyperparameters BETA/GAMMA/ZETA control, the jitter/annealing schedule, convergence criteria, and how to choose between the `local` (λ=1e-8, ~3–4M retained records) and `national` (λ=1e-4, ~50K retained records) presets. This is the core algorithmic innovation and currently has no documentation beyond the CLI flag reference in `calibration.md`.
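A toy Hard Concrete gate would make this section concrete. The sketch below uses the defaults from Louizos et al.'s paper, which may differ from the repo's actual BETA/GAMMA/ZETA settings:

```python
import math, random

def hard_concrete_gate(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1, rng=random):
    """One stochastic Hard Concrete gate: BETA is the concrete
    temperature, GAMMA/ZETA stretch the sample past [0, 1] so that
    clipping leaves point mass at exactly 0 (record dropped) and
    exactly 1 (record kept outright)."""
    u = min(max(rng.random(), 1e-12), 1 - 1e-12)
    s = 1 / (1 + math.exp(-(math.log(u) - math.log(1 - u) + log_alpha) / beta))
    s_bar = s * (zeta - gamma) + gamma   # stretch to (gamma, zeta)
    return min(1.0, max(0.0, s_bar))     # clip back to [0, 1]

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Differentiable penalty term: P(gate != 0). Summed over records
    and scaled by lambda, this is the term the local (1e-8) vs
    national (1e-4) presets trade off against calibration loss."""
    return 1 / (1 + math.exp(-(log_alpha - beta * math.log(-gamma / zeta))))
```

Raising λ pushes each record's `expected_l0` cost up relative to its contribution to fit, which is why the `national` preset retains far fewer records than `local`.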

6. Takeup randomization

`utils/takeup.py`

Deterministic block-level seeded RNG for SNAP, ACA-PTC, TANF, SSI, Medicaid, Head Start, and voluntary filing. The seeding is required to keep the matrix builder and the H5 builder consistent — the same (variable, household_id, clone_idx) must produce the same draw. This is a critical correctness invariant that is currently only mentioned in passing in `calibration.md`.
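The invariant is easy to demonstrate with a hash-based sketch (the hashing scheme here is illustrative, not what `utils/takeup.py` necessarily does):

```python
import hashlib

def takeup_draw(variable, household_id, clone_idx, salt="takeup-v1"):
    """Deterministic uniform in [0, 1) keyed on
    (variable, household_id, clone_idx): hashing the key instead of
    carrying RNG state means the matrix builder and the H5 builder
    independently reproduce the identical draw."""
    key = f"{salt}|{variable}|{household_id}|{clone_idx}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

# Same key -> same draw, wherever it is computed.
assert takeup_draw("snap", 12, 3) == takeup_draw("snap", 12, 3)
assert takeup_draw("snap", 12, 3) != takeup_draw("aca_ptc", 12, 3)
```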

7. Weight expansion and H5 assembly

`calibration/publish_local_area.py`

How the flat weight vector is expanded back into per-clone per-record weights, how entity membership (person → tax_unit → household → spm_unit) is preserved during expansion, and how the geographic hierarchy (state → district → city) is applied via county filter probability scaling.
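Both halves of this step can be sketched with toy inputs (clone-major ordering of the flat vector is an assumption, as is the name of each helper):

```python
def expand_weights(flat_weights, n_records, n_clones):
    """Reshape the optimizer's flat weight vector into a per-clone,
    per-record table. Entity membership maps (person -> tax_unit ->
    household -> spm_unit) are untouched: only household weights
    change, so members inherit their household's weight implicitly."""
    assert len(flat_weights) == n_records * n_clones
    return [flat_weights[c * n_records:(c + 1) * n_records]
            for c in range(n_clones)]

def apply_geo_filter(weights, county_in_area_probs):
    """Geographic narrowing: scale each record's weight by the
    probability that its assigned county lies inside the target
    district or city."""
    return [w * p for w, p in zip(weights, county_in_area_probs)]

assert expand_weights([1, 2, 3, 4, 5, 6], n_records=3, n_clones=2) \
    == [[1, 2, 3], [4, 5, 6]]
assert apply_geo_filter([2.0, 4.0], [0.5, 1.0]) == [1.0, 4.0]
```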

8. Diagnostics — reading calibration output

`calibration/unified_calibration.py` (diagnostics section), `calibration_output/`

What each column of `calibration_log.csv` and `unified_diagnostics.csv` means, what good vs poor convergence looks like per target group, and how to identify which targets are driving residual error.
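A small helper like this could serve as a worked example once the real column names are documented (the CSV schema is not reproduced here):

```python
def worst_targets(names, estimates, targets, top=3):
    """Rank targets by relative residual |estimate - target| / |target|
    to surface which ones drive remaining calibration error."""
    ranked = sorted(zip(names, estimates, targets),
                    key=lambda nte: abs(nte[1] - nte[2]) / abs(nte[2]),
                    reverse=True)
    return [name for name, _, _ in ranked[:top]]

assert worst_targets(["snap", "agi", "medicaid"],
                     [100, 50, 10], [100, 100, 100], top=2) \
    == ["medicaid", "agi"]
```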

9. Pipeline orchestration

`modal_app/pipeline.py`

Run ID format (`{version}{sha[:8]}{timestamp}`), Modal volume layout, step dependency graph, resume logic, HuggingFace staging vs promotion path, and `meta.json` structure.
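The run-ID shape could open the section with a one-cell example. Only the three-part concatenation comes from the format string above; the timestamp layout (UTC, `YYYYMMDDHHMMSS`) is an assumption to be checked against `modal_app/pipeline.py`:

```python
from datetime import datetime, timezone

def make_run_id(version, sha, now=None):
    """Build a run ID of the stated shape {version}{sha[:8]}{timestamp}."""
    ts = (now or datetime.now(timezone.utc)).strftime("%Y%m%d%H%M%S")
    return f"{version}{sha[:8]}{ts}"

run_id = make_run_id("v2", "abcdef1234567890",
                     now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
assert run_id == "v2abcdef1220240102030405"
```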


The runtime problem

Most cells currently run against full production datasets, so a reader needs HuggingFace credentials to execute the notebook at all, and even then a full run takes hours. New sections should use one of:

  • Toy inputs (e.g., 100 CPS records × 3 clones × 5 targets) so every cell finishes in <30s
  • Static code excerpts — explanation + code snippet, no live execution

Recommendation: mix both. Complex algorithmic steps (L0, matrix assembly) are clearer with toy-input examples. API-reference steps (diagnostics, orchestration) are fine as static excerpts.

Suggested order to tackle

  1. L0 optimization (§5) — nothing else makes sense without understanding the objective
  2. PUF imputation (§3) — highest debugging value; most novel
  3. Takeup randomization (§6) — small but fixes a critical invisible correctness invariant
  4. Diagnostics (§8) — fastest win; immediately useful for anyone running calibration
  5. Matrix assembly details (§4) — builds on what's already in the notebook
  6. Weight expansion (§7) — fills the gap between the weight vector and the H5 file
  7. Clone creation (§1) — logically first; conceptually simpler than the above
  8. Source imputation (§2) — standard QRF pattern; lower novelty
  9. Pipeline orchestration (§9) — ops-focused; lower priority for algorithm understanding
