Complete calibration_internals.ipynb — document remaining pipeline stages #643

@juaristi22

Description

`docs/calibration_internals.ipynb` is the internal implementation reference for the calibration pipeline. It currently documents 3 of 9 major steps and takes too long to run end-to-end because it loads full production datasets. The sections that exist are well-written; this issue tracks what needs to be added.

What the notebook already covers

  • Calibration matrix structure — concept, sparsity analysis, target grouping, how clones map to columns (cells 1–20)
  • Hierarchical uprating — HIF factors, state/district/county target reconciliation (cells 21–29)
  • `build_h5()` API — function signature, parameter semantics, usage examples for national/state/district/city H5s (cells 30–33)

What is missing

The following pipeline steps have no documentation in the notebook. Each entry lists the key files to read.

1. Clone creation and geography assignment

`datasets/cps/extended_cps.py`, `calibration/clone_and_assign.py`

How the base CPS is cloned N times and each clone is assigned a random census block. This is logically the first step of the pipeline and is currently assumed to have already happened by the time the notebook starts.
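A toy sketch could anchor this section (illustrative names and a record-per-dict layout, not the real `clone_and_assign.py` API):

```python
import random

def clone_and_assign(records, n_clones, blocks, seed=0):
    """Replicate each base CPS record n_clones times and assign every
    clone a random census block; the seeded RNG keeps assignments
    reproducible across runs."""
    rng = random.Random(seed)
    clones = []
    for rec in records:
        for clone_idx in range(n_clones):
            # Copy the record, tag it with its clone index and block.
            clones.append(dict(rec, clone_idx=clone_idx,
                               census_block=rng.choice(blocks)))
    return clones

base = [{"household_id": h} for h in range(3)]
clones = clone_and_assign(base, n_clones=2,
                          blocks=["060371234001", "360610456002"])
assert len(clones) == len(base) * 2
```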

2. Source variable imputation (ACS / SIPP / SCF)

`calibration/source_impute.py`

QRF imputation of non-PUF variables:

  • `rent` and `real_estate_taxes` from ACS (with state as a predictor)
  • `tip_income`, `bank_account_assets`, `stock_assets`, and `bond_assets` from SIPP (no state predictor)
  • `net_worth`, `auto_loan_balance`, and `auto_loan_interest` from SCF

Key decision to document: which donor surveys provide state-level detail, which don't, and what that means for geographic accuracy.
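The state-predictor distinction could be shown with a toy example; here a trivial donor-pool draw stands in for QRF (the actual `source_impute.py` model is not reproduced):

```python
import random

def impute(recipients, donors, target, predictors=(), seed=0):
    """Stand-in for QRF: pool donor values that match the recipient on
    the listed predictors and draw one at random. With
    predictors=("state",) the draw is state-specific (the ACS case);
    with predictors=() every recipient shares one national pool (the
    SIPP case), so state-level geographic detail is lost."""
    rng = random.Random(seed)
    return [rng.choice([d[target] for d in donors
                        if all(d[p] == r[p] for p in predictors)])
            for r in recipients]

donors = [{"state": "CA", "rent": 2400},
          {"state": "CA", "rent": 2100},
          {"state": "NY", "rent": 1800}]
# With the state predictor, an NY recipient can only draw NY rents.
assert impute([{"state": "NY"}], donors, "rent",
              predictors=("state",)) == [1800]
```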

3. PUF cloning and tax variable imputation

`calibration/puf_impute.py`

The most novel step. The CPS is cloned 2x: one copy retains CPS values for most variables (with some overridden by QRF), the other gets full PUF imputations with zeroed weights. 70+ tax variables are imputed via stratified QRF. Social Security sub-components (retirement / disability / survivors / dependents) are reconciled from the imputed total via a secondary QRF + age-based heuristic.
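The cloning structure itself is simple enough to sketch (illustrative field names, not the `puf_impute.py` schema):

```python
def clone_for_puf(cps_records):
    """Sketch of the 2x clone: copy A keeps CPS values (some later
    overridden by QRF) and its original weight; copy B is marked for
    full PUF imputation and enters the calibrator with zero weight,
    so optimization decides how much of it to use."""
    cps_copy = [dict(r, source="cps") for r in cps_records]
    puf_copy = [dict(r, source="puf_imputed", weight=0.0)
                for r in cps_records]
    return cps_copy + puf_copy

doubled = clone_for_puf([{"household_id": 0, "weight": 1.5}])
assert [r["weight"] for r in doubled] == [1.5, 0.0]
```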

4. Calibration matrix assembly — per-state simulation

`calibration/unified_matrix_builder.py`

The matrix is built by running PolicyEngine per state, per clone. This section documents how that per-state parallel simulation works, how domain constraints (e.g., `tax_unit_is_filer = 1` for IRS targets) gate which records contribute to which matrix rows, and how county-dependent variables (e.g., ACA-PTC with county-level premium data) are handled.
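The domain-gating idea lends itself to a toy-input cell. A minimal sketch (the real `unified_matrix_builder.py` interfaces are not reproduced):

```python
def matrix_row(records, value_var, domain_var=None):
    """One calibration-matrix row: each record (column) contributes its
    value for the target variable, gated to zero when it falls outside
    the target's domain — e.g. IRS targets only count records with
    tax_unit_is_filer == 1."""
    return [float(r[value_var])
            if domain_var is None or r.get(domain_var) == 1 else 0.0
            for r in records]

records = [{"agi": 100.0, "tax_unit_is_filer": 1},
           {"agi": 50.0, "tax_unit_is_filer": 0}]
# The non-filer contributes nothing to the IRS-domain row.
assert matrix_row(records, "agi", "tax_unit_is_filer") == [100.0, 0.0]
```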

5. L0 weight optimization

`calibration/unified_calibration.py`, `utils/l0.py`

The loss function (calibration loss + L0 penalty via Hard Concrete gates), what the hyperparameters BETA/GAMMA/ZETA control, the jitter/annealing schedule, convergence criteria, and how to choose between the `local` (λ=1e-8, ~3–4M retained records) and `national` (λ=1e-4, ~50K retained records) presets. This is the core algorithmic innovation and currently has no documentation beyond the CLI flag reference in `calibration.md`.
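A toy Hard Concrete gate would make this section concrete. The sketch below uses the defaults from Louizos et al.'s paper, which may differ from the repo's actual BETA/GAMMA/ZETA settings:

```python
import math, random

def hard_concrete_gate(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1, rng=random):
    """One stochastic Hard Concrete gate: BETA is the concrete
    temperature, GAMMA/ZETA stretch the sample past [0, 1] so that
    clipping leaves point mass at exactly 0 (record dropped) and
    exactly 1 (record kept outright)."""
    u = min(max(rng.random(), 1e-12), 1 - 1e-12)
    s = 1 / (1 + math.exp(-(math.log(u) - math.log(1 - u) + log_alpha) / beta))
    s_bar = s * (zeta - gamma) + gamma   # stretch to (gamma, zeta)
    return min(1.0, max(0.0, s_bar))     # clip back to [0, 1]

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Differentiable penalty term: P(gate != 0). Summed over records
    and scaled by lambda, this is the term the local (1e-8) vs
    national (1e-4) presets trade off against calibration loss."""
    return 1 / (1 + math.exp(-(log_alpha - beta * math.log(-gamma / zeta))))
```

Raising λ pushes each record's `expected_l0` cost up relative to its contribution to fit, which is why the `national` preset retains far fewer records than `local`.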

6. Takeup randomization

`utils/takeup.py`

Deterministic block-level seeded RNG for SNAP, ACA-PTC, TANF, SSI, Medicaid, Head Start, and voluntary filing. The seeding is required to keep the matrix builder and the H5 builder consistent — the same (variable, household_id, clone_idx) must produce the same draw. This is a critical correctness invariant that is currently only mentioned in passing in `calibration.md`.
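The invariant is easy to demonstrate with a hash-based sketch (the hashing scheme here is illustrative, not what `utils/takeup.py` necessarily does):

```python
import hashlib

def takeup_draw(variable, household_id, clone_idx, salt="takeup-v1"):
    """Deterministic uniform in [0, 1) keyed on
    (variable, household_id, clone_idx): hashing the key instead of
    carrying RNG state means the matrix builder and the H5 builder
    independently reproduce the identical draw."""
    key = f"{salt}|{variable}|{household_id}|{clone_idx}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

# Same key -> same draw, wherever it is computed.
assert takeup_draw("snap", 12, 3) == takeup_draw("snap", 12, 3)
assert takeup_draw("snap", 12, 3) != takeup_draw("aca_ptc", 12, 3)
```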

7. Weight expansion and H5 assembly

`calibration/publish_local_area.py`

How the flat weight vector is expanded back into per-clone per-record weights, how entity membership (person → tax_unit → household → spm_unit) is preserved during expansion, and how the geographic hierarchy (state → district → city) is applied via county filter probability scaling.
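Both halves of this step can be sketched with toy inputs (clone-major ordering of the flat vector is an assumption, as is the name of each helper):

```python
def expand_weights(flat_weights, n_records, n_clones):
    """Reshape the optimizer's flat weight vector into a per-clone,
    per-record table. Entity membership maps (person -> tax_unit ->
    household -> spm_unit) are untouched: only household weights
    change, so members inherit their household's weight implicitly."""
    assert len(flat_weights) == n_records * n_clones
    return [flat_weights[c * n_records:(c + 1) * n_records]
            for c in range(n_clones)]

def apply_geo_filter(weights, county_in_area_probs):
    """Geographic narrowing: scale each record's weight by the
    probability that its assigned county lies inside the target
    district or city."""
    return [w * p for w, p in zip(weights, county_in_area_probs)]

assert expand_weights([1, 2, 3, 4, 5, 6], n_records=3, n_clones=2) \
    == [[1, 2, 3], [4, 5, 6]]
assert apply_geo_filter([2.0, 4.0], [0.5, 1.0]) == [1.0, 4.0]
```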

8. Diagnostics — reading calibration output

`calibration/unified_calibration.py` (diagnostics section), `calibration_output/`

What each column of `calibration_log.csv` and `unified_diagnostics.csv` means, what good vs poor convergence looks like per target group, and how to identify which targets are driving residual error.
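A small helper like this could serve as a worked example once the real column names are documented (the CSV schema is not reproduced here):

```python
def worst_targets(names, estimates, targets, top=3):
    """Rank targets by relative residual |estimate - target| / |target|
    to surface which ones drive remaining calibration error."""
    ranked = sorted(zip(names, estimates, targets),
                    key=lambda nte: abs(nte[1] - nte[2]) / abs(nte[2]),
                    reverse=True)
    return [name for name, _, _ in ranked[:top]]

assert worst_targets(["snap", "agi", "medicaid"],
                     [100, 50, 10], [100, 100, 100], top=2) \
    == ["medicaid", "agi"]
```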

9. Pipeline orchestration

`modal_app/pipeline.py`

Run ID format (`{version}{sha[:8]}{timestamp}`), Modal volume layout, step dependency graph, resume logic, HuggingFace staging vs promotion path, and `meta.json` structure.
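The run-ID shape could open the section with a one-cell example. Only the three-part concatenation comes from the format string above; the timestamp layout (UTC, `YYYYMMDDHHMMSS`) is an assumption to be checked against `modal_app/pipeline.py`:

```python
from datetime import datetime, timezone

def make_run_id(version, sha, now=None):
    """Build a run ID of the stated shape {version}{sha[:8]}{timestamp}."""
    ts = (now or datetime.now(timezone.utc)).strftime("%Y%m%d%H%M%S")
    return f"{version}{sha[:8]}{ts}"

run_id = make_run_id("v2", "abcdef1234567890",
                     now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
assert run_id == "v2abcdef1220240102030405"
```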


The runtime problem

Most cells currently run against full production datasets, so a reader needs HuggingFace credentials to execute the notebook at all, and even then a full run takes hours. New sections should use one of:

  • Toy inputs (e.g., 100 CPS records × 3 clones × 5 targets) so every cell finishes in <30s
  • Static code excerpts — explanation + code snippet, no live execution

Recommendation: mix both. Complex algorithmic steps (L0, matrix assembly) are clearer with toy-input examples. API-reference steps (diagnostics, orchestration) are fine as static excerpts.

Suggested order to tackle

  1. L0 optimization (§5) — nothing else makes sense without understanding the objective
  2. PUF imputation (§3) — highest debugging value; most novel
  3. Takeup randomization (§6) — small but fixes a critical invisible correctness invariant
  4. Diagnostics (§8) — fastest win; immediately useful for anyone running calibration
  5. Matrix assembly details (§4) — builds on what's already in the notebook
  6. Weight expansion (§7) — fills the gap between the weight vector and the H5 file
  7. Clone creation (§1) — logically first; conceptually simpler than the above
  8. Source imputation (§2) — standard QRF pattern; lower novelty
  9. Pipeline orchestration (§9) — ops-focused; lower priority for algorithm understanding
