diff --git a/compute-feasibility-advisor-proposal.md b/compute-feasibility-advisor-proposal.md new file mode 100644 index 00000000..2560d127 --- /dev/null +++ b/compute-feasibility-advisor-proposal.md @@ -0,0 +1,201 @@ +# Compute Feasibility Advisor for AutoIntent + +- **Date:** 2026-05-23 +- **Status:** Proposal (pre-implementation) +- **Audience:** AutoIntent maintainers / contributor picking up the task + +## Problem + +AutoIntent's main strength is letting a user kick off a full search-space optimization with one call: + +```python +pipeline = Pipeline.from_preset("transformers-heavy") +pipeline.fit(dataset) +``` + +The cost of that convenience is that users — especially those running on a laptop, a single consumer GPU, or a free cloud instance — cannot tell ahead of time whether their hardware can carry the configuration they have just selected. + +Concrete failure cases we see today: + +- `transformers-heavy` fine-tunes `microsoft/deberta-v3-large` for up to 30 epochs across 40 HPO trials. That needs ~12–18 GB VRAM (full fine-tune, fp32) and many hours of wall time on a single GPU. A user with an 8 GB card finds out by OOM, often several minutes into a run. +- Swapping `intfloat/multilingual-e5-large-instruct` (2 GB) for `sentence-transformers/all-MiniLM-L6-v2` (90 MB) changes the resource bill by an order of magnitude — but nothing surfaces this difference up front. +- Disk is a silent failure mode: a search space referencing several large checkpoints can pull >10 GB into the HF cache before any training starts. + +The target audience for this feature is users with limited resources who pick a preset, hit `fit()`, and want to know within a second whether they should change something. + +## Proposed solution: pre-flight resource advisor + +Add a **pre-flight advisor** that, given a parsed search space and a dataset, estimates worst-case disk, RAM, VRAM, and wall-time requirements from public Hugging Face Hub metadata and a small set of formulas, then prints a clear summary with red/yellow/green warnings. By default it is **report-only and never blocks the run**; an opt-in **reduce-to-fit** mode additionally prunes the search space to fit detected hardware. + +### Scope + +The advisor analyses only the **local, model-bearing** modules whose footprint can be derived from HF Hub metadata. Everything else is either trivial or out of band. + + +| Module category | In scope? | Reason | +| -------------------------------------------------------------------------------- | --------- | -------------------------------------------------- | +| `SentenceTransformerEmbeddingConfig` | yes | local transformer, dominant cost on small machines | +| `VllmEmbeddingConfig` | yes | local transformer with extra engine overhead | +| `HFModelConfig`-based scorers (`bert`, `lora`, `ptuning`, `dnnc`, cross-encoder) | yes | the actual heavyweights | +| GCN scorer when configured with a transformer backbone | yes | inherits the backbone cost | +| `OpenaiEmbeddingConfig` | no | no local resources to estimate | +| `HashingVectorizerEmbeddingConfig` | no | trivial cost | +| `knn`, `mlknn`, `linear`, `sklearn`, `catboost`, `description` | no | negligible next to a fine-tune | +| `decision` and `regex` nodes | no | negligible | + + +Rationale: the user's real risk is the heavy transformer-backed modules. A cheap module cannot be the reason a run fails for resource reasons; we don't owe an estimate for it. + +### Inputs + +- The parsed `OptimizationConfig` (search space, HPO config, embedder/transformer configs). +- The training `Dataset` (for `dataset_size` and an approximate token-length distribution). +- Detected local hardware: + - Total / available RAM via `psutil`. + - Free disk on the AutoIntent / HF cache directory via `shutil.disk_usage`. + - Accelerator detection, in priority order: + - **CUDA:** per-GPU VRAM and device name via `torch.cuda`. + - **MPS (Apple Silicon):** detected via `torch.backends.mps.is_available()`. Apple chips use unified memory, so there is no separate VRAM pool — the "VRAM budget" is a fraction of total system RAM. Default budget = 70 % of total RAM (matching the macOS `PYTORCH_MPS_HIGH_WATERMARK_RATIO` default) with the remainder reserved for the OS and other apps. The fraction is exposed as a knob. + - **CPU only:** when neither is available. + +### Output + +A structured estimate plus a human-readable summary printed to the logger. Example: + +``` +Compute feasibility check +───────────────────────── +Available : 8 GB VRAM (NVIDIA RTX 3060), 32 GB RAM, 120 GB free disk +Estimated worst-case requirements for this search space: + Disk : 5.2 GB (3 unique checkpoints) + RAM : ~4 GB + VRAM : ~14 GB ⚠ exceeds available + Time : ~6 h (single-GPU, fp32, rough) + +Drivers of cost: + scoring.bert microsoft/deberta-v3-large full fine-tune × 40 trials × 30 epochs → ~14 GB VRAM, ~5 h + embedder intfloat/multilingual-e5-large-instruct → ~2.2 GB VRAM + +Suggestions: + • Enable mixed precision (fp16/bf16) on the bert scorer + • Reduce batch_size from 64 to 16 or 32 + • Try preset `transformers-light` or `classic-medium` + +These numbers are heuristic upper bounds, not measurements. +``` + +Numbers are reported with honest precision (one significant figure for time, two for memory) and an explicit "estimate, not measurement" disclaimer. + +### Algorithm (proposal, allowed to adjust) + +1. **Collect candidates.** Walk the search space; collect every unique `(module_type, model_name, mode)` triple, where `mode ∈ {inference, lora, full-finetune}`. Also collect HPO knobs that drive cost: `n_trials`, `epochs`, `batch_size`, `max_length`, `dtype` (fp16/bf16/fp32). +2. **Resolve checkpoints.** For each unique `model_name`, query HF Hub for safetensors metadata to read parameter count and weight dtype. Fall back to file-size aggregation if safetensors metadata is missing. Fall back to a "unknown — heuristic only" tag with low-confidence labelling if HF Hub is offline or the repo is private. +3. **Apply formulas.** + - **Disk** = sum over unique checkpoints of total file size, plus a small fixed overhead per checkpoint for tokenizers and config. + - **RAM** = max over modules of `params × dtype_bytes + dataset_tokens × 4 bytes`, treated as a loose upper bound for tokenized buffers. + - **VRAM per module:** + - Inference embedder: `params × dtype_bytes × ~1.3` (small constant for activations). + - Full fine-tune (`bert`, GCN backbone, soft-prompt `ptuning`): `params × dtype_bytes × (1 + 1 + 2)` for weights + grads + Adam state, halved when fp16/bf16 mixed precision is configured. + - LoRA: inference VRAM + a small adapter constant. + - Reranker (cross-encoder, `dnnc`): inference VRAM × small factor for the reranking pass. + - **Time per module** = `n_trials × epochs × (dataset_size / batch_size) × per_step_seconds(params, max_length, device_class)`, where `per_step_seconds` is a small static lookup table keyed on coarse device class (`cpu`, `low-gpu`, `mid-gpu`, `high-gpu`, `apple-silicon`) auto-detected from `torch.cuda.get_device_name` or `platform`/`torch.backends.mps`. Total time = sum across modules. MPS time numbers are coarser than CUDA's (one tier for now); we accept that. +4. **Compare to detected hardware.** Per-dimension status is green / yellow / red against a configurable headroom (defaults: **red** if estimate > 100 % of available, **yellow** if > 70 %). On MPS, "VRAM" and "RAM" estimates draw from the same physical pool; we compare *the larger of the two* against the unified-memory budget rather than each independently. +5. **Render summary.** Log at INFO. If any dimension is red, emit at WARNING so it shows in non-logging contexts. + +### Failure modes + +- **HF Hub offline or private repo:** fall back to "unknown model — name-pattern heuristic only", explicit low-confidence label, never raise. +- **No accelerator (no CUDA and no MPS):** report VRAM as N/A and mark GPU-only modules as "requires GPU" without estimating a (misleading) CPU wall time. +- **MPS configured but a module is incompatible:** vLLM in particular does not run on MPS. Flag the module as "unsupported on MPS" rather than estimating; do not raise. +- **MPS with CPU fallback ops:** some PyTorch ops fall back to CPU on MPS, inflating system-RAM usage and wall time beyond the heuristic. Note this in the disclaimer; we don't try to model it. +- **vLLM configured but not installed:** still estimate (the VRAM accounting is similar), note that the engine itself has additional overhead not captured. +- **Estimate wildly wrong vs. reality:** always-on disclaimer in the printed summary that these are heuristic upper bounds. + +### Reduce-to-fit mode + +The feasibility check has two modes sharing the same estimation pipeline: + +- **Report mode (default).** Print the summary, return the structured estimate, let the run proceed regardless of severity. +- **Reduce-to-fit mode (opt-in).** Additionally prune the search space to fit detected hardware before the run starts. Same estimates, same comparisons — just one extra step that produces a reduced search space. + +Using the same per-module estimates, the pruner applies three least-destructive steps in order: + +1. **Filter discrete-choice hyperparameters.** For lists of cost-driving values (model name, batch size, training epochs), keep only entries whose worst-case estimate fits. +2. **Cap continuous ranges.** For `{low, high}` ranges of cost-driving parameters, lower the upper bound to the largest fitting value. Ranges of non-cost parameters (learning rate, decision thresholds) are not touched. +3. **Drop module variants.** If a module entry has any required hyperparameter with no satisfiable value left, drop that module entry from its node's search space. + +Guard rails: + +- If pruning would leave any node's search space empty, the pruner **raises**. We don't silently produce a non-runnable pipeline, and we don't quietly fall back to report-only — failing loudly is the right contract for a mode whose whole purpose is to make the run feasible. The error message points the user toward a lighter preset. +- Time is not used as a filter — only memory and disk are. Time is still reported. +- Headroom thresholds are intentionally generous to avoid over-pruning and are configurable. + +Alongside the standard estimate, the caller receives a structured description of what was filtered, capped, and dropped, plus the resulting search space and its recomputed (now green) estimate. + +**Drawbacks worth surfacing.** + +- **Silent narrowing of intent.** A search space deliberately written to include heavy/light variants for comparison gets halved. The mode is opt-in for this reason. +- **Over-pruning when our formulas overestimate.** A 30 %-high estimate on a borderline configuration throws away a run that would have succeeded. Generous headroom defaults mitigate; the knob is exposed. +- **Hard failure when nothing fits.** Raising is intentional — silent degradation to report-only would defeat the mode's purpose — but it is a sharper edge than report mode has. +- **Pre-trial only.** The rewrite happens before any HPO trial starts. This is fine because the search space is treated as immutable across a study, but worth calling out so nobody tries to make this dynamic later. + +## Alternatives considered and rejected + +### B. Smoke-test calibration + +Run each unique module for one mini-batch / one step before the real fit, measure peak RAM and VRAM with `psutil`, `tracemalloc`, and `torch.cuda.max_memory_allocated`, time the step, and extrapolate to the full search space. + +Rejected because: + +- It **downloads weights just to estimate** — the disk-headroom check we wanted to provide is defeated by the act of performing it. +- It can **OOM while predicting OOM**, exactly on the constrained hardware that is the target audience. +- It adds **seconds to minutes** of wall time before `fit()` does anything, surprising users. +- It needs per-module "tiny run" hooks; not every scorer has a clean "stop after one step" path. +- For OpenAI- or vLLM-served embedders, a smoke test costs real money or starts the engine. +- Still not accurate due to CUDA and CPU cache, memory heating and so on. + +### C. Curated benchmark table + +Ship a JSON in the package with measured VRAM and per-step time for the bundled-preset checkpoints, broken out by hardware class (cpu / mid-gpu / high-gpu) and mode (inference / lora / full-finetune). Fall back to heuristics for unknown checkpoints. + +Rejected because: + +- **Maintenance burden:** every new model added to a preset would need entries across the hardware × precision × mode matrix. +- Numbers **go stale** when `transformers` updates change defaults (attention impl, dtype, gradient checkpointing). +- It still needs the chosen-solution heuristics as a long-tail fallback — so it adds work on top of Option A without replacing it. +- **Confident-but-wrong is worse than honest-but-fuzzy.** A table that says "4 GB on 4090" when the user OOMs at 4.5 GB damages trust more than a clearly-labelled range would. + +### D. Layered (A by default, opt-in B, embedded table from C, local actuals cache) + +Combine all three: ship A as the fast path, allow `calibrate=True` to trigger B for heavy modules only, embed a small table from C for the bundled-preset checkpoints, and write actuals from every real run to a local cache that feeds back into future estimates. + +Rejected because: + +- **Implementation surface multiplies:** two estimation code paths to keep consistent, a cache schema with versioning and eviction, two failure modes to document. +- **Discoverability:** users may not learn about `calibrate=True` and the realized value compresses back to roughly Option A anyway. +- The team's bandwidth doesn't justify the marginal accuracy gain over A for the target audience. + +## Comparison + + +| Dimension | A (chosen) | B (smoke-test) | C (benchmark table) | D (layered) | +| -------------------------------- | ------------------------------ | ---------------------- | ---------------------------------- | ------------------------------------- | +| Wall time at pre-flight | < 1 s | seconds–minutes | < 1 s | < 1 s default, s–min when calibrating | +| Accuracy on common checkpoints | medium | high | high | high | +| Accuracy on custom checkpoints | medium | high | medium (fallback) | medium–high | +| Time-estimate quality | low–medium | high | high | high | +| Disk pre-download required | no | yes | no | only when calibrating | +| Risk of OOM during the check | none | real | none | only when calibrating | +| Network usage | 1 cached call per unique model | none beyond normal fit | none | combination | +| Implementation effort | small | large | medium + ongoing benchmark refresh | large + cache infra | +| Ongoing maintenance | low (formulas only) | low | high | high | +| Friendly to offline / air-gapped | with fallback | yes | yes | partial | + + +The chosen solution accepts a real accuracy gap on time and a moderate accuracy gap on VRAM in exchange for the only profile that fits the target audience's constraints: zero added wall time, zero added downloads, zero added failure modes, and a small one-time implementation cost. + +## Out of scope (possible follow-ups) + +- Live resource observability during `fit()` (peak RAM / VRAM per trial, abort on overrun). +- A learned calibration cache from real runs to refine estimates over time. +