Skip to content

perf(pm): source install scheduler architecture#3029

Draft
elrrrrrrr wants to merge 14 commits into
nextfrom
perf/pm-source-installer
Draft

perf(pm): source install scheduler architecture#3029
elrrrrrrr wants to merge 14 commits into
nextfrom
perf/pm-source-installer

Conversation

@elrrrrrrr
Copy link
Copy Markdown
Contributor

@elrrrrrrr elrrrrrrr commented May 21, 2026

Summary

Draft source PR for the installer half of #2948. This is not the final review unit; it is the source branch used to verify the full installer stack in one place.

This source branch now matches the current installer split-stack top (perf/pm-split-installer-scheduler-core), so the split PRs should compose back to this diff exactly.

Covers From #2948

  • Primitive downloader cache operations.
  • Sync cloner primitive for rayon workers.
  • InstallScheduler-owned download/extract/clone inflight and waiter state.
  • Fresh-lock install scheduling.
  • Stale-lock resolver event scheduling.
  • Pipeline worker/receiver cleanup.
  • Clone/download OnceMap fallback removal from the install hot path.
  • HTTP tarball install docs updated for scheduler-owned cache resolution.

Actual Split Stack

Notes

The scheduler core is split so each PR has a clear review focus: primitives, fresh-lock scheduling, tests, stale-lock event wiring, and final deletion of the old pipeline module. The production PRs avoid dead-code staging and do not use clippy allow escapes.

Validation

Validated at installer split-stack top:

  • cargo fmt
  • cargo check -p utoo-pm
  • cargo test -p utoo-pm service::install_scheduler -- --nocapture (4 passed)
  • cargo clippy --all-targets -- -D warnings --no-deps

pack-napi warns locally because next.js is a symlink in this worktree; clippy exits successfully.

@elrrrrrrr elrrrrrrr added A-Pkg Manager Area: Package Manager benchmark Run pm-bench on PR labels May 21, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new InstallScheduler to manage concurrent package downloads and extractions, replacing the previous pipeline-based implementation. It also increases the manifest fetch concurrency limit for non-semver registries to 256. Feedback focuses on improving memory efficiency and robustness: specifically, tightening backpressure in the download queue to prevent excessive buffering, removing inappropriate panic recovery logic in clone workers, and restoring bounded channels and streaming serialization in the manifest store to prevent memory growth and unnecessary allocations.


fn pump_downloads(&mut self) {
while self.download_active.len() < self.download_limit
&& self.extract_backlog() < self.download_limit
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The backpressure check self.extract_backlog() < self.download_limit may allow an excessively large number of downloaded tarballs to be buffered in memory if download_limit is high (e.g., 256 for non-semver registries). Since extraction is limited by extract_limit (max 8), the backlog should be constrained by a smaller value relative to the extraction capacity to prevent high memory usage.

Suggested change
&& self.extract_backlog() < self.download_limit
&& self.extract_backlog() < self.extract_limit * 2

Comment on lines +470 to +480
let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
clone_package_from_cache_sync(
&job.spec.package.name,
&job.spec.package.version,
&job.spec.package.tarball_url,
&job.cache_path,
&job.spec.target,
)
.map_err(|e| format!("{e:#}"))
}))
.unwrap_or_else(|_| Err("install clone worker panicked".to_string()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The use of std::panic::catch_unwind here violates the general rule regarding panic handling. Panics should be treated as unrecoverable bugs rather than being caught and converted into errors. If a clone worker panics, it should be allowed to propagate or cause the process to terminate.

                let result = clone_package_from_cache_sync(
                    &job.spec.package.name,
                    &job.spec.package.version,
                    &job.spec.package.tarball_url,
                    &job.cache_path,
                    &job.spec.target,
                )
                .map_err(|e| format!("{e:#}"));
References
  1. Do not implement recovery logic for panics. Panics should be treated as unrecoverable bugs that need to be fixed, not as transient, recoverable errors.

Comment thread crates/pm/src/util/manifest_store.rs Outdated
use std::path::{Path, PathBuf};
use std::sync::Arc;
use std::sync::mpsc::{self, SyncSender, TrySendError};
use std::sync::mpsc::{self, Sender};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Reverting the use of SyncSender and TrySendError is necessary to restore the bounded channel and opportunistic write policy for the manifest store.

Suggested change
use std::sync::mpsc::{self, Sender};
use std::sync::mpsc::{self, SyncSender, TrySendError};

Comment thread crates/pm/src/util/manifest_store.rs Outdated

struct ManifestWriter {
tx: SyncSender<ManifestWriteJob>,
tx: Sender<ManifestWriteJob>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The manifest writer should use a SyncSender to support bounded channel operations and backpressure.

Suggested change
tx: Sender<ManifestWriteJob>,
tx: SyncSender<ManifestWriteJob>,

Comment thread crates/pm/src/util/manifest_store.rs Outdated
impl ManifestWriter {
fn spawn() -> Self {
let (tx, rx) = mpsc::sync_channel(MANIFEST_WRITE_QUEUE_CAPACITY);
let (tx, rx) = mpsc::channel();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Changing the manifest writer channel to an unbounded channel removes protection against memory growth if disk I/O stalls. Restoring a bounded sync_channel allows the store to maintain its opportunistic write policy and drop writes when the backlog is too large.

Suggested change
let (tx, rx) = mpsc::channel();
let (tx, rx) = mpsc::sync_channel(1024);

Comment thread crates/pm/src/util/manifest_store.rs Outdated
Comment on lines 135 to 137
if self.tx.send(job).is_err() {
tracing::debug!("Manifest store writer stopped before accepting write");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The manifest writer should use try_send to maintain its opportunistic nature, dropping writes if the queue is full rather than blocking the resolver or allowing unbounded memory growth.

        match self.tx.try_send(job) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => {
                tracing::debug!("Manifest store writer queue full; dropping cache write");
            }
            Err(TrySendError::Disconnected(_)) => {
                tracing::debug!("Manifest store writer stopped before accepting write");
            }
        }

Comment thread crates/pm/src/util/manifest_store.rs Outdated
Comment on lines 152 to 171
let bytes = match serde_json::to_vec(value) {
Ok(b) => b,
Err(e) => {
tracing::debug!("Failed to serialize {path:?}: {e}");
return;
}
};
match std::fs::write(path, &bytes) {
Ok(()) => {}
Err(e) if e.kind() == ErrorKind::NotFound => {
Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
if let Some(parent) = path.parent()
&& let Err(e) = fs::create_dir_all(parent)
&& let Err(e) = std::fs::create_dir_all(parent)
{
tracing::debug!("Failed to create {parent:?}: {e}");
return;
}
if let Err(e) = write_compact_sync(path, value) {
if let Err(e) = std::fs::write(path, &bytes) {
tracing::debug!("Failed to write {path:?}: {e}");
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using serde_json::to_vec followed by std::fs::write allocates a full Vec<u8> for the serialized manifest, which can be large. Restoring the use of BufWriter and serde_json::to_writer is more memory-efficient as it streams the data directly to the file.

    let write_to_file = |p: &std::path::Path| -> std::io::Result<()> {
        let file = std::fs::File::create(p)?;
        let mut writer = std::io::BufWriter::new(file);
        serde_json::to_writer(&mut writer, value).map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
        writer.flush()
    };

    match write_to_file(path) {
        Ok(()) => {}
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
            if let Some(parent) = path.parent()
                && let Err(e) = std::fs::create_dir_all(parent)
            {
                tracing::debug!("Failed to create {parent:?}: {e}");
                return;
            }
            if let Err(e) = write_to_file(path) {
                tracing::debug!("Failed to write {path:?}: {e}");
            }
        }
        Err(e) => tracing::debug!("Failed to write {path:?}: {e}"),
    }

@elrrrrrrr elrrrrrrr force-pushed the perf/pm-source-installer branch from 8d7b025 to 1b9becd Compare May 21, 2026 17:43
@elrrrrrrr elrrrrrrr force-pushed the perf/pm-source-installer branch from 22beccd to d826277 Compare May 21, 2026 22:32
@elrrrrrrr elrrrrrrr force-pushed the perf/pm-source-installer branch from d826277 to 0964dc0 Compare May 21, 2026 23:09
@elrrrrrrr elrrrrrrr force-pushed the perf/pm-source-installer branch from 0964dc0 to c244b67 Compare May 21, 2026 23:39
Move install-statistics ownership from global util-layer atomics into the
InstallScheduler, and tighten the clone primitive API per code-guard.

- SchedulerState owns InstallCounts {downloaded, reused, cloned}; handle_done
  counts by phase, shutdown() returns them, install.rs prints them and drops
  its baseline/delta snapshot (each install owns its scheduler, so nested
  global installs stay isolated without it).
- Remove global CLONE_COUNT / DOWNLOAD_COUNT / REUSE_COUNT atomics and
  DownloadStats / download_stats / clone_count from the util layer.
- Replace the find_real bool with a CacheLayout enum (Wrapped / Flat).
- Collapse the 5-arg clone entry points into a PackageClone struct; remove the
  trivial clone_package_from_cache_sync wrapper.
- extract_to_cache returns ExtractOutcome {Reused, Extracted}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
elrrrrrrr and others added 3 commits May 25, 2026 23:22
- OpDone -> StageReport, SeededCache -> CacheResolved, PackageFetch -> PackageRef.
- Distinguish the two completion channels: async_ops (tokio JoinHandle) vs
  clone_done_rx (rayon blocking clone).
- handle_done -> handle_report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Untangle the three concerns that were mixed in downloader.rs:
- downloader.rs: pure network — download_bytes + client + is_git_url/is_registry_tarball_url.
- package_cache.rs (new): cache-path layout, seeded-slot lookups, registry_cache_*,
  extract_to_cache, download_and_extract_to_cache, ExtractOutcome — owns the _resolved contract.
- extractor.rs: unchanged raw gzip/tar primitive (extract_and_write), now called only by package_cache.

Pure move + re-import, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

📊 pm-bench-phases · 90c38f4 · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

PM wall ±σ user sys RSS pgMinor
bun 8.17s 0.24s 10.83s 6.73s 592M 269.3K
utoo-next 7.76s 0.36s 11.92s 8.94s 1012M 127.9K
utoo-npm 8.63s 1.51s 12.05s 9.03s 1.04G 120.9K
utoo 7.02s 0.11s 11.56s 8.61s 1020M 119.1K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 27.0K 16.1K 1.22G 7M 1.93G 1.81G 1M
utoo-next 150.1K 107.9K 1.20G 6M 1.78G 1.77G 2M
utoo-npm 164.6K 118.3K 1.20G 6M 1.78G 1.76G 2M
utoo 134.3K 88.2K 1.20G 6M 1.78G 1.77G 2M

p1_resolve

PM wall ±σ user sys RSS pgMinor
bun 2.92s 0.12s 4.46s 0.94s 488M 175.0K
utoo-next 3.52s 0.30s 5.81s 1.44s 619M 81.3K
utoo-npm 3.46s 0.09s 6.00s 1.69s 623M 78.4K
utoo 3.35s 0.04s 5.77s 1.45s 621M 86.4K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 16.1K 3.6K 205M 3M 109M - 1M
utoo-next 65.7K 71.0K 203M 3M 7M 3M 2M
utoo-npm 97.4K 97.1K 203M 3M 7M 3M 2M
utoo 66.4K 74.9K 203M 3M 7M 3M 2M

p3_cold_install

PM wall ±σ user sys RSS pgMinor
bun 5.77s 0.27s 6.08s 6.39s 520M 184.5K
utoo-next 9.11s 2.27s 5.38s 7.44s 474M 61.7K
utoo-npm 7.54s 2.44s 5.22s 7.14s 447M 67.0K
utoo 5.83s 0.24s 5.22s 7.29s 501M 58.1K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 12.5K 8.8K 1.02G 5M 1.82G 1.82G 1M
utoo-next 149.3K 47.0K 1018M 4M 1.76G 1.76G 2M
utoo-npm 132.1K 49.8K 1018M 4M 1.76G 1.76G 2M
utoo 117.1K 56.8K 1018M 4M 1.76G 1.76G 2M

p4_warm_link

PM wall ±σ user sys RSS pgMinor
bun 2.20s 0.03s 0.14s 1.23s 136M 32.1K
utoo-next 2.02s 0.05s 0.47s 2.42s 80M 18.7K
utoo-npm 1.87s 0.05s 0.50s 2.38s 80M 19.0K
utoo 1.66s 0.13s 0.38s 2.23s 53M 11.7K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 245 14 428K 13K 1.93G 1.80G 1M
utoo-next 40.8K 16.7K 19K 6K 1.76G 1.76G 2M
utoo-npm 43.5K 17.9K 21K 9K 1.76G 1.76G 2M
utoo 22.4K 14.3K 18K 15K 1.77G 1.76G 2M

npmmirror.com: no output captured.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Pkg Manager Area: Package Manager benchmark Run pm-bench on PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant