WIP: shared map cache #346

Open
eugenevinitsky wants to merge 10 commits into 3.0 from ev/reuse_maps

@eugenevinitsky eugenevinitsky commented Mar 18, 2026

Summary

  • Adds shared map infrastructure to reduce memory usage by caching and reusing map data (road elements, grid maps, neighbor caches) across PufferDrive instances
  • Fixes use-after-free segfault when multiple PufferDrive instances in the same worker process shared a map cache that got rebuilt
  • Skips binaries dir in code isolation to avoid slow NFS walks on cluster

Test plan

  • Verified fix at scale: 96 envs × 16 workers × 1024 agents on CARLA 3D maps (32GB, ~170K SPS)
  • Verify render works at scale (job running)
  • Full production training run

HEAVILY LLM generated, just a WIP

WaelDLZ and others added 9 commits January 16, 2026 18:23
Add a Flag to build_ocean so Raylib can work on Debian 11
Environments sharing the same map binary now share read-only road elements,
grid maps, and neighbor caches via a reference-counted SharedMapData cache.
This eliminates duplicate allocations (~36-73 MB per CARLA map) across envs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
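
The reference-counting idea in this commit can be sketched in Python. This is a minimal illustration, not the C implementation in binding.c: the `MapCache` class, its `acquire`/`release` methods, and the loader callable are hypothetical names chosen for the example. Only `SharedMapData` appears in the commit message, and its fields here are assumed.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMapData:
    """Hypothetical stand-in for the read-only data shared per map binary."""
    path: str
    road_elements: list = field(default_factory=list)
    ref_count: int = 0


class MapCache:
    """Hypothetical cache: one SharedMapData per map path, freed at zero refs."""

    def __init__(self, loader):
        self._loader = loader  # callable: path -> list of road elements
        self._entries = {}     # map path -> SharedMapData

    def acquire(self, path):
        entry = self._entries.get(path)
        if entry is None:
            # First user of this map: load it exactly once.
            entry = SharedMapData(path, self._loader(path))
            self._entries[path] = entry
        entry.ref_count += 1
        return entry

    def release(self, path):
        entry = self._entries[path]
        entry.ref_count -= 1
        if entry.ref_count == 0:
            # Last user is gone, so the shared data can be dropped.
            del self._entries[path]

    def __len__(self):
        return len(self._entries)


loads = []


def fake_loader(path):
    # Records each load so duplicate allocations would be visible.
    loads.append(path)
    return ["lane", "crosswalk"]


cache = MapCache(fake_loader)
env_a_map = cache.acquire("town01.bin")
env_b_map = cache.acquire("town01.bin")
```

Both environments end up holding the same object, so the map is loaded once no matter how many environments use it.
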
Walk pufferlib/ in Python and create per-file symlinks, skipping
resources/drive/binaries (60K+ map files). Symlink that dir as a
single entry instead. Removes dependency on rsync/cp -rs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
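
A rough sketch of the walk described in this commit, assuming a helper named `symlink_tree` (a hypothetical name, not the function in scripts/submit_cluster.py). It uses `os.walk` with in-place pruning of `dirnames` so the skipped directory is never traversed. The temporary directory and file names at the bottom are made up for the demo.

```python
import os
import tempfile


def symlink_tree(src_root, dst_root, skip_rel):
    """Mirror src_root into dst_root using per-file symlinks.

    The directory at skip_rel (relative to src_root) is not walked;
    it is linked as a single entry instead.
    """
    src_root = os.path.abspath(src_root)
    skip_abs = os.path.normpath(os.path.join(src_root, skip_rel))
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = dst_root if rel == "." else os.path.join(dst_root, rel)
        os.makedirs(dst_dir, exist_ok=True)
        for name in list(dirnames):
            child = os.path.join(dirpath, name)
            if child == skip_abs:
                # Prune in place so os.walk never descends into it,
                # then link the whole directory as one entry.
                dirnames.remove(name)
                os.symlink(child, os.path.join(dst_dir, name))
        for name in filenames:
            os.symlink(os.path.join(dirpath, name), os.path.join(dst_dir, name))


# Demo on a throwaway tree (paths are illustrative only).
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "pufferlib")
dst = os.path.join(tmp, "isolated")
binaries_rel = os.path.join("resources", "drive", "binaries")
os.makedirs(os.path.join(src, binaries_rel))
open(os.path.join(src, "utils.py"), "w").close()
open(os.path.join(src, binaries_rel, "map_000.bin"), "w").close()

symlink_tree(src, dst, binaries_rel)
```

Pruning `dirnames` only works with the default top-down walk, which is what makes the 60K+ map files cost a single `symlink` call instead of one call per file.
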
After fork, child processes inherit g_map_cache pointers from
the parent. Calling free_shared_map_data on these corrupts the
heap since the memory belongs to the parent's address space.
Track the creating PID and skip freeing if PID doesn't match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds NULL checks and bounds checks with stderr output to identify
the root cause of worker segfaults after fork.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ocess

The bug: each PufferDrive.__init__ calls binding.shared() which freed
and rebuilt g_map_cache. When multiple PufferDrive instances exist in
one process (Serial workers), earlier instances' Drive structs had
shared_map pointers to freed SharedMapData, causing use-after-free
crashes in checkNeighbors/compute_agent_metrics.

Fix: only rebuild the cache after fork (PID mismatch) or on first call.
Same-process calls reuse the existing cache, which is correct since the
map data doesn't change — only agent-to-map assignment varies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 18, 2026 01:20
@eugenevinitsky eugenevinitsky changed the base branch from main to 3.0 March 18, 2026 01:22
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR is a broad “infrastructure + tooling” update centered on reducing memory usage and improving robustness when running multiple PufferDrive instances (notably via a shared map cache), while also adding cluster submission utilities and model export/rendering helpers.

Changes:

  • Introduces a module-level shared map cache in the Drive C binding and wires cache release into the Python env lifecycle.
  • Adds SLURM/Submitit cluster launch scripts + container setup helpers, and updates docs to match.
  • Expands WOSAC evaluation plumbing/metrics and adds ONNX / .bin export utilities.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| visualize.dSYM/Contents/Resources/Relocations/aarch64/visualize.yml | Adds macOS dSYM relocation metadata for visualize (build artifact). |
| visualize.dSYM/Contents/Info.plist | Adds macOS dSYM plist for visualize (build artifact). |
| setup.py | Adds tqdm to training-related install requirements. |
| scripts/verify_onnx.py | New script to sanity-check ONNX model structure and inference with dummy inputs. |
| scripts/train_sanity.sh | Removes legacy training helper script. |
| scripts/train_procgen.sh | Removes legacy training helper script. |
| scripts/train_ocean.sh | Removes legacy training helper script. |
| scripts/train_atari.sh | Removes legacy training helper script. |
| scripts/sweep_atari.sh | Removes legacy sweep helper script. |
| scripts/submit_cluster.py | New Submitit-based SLURM launcher with code isolation + optional Singularity wrapping. |
| scripts/setup_container.sh | New helper to create/install/rebuild a Singularity overlay for cluster runs. |
| scripts/export_onnx.py | New utility to export a trained policy checkpoint to ONNX + verify with ORT. |
| scripts/export_model_bin.py | New utility to export policy weights to a flat .bin for the C backend. |
| scripts/cluster_status.sh | New helper to summarize SLURM partition/node availability + user jobs. |
| scripts/cluster_configs/train_base.yaml | Adds a baseline cluster training YAML for program args. |
| scripts/cluster_configs/nyu_greene.yaml | Adds a baseline NYU Greene compute YAML for SLURM resources. |
| scripts/build_simple.sh | Removes a generic C build helper script. |
| pyproject.toml | Adds a cluster extra (submitit/pyyaml). |
| pufferlib/utils.py | Updates WOSAC subprocess invocation args and refactors/extends video rendering pipeline (incl. async support hooks). |
| pufferlib/pufferl.py | Adds async rendering support, truncation handling changes, logger init changes for distributed, adds render CLI mode. |
| pufferlib/ocean/torch.py | Switches Drive model's ego dimension source to env.ego_features. |
| pufferlib/ocean/env_config.h | Extends Drive env init config parsing with many new fields (reward conditioning/randomization, spawn settings, etc.). |
| pufferlib/ocean/env_binding.h | Updates trajectory extraction bindings (scenario IDs as strings, adds vehicle/track-to-predict flags). |
| pufferlib/ocean/drive/visualize.c | Updates visualization to use new agent structures/config parsing and adds more CLI/config integration. |
| pufferlib/ocean/drive/error.h | Extends error enum/strings and changes error formatting. |
| pufferlib/ocean/drive/drivenet.h | Updates drivenet init signature and adjusts road feature layout handling. |
| pufferlib/ocean/drive/drive.py | Large env API/config expansion, resampling refactor, scenario-id propagation, and cache release on close. |
| pufferlib/ocean/drive/drive.c | Refactors demo/CLI parsing and integrates INI config parsing for the C demo entrypoint. |
| pufferlib/ocean/drive/datatypes.h | New shared constants/types/helpers for reward conditioning, road/agent types, and cleanup helpers. |
| pufferlib/ocean/drive/binding.c | Implements shared map cache + new shared-map selection logic and extends env init kwargs handling. |
| pufferlib/ocean/benchmark/wosac.ini | Updates WOSAC metric weights for 2024-vs-2025 differences. |
| pufferlib/ocean/benchmark/visual_sanity_check.py | Adjusts WOSAC visual sanity check config wiring. |
| pufferlib/ocean/benchmark/metrics_sanity_check.py | Switches sanity check to use a random baseline rollout path. |
| pufferlib/ocean/benchmark/evaluator.py | Adds batched evaluation loop with progress reporting + changes how likelihood/meta-metrics are aggregated. |
| pufferlib/ocean/benchmark/evaluate_imported_trajectories.py | Replaces KDTree alignment with (scenario_id, agent_id) alignment. |
| pufferlib/config/ocean/drive.ini | Major config expansion (reward conditioning/randomization, spawn settings, render config, WOSAC batch params, etc.). |
| drive.dSYM/Contents/Resources/Relocations/aarch64/drive.yml | Adds macOS dSYM relocation metadata for drive (build artifact). |
| drive.dSYM/Contents/Info.plist | Adds macOS dSYM plist for drive (build artifact). |
| docs/theme/extra.css | Adjusts table styling (removes alternating row backgrounds). |
| docs/src/wosac.md | Updates baseline table/results description and adds clarification of baselines. |
| docs/src/visualizer.md | Expands docs for rendering mode + CLI flags and puffer render. |
| docs/src/train.md | Reformats and updates training docs content. |
| docs/src/simulator.md | Clarifies control modes and adds important notes about expert/static agent behavior. |
| docs/src/pufferdrive-2.0.md | Updates author list + citation block. |
| docs/src/interact-with-agents.md | Adds CLI argument documentation for drive tool. |
| docs/src/export-onnx.md | New documentation for ONNX export and .bin weight export. |
| docs/src/data.md | Updates troubleshooting message for missing .bin maps. |
| docs/src/cluster.md | New end-to-end documentation for SLURM + container-based cluster training. |
| docs/src/SUMMARY.md | Adds new docs pages (cluster + ONNX export) to nav. |
| data_utils/carla/generate_carla_agents.py | Fixes heading/velocity computation and changes defaults/logging for dataset generation. |
| README.md | Adds CI badge and updates citation author list. |
| CLAUDE.local.md | Adds local cluster notes (developer-specific ops doc). |
| .github/workflows/utest.yml | Changes CI triggers from main to 2.0. |
| .github/workflows/train-ci.yml | Changes CI triggers from main to 2.0. |
| .github/workflows/render-ci.yml | Changes CI triggers from main to 2.0. |
| .github/workflows/perf-ci.yml | Changes CI triggers from main to 2.0. |
| .github/workflows/docs.yml | Changes docs deploy branch from main to 2.0. |

Comments suppressed due to low confidence (2)

pufferlib/ocean/drive/binding.c:335

  • total_agent_count is always clamped to num_agents even when use_all_maps is true. In use_all_maps mode the returned agent_offsets[-1] should reflect the full active-agent count across all maps; clamping will silently truncate agents and produce incorrect offsets/map_ids. Restore the previous !use_all_maps guard or pass an explicit limit variable for the non-use_all_maps case.
        total_agent_count = num_agents;
    }

scripts/submit_cluster.py:231

  • cpus_per_task is computed with integer division: from_config.get('cpus', 8) // args.task_per_node. If task_per_node exceeds cpus, this becomes 0 and SLURM will reject the job; if it's not divisible, you may under-allocate CPUs. Clamp to at least 1 and/or validate that cpus >= task_per_node (and maybe require divisibility) with a clear error message.
        slurm_account=from_config.get("account"),
        slurm_partition=from_config.get("partition"),
        cpus_per_task=from_config.get("cpus", 8) // args.task_per_node,
        tasks_per_node=args.task_per_node,
        nodes=from_config.get("nodes", 1),
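
One way to act on this suggestion is a small validation helper, sketched below. The function name `compute_cpus_per_task` is hypothetical and does not exist in scripts/submit_cluster.py; requiring exact divisibility is the stricter of the two options the comment mentions and is a design choice, not a requirement of SLURM.

```python
def compute_cpus_per_task(cpus, tasks_per_node):
    """Split a node's CPU budget evenly across tasks, or fail loudly."""
    if tasks_per_node < 1:
        raise ValueError(f"tasks_per_node must be >= 1, got {tasks_per_node}")
    if cpus < tasks_per_node:
        # Integer division would yield 0 here, which SLURM rejects.
        raise ValueError(
            f"cpus ({cpus}) must be >= tasks_per_node ({tasks_per_node})"
        )
    if cpus % tasks_per_node != 0:
        # Uneven splits would silently leave some CPUs unallocated.
        raise ValueError(
            f"cpus ({cpus}) is not divisible by tasks_per_node ({tasks_per_node})"
        )
    return cpus // tasks_per_node


def rejects(cpus, tasks_per_node):
    """Return True if the helper raises ValueError for these inputs."""
    try:
        compute_cpus_per_task(cpus, tasks_per_node)
    except ValueError:
        return True
    return False


even_split = compute_cpus_per_task(8, 2)
```

The call site would then pass the validated value as `cpus_per_task` instead of the raw `//` expression, so a bad config fails before submission with a readable message.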
