WIP: shared map cache by eugenevinitsky · Pull Request #346 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-03-18T01:20:58Z

Summary

Adds shared map infrastructure that reduces memory usage by caching and reusing map data (road elements, grid maps, neighbor caches) across PufferDrive instances.
Fixes a use-after-free segfault that occurred when multiple PufferDrive instances in the same worker process shared a map cache that was rebuilt.
Skips the binaries directory during code isolation to avoid slow NFS walks on the cluster.

Test plan

Verified fix at scale: 96 envs × 16 workers × 1024 agents on CARLA 3D maps (32GB, ~170K SPS)
Verify render works at scale (job running)
Full production training run

HEAVILY LLM generated, just a WIP

Add a Flag to build_ocean so Raylib can work on Debian 11

Environments sharing the same map binary now share read-only road elements, grid maps, and neighbor caches via a reference-counted SharedMapData cache. This eliminates duplicate allocations (~36-73 MB per CARLA map) across envs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
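The reference-counting idea can be illustrated with a small Python model. This is only a sketch of the pattern, not the C implementation in binding.c: the `MapCache` class, its `acquire`/`release` methods, the `road_elements` field, and the loader callable are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SharedMapData:
    """Stand-in for the read-only data shared per map binary (hypothetical fields)."""
    path: str
    road_elements: tuple = ()
    refcount: int = 0


class MapCache:
    """Toy reference-counted cache keyed by map path (illustrative only)."""

    def __init__(self, loader):
        self._loader = loader   # callable: path -> tuple of road elements
        self._entries = {}      # path -> SharedMapData

    def acquire(self, path):
        entry = self._entries.get(path)
        if entry is None:
            # First user of this map: load once and keep it in the cache.
            entry = SharedMapData(path=path, road_elements=self._loader(path))
            self._entries[path] = entry
        entry.refcount += 1
        return entry

    def release(self, entry):
        entry.refcount -= 1
        if entry.refcount == 0:
            # Last user is gone: drop the cached copy.
            del self._entries[entry.path]

    def __len__(self):
        return len(self._entries)


load_calls = []


def fake_loader(path):
    # Records each load so the demo can show that loading happens once.
    load_calls.append(path)
    return ("lane", "crosswalk")


map_cache = MapCache(fake_loader)
env_a_map = map_cache.acquire("town01.bin")
env_b_map = map_cache.acquire("town01.bin")
```

Both environments receive the same object, so the map is loaded once no matter how many environments use it; the entry is dropped only after every holder has called `release`.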

Walk pufferlib/ in Python and create per-file symlinks, skipping resources/drive/binaries (60K+ map files). Symlink that dir as a single entry instead. Removes dependency on rsync/cp -rs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
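A minimal sketch of such a walk, assuming a POSIX filesystem where symlinks are available. The function name `link_tree` and its parameters are hypothetical and are not taken from scripts/submit_cluster.py; the demo uses a throwaway directory tree rather than the real pufferlib/ layout.

```python
import os
import tempfile


def link_tree(src_root, dst_root, skip_rel):
    """Mirror src_root into dst_root using per-file symlinks.

    The directory at skip_rel (relative to src_root) is never walked;
    it is linked as a single entry instead.
    """
    src_root = os.path.abspath(src_root)
    skip_abs = os.path.normpath(os.path.join(src_root, skip_rel))
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)
        for name in list(dirnames):
            if os.path.normpath(os.path.join(dirpath, name)) == skip_abs:
                # Link the large directory once and prune it from the walk.
                os.symlink(skip_abs, os.path.join(dst_dir, name))
                dirnames.remove(name)
        for name in filenames:
            os.symlink(os.path.join(dirpath, name), os.path.join(dst_dir, name))


# Demo on a throwaway tree; "binaries" stands in for the large map directory.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "src"), os.path.join(tmp, "dst")
    skip = os.path.join("resources", "binaries")
    os.makedirs(os.path.join(src, skip))
    os.makedirs(os.path.join(src, "ocean"))
    for rel_file in (os.path.join("ocean", "env.py"), os.path.join(skip, "map_000.bin")):
        open(os.path.join(src, rel_file), "w").close()
    link_tree(src, dst, skip)
    file_is_link = os.path.islink(os.path.join(dst, "ocean", "env.py"))
    skipped_is_link = os.path.islink(os.path.join(dst, skip))
    parent_is_real_dir = not os.path.islink(os.path.join(dst, "resources"))
```

Removing the entry from `dirnames` in place is what stops `os.walk` from descending into the skipped directory, so the cost of the walk no longer depends on how many files it contains.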

After fork, child processes inherit g_map_cache pointers from the parent. Calling free_shared_map_data on these corrupts the heap since the memory belongs to the parent's address space. Track the creating PID and skip freeing if PID doesn't match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
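The PID guard can be modelled in a few lines of Python. This is a conceptual sketch only: Python has no manual free, so a `freed` flag stands in for the call to free_shared_map_data, and the `ForkAwareCache` class and its injectable `getpid` parameter are hypothetical devices that let the fork be simulated without actually forking.

```python
import os


class ForkAwareCache:
    """Toy cache that remembers which process created it (illustrative only)."""

    def __init__(self, getpid=os.getpid):
        self._getpid = getpid
        self.owner_pid = getpid()          # PID of the creating process
        self.entries = {"town01.bin": object()}
        self.freed = False                 # stands in for an actual free()

    def release_all(self):
        if self._getpid() != self.owner_pid:
            # Inherited across fork: forget the entries without freeing them.
            self.entries = {}
            return False
        self.entries.clear()
        self.freed = True
        return True


simulated_pid = {"value": 100}
inherited = ForkAwareCache(getpid=lambda: simulated_pid["value"])
simulated_pid["value"] = 200               # pretend we are now a forked child
child_freed = inherited.release_all()

owned = ForkAwareCache(getpid=lambda: 100)
parent_freed = owned.release_all()
```

In the simulated child the release is skipped, while the creating process frees as usual.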

Adds NULL checks and bounds checks with stderr output to identify the root cause of worker segfaults after fork.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ocess

The bug: each PufferDrive.__init__ calls binding.shared(), which freed and rebuilt g_map_cache. When multiple PufferDrive instances exist in one process (Serial workers), earlier instances' Drive structs had shared_map pointers to freed SharedMapData, causing use-after-free crashes in checkNeighbors/compute_agent_metrics.

Fix: only rebuild the cache after fork (PID mismatch) or on first call. Same-process calls reuse the existing cache, which is correct since the map data doesn't change; only the agent-to-map assignment varies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
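The rebuild policy described in this commit message can be sketched as follows. This is a Python model of the logic, not the C code in binding.c: `get_shared_cache`, the module-level `_cache`/`_cache_pid` variables, and the injectable `getpid` argument are hypothetical names used so the fork can be simulated.

```python
import os

_cache = None       # module-level cache, standing in for g_map_cache
_cache_pid = None   # PID of the process that built it


def get_shared_cache(build, getpid=os.getpid):
    """Return the cache, building it only on first use or after a fork."""
    global _cache, _cache_pid
    pid = getpid()
    if _cache is None or _cache_pid != pid:
        # First call in this process, or the cache was inherited from a parent.
        _cache = build()
        _cache_pid = pid
    return _cache


build_calls = []


def build_cache():
    build_calls.append(1)
    return {"build_number": len(build_calls)}


fake_pid = {"value": 100}
first = get_shared_cache(build_cache, getpid=lambda: fake_pid["value"])
second = get_shared_cache(build_cache, getpid=lambda: fake_pid["value"])
fake_pid["value"] = 200   # simulate running in a forked worker
third = get_shared_cache(build_cache, getpid=lambda: fake_pid["value"])
```

Because the second same-process call returns the existing object, pointers handed out by the first call stay valid, which is the property the fix relies on.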

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR is a broad “infrastructure + tooling” update centered on reducing memory usage and improving robustness when running multiple PufferDrive instances (notably via a shared map cache), while also adding cluster submission utilities and model export/rendering helpers.

Changes:

Introduces a module-level shared map cache in the Drive C binding and wires cache release into the Python env lifecycle.
Adds SLURM/Submitit cluster launch scripts + container setup helpers, and updates docs to match.
Expands WOSAC evaluation plumbing/metrics and adds ONNX / .bin export utilities.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
visualize.dSYM/Contents/Resources/Relocations/aarch64/visualize.yml	Adds macOS dSYM relocation metadata for `visualize` (build artifact).
visualize.dSYM/Contents/Info.plist	Adds macOS dSYM plist for `visualize` (build artifact).
setup.py	Adds `tqdm` to training-related install requirements.
scripts/verify_onnx.py	New script to sanity-check ONNX model structure and inference with dummy inputs.
scripts/train_sanity.sh	Removes legacy training helper script.
scripts/train_procgen.sh	Removes legacy training helper script.
scripts/train_ocean.sh	Removes legacy training helper script.
scripts/train_atari.sh	Removes legacy training helper script.
scripts/sweep_atari.sh	Removes legacy sweep helper script.
scripts/submit_cluster.py	New Submitit-based SLURM launcher with code isolation + optional Singularity wrapping.
scripts/setup_container.sh	New helper to create/install/rebuild a Singularity overlay for cluster runs.
scripts/export_onnx.py	New utility to export a trained policy checkpoint to ONNX + verify with ORT.
scripts/export_model_bin.py	New utility to export policy weights to a flat `.bin` for the C backend.
scripts/cluster_status.sh	New helper to summarize SLURM partition/node availability + user jobs.
scripts/cluster_configs/train_base.yaml	Adds a baseline cluster training YAML for program args.
scripts/cluster_configs/nyu_greene.yaml	Adds a baseline NYU Greene compute YAML for SLURM resources.
scripts/build_simple.sh	Removes a generic C build helper script.
pyproject.toml	Adds a `cluster` extra (submitit/pyyaml).
pufferlib/utils.py	Updates WOSAC subprocess invocation args and refactors/extends video rendering pipeline (incl. async support hooks).
pufferlib/pufferl.py	Adds async rendering support, truncation handling changes, logger init changes for distributed, adds `render` CLI mode.
pufferlib/ocean/torch.py	Switches Drive model’s ego dimension source to `env.ego_features`.
pufferlib/ocean/env_config.h	Extends Drive env init config parsing with many new fields (reward conditioning/randomization, spawn settings, etc.).
pufferlib/ocean/env_binding.h	Updates trajectory extraction bindings (scenario IDs as strings, adds vehicle/track-to-predict flags).
pufferlib/ocean/drive/visualize.c	Updates visualization to use new agent structures/config parsing and adds more CLI/config integration.
pufferlib/ocean/drive/error.h	Extends error enum/strings and changes error formatting.
pufferlib/ocean/drive/drivenet.h	Updates drivenet init signature and adjusts road feature layout handling.
pufferlib/ocean/drive/drive.py	Large env API/config expansion, resampling refactor, scenario-id propagation, and cache release on close.
pufferlib/ocean/drive/drive.c	Refactors demo/CLI parsing and integrates INI config parsing for the C demo entrypoint.
pufferlib/ocean/drive/datatypes.h	New shared constants/types/helpers for reward conditioning, road/agent types, and cleanup helpers.
pufferlib/ocean/drive/binding.c	Implements shared map cache + new shared-map selection logic and extends env init kwargs handling.
pufferlib/ocean/benchmark/wosac.ini	Updates WOSAC metric weights for 2024-vs-2025 differences.
pufferlib/ocean/benchmark/visual_sanity_check.py	Adjusts WOSAC visual sanity check config wiring.
pufferlib/ocean/benchmark/metrics_sanity_check.py	Switches sanity check to use a random baseline rollout path.
pufferlib/ocean/benchmark/evaluator.py	Adds batched evaluation loop with progress reporting + changes how likelihood/meta-metrics are aggregated.
pufferlib/ocean/benchmark/evaluate_imported_trajectories.py	Replaces KDTree alignment with (scenario_id, agent_id) alignment.
pufferlib/config/ocean/drive.ini	Major config expansion (reward conditioning/randomization, spawn settings, render config, WOSAC batch params, etc.).
drive.dSYM/Contents/Resources/Relocations/aarch64/drive.yml	Adds macOS dSYM relocation metadata for `drive` (build artifact).
drive.dSYM/Contents/Info.plist	Adds macOS dSYM plist for `drive` (build artifact).
docs/theme/extra.css	Adjusts table styling (removes alternating row backgrounds).
docs/src/wosac.md	Updates baseline table/results description and adds clarification of baselines.
docs/src/visualizer.md	Expands docs for rendering mode + CLI flags and `puffer render`.
docs/src/train.md	Reformats and updates training docs content.
docs/src/simulator.md	Clarifies control modes and adds important notes about expert/static agent behavior.
docs/src/pufferdrive-2.0.md	Updates author list + citation block.
docs/src/interact-with-agents.md	Adds CLI argument documentation for `drive` tool.
docs/src/export-onnx.md	New documentation for ONNX export and `.bin` weight export.
docs/src/data.md	Updates troubleshooting message for missing `.bin` maps.
docs/src/cluster.md	New end-to-end documentation for SLURM + container-based cluster training.
docs/src/SUMMARY.md	Adds new docs pages (cluster + ONNX export) to nav.
data_utils/carla/generate_carla_agents.py	Fixes heading/velocity computation and changes defaults/logging for dataset generation.
README.md	Adds CI badge and updates citation author list.
CLAUDE.local.md	Adds local cluster notes (developer-specific ops doc).
.github/workflows/utest.yml	Changes CI triggers from `main` to `2.0`.
.github/workflows/train-ci.yml	Changes CI triggers from `main` to `2.0`.
.github/workflows/render-ci.yml	Changes CI triggers from `main` to `2.0`.
.github/workflows/perf-ci.yml	Changes CI triggers from `main` to `2.0`.
.github/workflows/docs.yml	Changes docs deploy branch from `main` to `2.0`.

Comments suppressed due to low confidence (2)

pufferlib/ocean/drive/binding.c:335

total_agent_count is always clamped to num_agents even when use_all_maps is true. In use_all_maps mode the returned agent_offsets[-1] should reflect the full active-agent count across all maps; clamping will silently truncate agents and produce incorrect offsets/map_ids. Restore the previous !use_all_maps guard or pass an explicit limit variable for the non-use_all_maps case.

        total_agent_count = num_agents;
    }

scripts/submit_cluster.py:231

cpus_per_task is computed with integer division: from_config.get('cpus', 8) // args.task_per_node. If task_per_node exceeds cpus, this becomes 0 and SLURM will reject the job; if it's not divisible, you may under-allocate CPUs. Clamp to at least 1 and/or validate that cpus >= task_per_node (and maybe require divisibility) with a clear error message.

        slurm_account=from_config.get("account"),
        slurm_partition=from_config.get("partition"),
        cpus_per_task=from_config.get("cpus", 8) // args.task_per_node,
        tasks_per_node=args.task_per_node,
        nodes=from_config.get("nodes", 1),
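One way to apply the suggested validation is sketched below. The helper `cpus_per_task` and its `require_divisible` flag are hypothetical and do not exist in scripts/submit_cluster.py; the sketch only shows the checks the comment asks for.

```python
def cpus_per_task(total_cpus, tasks_per_node, require_divisible=False):
    """Split a node's CPU budget across tasks, rejecting invalid splits."""
    if tasks_per_node < 1:
        raise ValueError("tasks_per_node must be at least 1")
    if total_cpus < tasks_per_node:
        # Integer division would yield 0 here, which SLURM rejects.
        raise ValueError(
            f"cpus ({total_cpus}) must be >= tasks_per_node ({tasks_per_node})"
        )
    if require_divisible and total_cpus % tasks_per_node != 0:
        raise ValueError(
            f"cpus ({total_cpus}) is not divisible by tasks_per_node ({tasks_per_node})"
        )
    return total_cpus // tasks_per_node


even_split = cpus_per_task(8, 2)

try:
    cpus_per_task(2, 4)
    too_many_tasks_rejected = False
except ValueError:
    too_many_tasks_rejected = True
```

Failing early with a clear message surfaces the misconfiguration at submission time instead of as a scheduler rejection.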


WaelDLZ and others added 9 commits January 16, 2026 18:23

Add a Flag to build_ocean so Raylib can work on Debian 11

ced67e2

Merge pull request #263 from Emerge-Lab/wbd/debian_issue

77898d1

Add a Flag to build_ocean so Raylib can work on Debian 11

Merge remote-tracking branch 'origin/main' into ev/reuse_maps

b9a76d4

Merge remote-tracking branch 'origin/3.0' into ev/reuse_maps

275e45c

Add debug prints to trace segfault in move_dynamics/c_step

4d78b82

Adds NULL checks and bounds checks with stderr output to identify the root cause of worker segfaults after fork. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings March 18, 2026 01:20

Copilot started reviewing on behalf of eugenevinitsky March 18, 2026 01:21

eugenevinitsky changed the base branch from main to 3.0 March 18, 2026 01:22

Apply clang-format to C files

c2fa8ab

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI reviewed Mar 18, 2026

eugenevinitsky wants to merge 10 commits into 3.0 from ev/reuse_maps


3 participants
