
Draft: merge any model calc one block scores #994

Merged
danielkorzekwa merged 43 commits into feature/puzzletron from dkorzekwa/any_model_calc_one_block_scores
Mar 12, 2026

Conversation


@danielkorzekwa danielkorzekwa commented Mar 6, 2026

What does this PR do?

Merging dkorzekwa/any_model_calc_one_block_scores into dkorzekwa/anymodel_build_library_and_stats - this MR is for review only. Ultimately, dkorzekwa/any_model_calc_one_block_scores should be merged into feature/puzzletron once dkorzekwa/anymodel_build_library_and_stats is merged there.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enabled distributed scoring step in puzzletron workflow for improved performance.
  • Improvements

    • Extended model loading support to accommodate additional model formats and configurations.
    • Optimized GPU memory management during validation with improved cache synchronization.
    • Streamlined validation function signatures for easier usage.

- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
…s merged)

@danielkorzekwa danielkorzekwa requested a review from a team as a code owner March 6, 2026 14:00
@danielkorzekwa danielkorzekwa changed the title from "Dkorzekwa/any model calc one block scores" to "Draft: merge any model calc one block scores" Mar 6, 2026
- ``replacement_library_path`` (Path): Path to the replacement library JSON file.
- ``solutions_path`` (Path): Path to puzzle solutions JSON file or directory containing solution files.
- ``solutions_to_validate`` (list[int], optional): Indices of specific solutions to validate.
Validates all solutions if None.
A collaborator commented:

FYI, I think this reverts previously fixed docstrings. If you see doc-building failures, this might be the cause.


coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

The changes introduce descriptor-aware model handling throughout the puzzletron pipeline, including new AnyModel-based loading paths, distributed checkpoint loading via load_and_shard_model, and corresponding propagation of descriptors across validation and checkpoint utilities. The distributed scoring step is also enabled in the main workflow.

Changes

  • Main Puzzletron Scoring (modelopt/torch/puzzletron/puzzletron.py): Uncomments and activates the distributed scoring step (Step 5) by removing comment markers from the scoring.launch_scoring(hydra_cfg) call.
  • ReplacementLibrary Core Refactoring (modelopt/torch/puzzletron/replacement_library/replacement_library.py): Introduces a descriptor-based initialization parameter; replaces direct model_config_overrides with immutabledict-wrapped storage; adds AnyModel-based loading paths with new methods for checkpoint handling, weight indexing, and symlink preparation; refactors load_model and load_checkpoint to return PreTrainedModel instead of DeciLM-specific classes; updates state-dict assembly and model-configuration handling with deepcopy and block-config updates; expands imports (copy, tempfile, safetensors, Converter, etc.).
  • Validation Pipeline Integration (modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py): Integrates ModelDescriptorFactory and a descriptor-aware workflow; replaces direct checkpoint loading with load_and_shard_model for distributed/sharded loading; propagates the descriptor through perform_pipeline_stitches, save_checkpoint, and validation steps; updates ReplacementLibrary initialization to accept descriptor and model_config_overrides; adds GPU memory management (CPU transfer and CUDA cache synchronization); introduces dynamic model_config.dtype assignment via getattr; updates multiple function signatures and adds imports for Converter and ModelDescriptorFactory.
  • Validation Utilities (modelopt/torch/puzzletron/tools/validation_utils.py): Expands typing imports to include Optional and Union; updates the extra_payload parameter type from dict[str, Any] | None to Optional[dict[str, Any]] in two public functions; removes the pipeline_parallel parameter from the validate_model_and_extract_hidden_states and validate_model_with_teacher_similarity_metrics function signatures and call sites.

Sequence Diagram(s)

sequenceDiagram
    participant Main as Validation Main
    participant MDF as ModelDescriptorFactory
    participant RL as ReplacementLibrary
    participant LSM as load_and_shard_model
    participant PS as perform_pipeline_stitches
    participant SC as save_checkpoint

    Main->>MDF: get(args.descriptor)
    MDF-->>Main: descriptor
    
    Main->>LSM: load_and_shard_model(teacher_path, descriptor)
    LSM-->>Main: teacher_model (sharded)
    
    Main->>RL: __init__(puzzle_path, descriptor, model_config_overrides)
    RL-->>Main: replacement_library instance
    
    Main->>RL: load_model(layer_replacements)
    RL-->>Main: solution_model (PreTrainedModel)
    
    Main->>PS: perform_pipeline_stitches(model, descriptor)
    PS-->>Main: stitched_model
    
    Main->>SC: save_checkpoint(model, checkpoint_dir, descriptor)
    SC-->>Main: checkpoint saved
    
    Main->>Main: free GPU memory (CPU transfer + CUDA sync)
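The call order above can be sketched as a minimal stub; every name below (Descriptor, ModelDescriptorFactory, ReplacementLibrary, load_and_shard_model) is a hypothetical stand-in for the real puzzletron API, shown only to make the flow concrete:

```python
# Minimal, self-contained sketch of the descriptor-aware validation flow.
# All classes and functions are hypothetical stand-ins, not the real API.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Descriptor:
    """Identifies a model family so loaders can pick the right code path."""
    family: str

class ModelDescriptorFactory:
    _registry = {"llama": Descriptor("llama")}

    @classmethod
    def get(cls, name: str) -> Descriptor:
        return cls._registry[name]

def load_and_shard_model(path: str, descriptor: Descriptor) -> str:
    # The real function loads a distributed/sharded checkpoint; this stub
    # just records which descriptor drove the load.
    return f"{descriptor.family}-model@{path}"

@dataclass
class ReplacementLibrary:
    puzzle_path: str
    descriptor: Descriptor
    model_config_overrides: dict = field(default_factory=dict)

    def load_model(self, layer_replacements: list) -> str:
        return f"solution-model({len(layer_replacements)} replacements)"

def run_validation(descriptor_name, teacher_dir, puzzle_dir, layer_replacements):
    # Resolve the descriptor once, then thread it through every loading step.
    descriptor = ModelDescriptorFactory.get(descriptor_name)
    teacher = load_and_shard_model(teacher_dir, descriptor)
    library = ReplacementLibrary(puzzle_dir, descriptor)
    student = library.load_model(layer_replacements)
    return teacher, student

teacher, student = run_validation("llama", "/ckpt/teacher", "/puzzle", [0, 1])
print(teacher)   # llama-model@/ckpt/teacher
print(student)   # solution-model(2 replacements)
```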

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

  • Security Anti-Patterns (❌ Error): The pull request introduces critical security violations: hardcoded trust_remote_code=True without a configurable parameter, multiple # nosec comments that bypass Bandit checks, and torch.load() calls with weights_only=False lacking justification. Resolution: make trust_remote_code configurable with a default of False, remove the # nosec comments, and add inline security justifications for the torch.load() calls with weights_only=False.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 20.00%, below the required 80.00% threshold. Write docstrings for the functions missing them.
  • Title check (❓ Inconclusive): The title mentions 'any model calc one block scores', which aligns with the PR objectives, but it is vague and carries the 'Draft' prefix, making it unclear as a final commit summary. Replace it with a more descriptive title, such as 'Enable distributed scoring and descriptor-based model loading in puzzletron', once the PR leaves draft status.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.

Base automatically changed from dkorzekwa/anymodel_build_library_and_stats to feature/puzzletron March 12, 2026 16:51
@danielkorzekwa danielkorzekwa requested a review from a team as a code owner March 12, 2026 16:51
…lock_scores

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py (1)

266-277: ⚠️ Potential issue | 🔴 Critical

Hardcoded trust_remote_code=True is a security risk.

The coding guidelines explicitly prohibit hardcoding trust_remote_code=True as it "executes arbitrary Python shipped with a checkpoint and creates an RCE vector." The tokenizer loading should expose this as a caller-configurable parameter defaulting to False.

🔒 Proposed fix to make trust_remote_code configurable
 def _load_tokenizer(args: DictConfig) -> PreTrainedTokenizerBase:
     tokenizer = None
+    trust_remote_code = getattr(args, "trust_remote_code", False)
     if (tokenizer_name := getattr(args, "tokenizer_name", None)) is not None:
-        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
+        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=trust_remote_code)
     elif args.teacher_dir is not None:
         try:
-            tokenizer = AutoTokenizer.from_pretrained(args.teacher_dir, trust_remote_code=True)
+            tokenizer = AutoTokenizer.from_pretrained(args.teacher_dir, trust_remote_code=trust_remote_code)
         except:
             pass
     if tokenizer is None:
         warnings.warn("Couldn't find a tokenizer, trying to continue without one")
     return tokenizer

As per coding guidelines: "Do not hardcode trust_remote_code=True when loading transformers models. Let the caller decide via a parameter with default value False, as this flag executes arbitrary Python shipped with a checkpoint and creates an RCE vector."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py`
around lines 266 - 277, The _load_tokenizer function currently hardcodes
trust_remote_code=True; change it to read a boolean flag (default False) from
the caller/config and pass that into AutoTokenizer.from_pretrained instead.
Specifically, use getattr(args, "trust_remote_code", False) (or accept an
explicit parameter) and replace both AutoTokenizer.from_pretrained(...,
trust_remote_code=True) calls with trust_remote_code=trust_remote_code so the
caller can opt in; keep the existing tokenizer_name and teacher_dir fallbacks
and the try/except behavior unchanged.
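The pattern behind the proposed fix, reading an optional boolean from the config via getattr with a safe default, can be sketched standalone. SimpleNamespace here is a stand-in for the Hydra DictConfig args object:

```python
from types import SimpleNamespace

def resolve_trust_remote_code(args) -> bool:
    # Default to False: code shipped inside a checkpoint should only run
    # on explicit opt-in, since trust_remote_code=True is an RCE vector.
    return bool(getattr(args, "trust_remote_code", False))

# A config without the flag falls back to the safe default.
assert resolve_trust_remote_code(SimpleNamespace(teacher_dir="/ckpt")) is False
# Explicit opt-in is honored.
assert resolve_trust_remote_code(SimpleNamespace(trust_remote_code=True)) is True
```

The resolved value would then be passed to both AutoTokenizer.from_pretrained call sites in place of the hardcoded True.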
🧹 Nitpick comments (3)
modelopt/torch/puzzletron/replacement_library/replacement_library.py (1)

366-369: Consider adding weights_only=True for consistency.

While PyTorch >= 2.6 defaults to safe loading, the codebase elsewhere (line 257) explicitly uses weights_only=True. Consider adding it here for consistency and defense-in-depth.

♻️ Suggested change
         non_block_pth_path = checkpoint_dir / PTH_SUBBLOCKS_DIR_NAME / f"non_block.pth"
         assert non_block_pth_path.exists(), _error_message_ensure_split(checkpoint_dir)
-        non_block_state_dict = torch.load(non_block_pth_path)
+        non_block_state_dict = torch.load(non_block_pth_path, weights_only=True)
         return non_block_state_dict[param_name]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/replacement_library/replacement_library.py` around
lines 366 - 369, The torch.load call that loads non_block.pth (using
non_block_pth_path) should use weights_only=True for consistency with the other
load at line 257; update the torch.load(non_block_pth_path) invocation to
torch.load(non_block_pth_path, weights_only=True) so the returned
non_block_state_dict[param_name] is loaded with the same safe/weights-only
semantics as elsewhere.
modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py (2)

189-195: Clarify the no-op symlink save path.

The if realizable_as_symlinks block at lines 191-194 contains only a pass statement with a comment that the feature is "not supported." If this path is intentionally disabled, consider removing the dead branch or adding a TODO for future implementation.

♻️ Remove dead code or add TODO
             model_config.dtype = getattr(args, "model_dtype", "torch.bfloat16")
             Converter.copy_checkpoint_files(args.teacher_dir, checkpoint_dir)
-            if realizable_as_symlinks:
-                if dist.is_master():
-                    # save_checkpoint_as_symlinks is currently not supported
-                    pass
+            # TODO: Re-enable save_checkpoint_as_symlinks when descriptor-aware symlink saving is implemented
             save_checkpoint(model, checkpoint_dir, descriptor)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py`
around lines 189 - 195, The realizable_as_symlinks branch is a dead no-op;
either remove the entire if realizable_as_symlinks: ... pass block or replace
the pass with an explicit TODO and a descriptive log (e.g., using
dist.is_master() to log that save_checkpoint_as_symlinks is not yet implemented)
so the intent is clear; ensure save_checkpoint(model, checkpoint_dir,
descriptor) and Converter.copy_checkpoint_files(args.teacher_dir,
checkpoint_dir) remain unchanged and that any future implementation would call
save_checkpoint_as_symlinks when supported.

176-177: Remove or document the commented-out line.

Line 177 contains a commented-out realizable_as_symlinks = False. This appears to be debug/test code. Either remove it or add a comment explaining why it's kept for future reference.

♻️ Remove commented debug code
         realizable_as_symlinks = can_realize_as_symlinks(layer_replacements)
-        # realizable_as_symlinks = False
         model_config = replacement_library.create_model_config(layer_replacements)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py`
around lines 176 - 177, The commented-out debug assignment
"realizable_as_symlinks = False" should be removed or documented: either delete
that dead code line or replace it with a short clarifying comment explaining why
a hard override would ever be kept (e.g., for manual testing), referencing the
nearby variables and calls (realizable_as_symlinks and can_realize_as_symlinks
in validate_puzzle_with_multi_replacements.py) so future readers know whether
the value must come from can_realize_as_symlinks or is intentionally forced;
make the change in the block where realizable_as_symlinks is assigned.


📥 Commits

Reviewing files that changed from the base of the PR and between 8e827f3 and 58a42ca.

📒 Files selected for processing (4)
  • modelopt/torch/puzzletron/puzzletron.py
  • modelopt/torch/puzzletron/replacement_library/replacement_library.py
  • modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py
  • modelopt/torch/puzzletron/tools/validation_utils.py


codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.12%. Comparing base (8e827f3) to head (58a42ca).
⚠️ Report is 1 commit behind head on feature/puzzletron.

Additional details and impacted files
@@                 Coverage Diff                 @@
##           feature/puzzletron     #994   +/-   ##
===================================================
  Coverage               72.12%   72.12%           
===================================================
  Files                     209      209           
  Lines                   23628    23628           
===================================================
  Hits                    17042    17042           
  Misses                   6586     6586           


@danielkorzekwa danielkorzekwa merged commit eb4b210 into feature/puzzletron Mar 12, 2026
28 checks passed
@danielkorzekwa danielkorzekwa deleted the dkorzekwa/any_model_calc_one_block_scores branch March 12, 2026 23:37
