Draft: merge any model calc one block scores #994
danielkorzekwa merged 43 commits into feature/puzzletron from dkorzekwa/any_model_calc_one_block_scores
Conversation
- Add converter, model_descriptor, puzzformer, and llama model support; selective merge of anymodel functionality. Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
- ``replacement_library_path`` (Path): Path to the replacement library JSON file.
- ``solutions_path`` (Path): Path to a puzzle solutions JSON file or a directory containing solution files.
- ``solutions_to_validate`` (list[int], optional): Indices of specific solutions to validate. Validates all solutions if None.
FYI, I think this reverts previously fixed docstrings. If you see doc-building failures, this might be the cause.
📝 Walkthrough

The changes introduce descriptor-aware model handling throughout the puzzletron pipeline, including new AnyModel-based loading paths, distributed checkpoint loading via load_and_shard_model, and corresponding propagation of descriptors across validation and checkpoint utilities. The distributed scoring step is also enabled in the main workflow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as Validation Main
    participant MDF as ModelDescriptorFactory
    participant RL as ReplacementLibrary
    participant LSM as load_and_shard_model
    participant PS as perform_pipeline_stitches
    participant SC as save_checkpoint
    Main->>MDF: get(args.descriptor)
    MDF-->>Main: descriptor
    Main->>LSM: load_and_shard_model(teacher_path, descriptor)
    LSM-->>Main: teacher_model (sharded)
    Main->>RL: __init__(puzzle_path, descriptor, model_config_overrides)
    RL-->>Main: replacement_library instance
    Main->>RL: load_model(layer_replacements)
    RL-->>Main: solution_model (PreTrainedModel)
    Main->>PS: perform_pipeline_stitches(model, descriptor)
    PS-->>Main: stitched_model
    Main->>SC: save_checkpoint(model, checkpoint_dir, descriptor)
    SC-->>Main: checkpoint saved
    Main->>Main: free GPU memory (CPU transfer + CUDA sync)
```
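The call sequence in the diagram can be sketched as plain Python. All class and function names below mirror the diagram; the bodies are hypothetical stubs standing in for the real puzzletron implementations, shown only to make the control flow concrete.

```python
class ModelDescriptorFactory:
    @staticmethod
    def get(name: str) -> dict:
        # In the real pipeline this returns a model descriptor object.
        return {"name": name}


def load_and_shard_model(teacher_path: str, descriptor: dict) -> str:
    # Stands in for distributed checkpoint loading and sharding.
    return f"sharded:{teacher_path}"


class ReplacementLibrary:
    def __init__(self, puzzle_path: str, descriptor: dict, model_config_overrides=None):
        self.puzzle_path = puzzle_path
        self.descriptor = descriptor

    def load_model(self, layer_replacements: list) -> str:
        # Stands in for assembling a PreTrainedModel from layer replacements.
        return f"solution({len(layer_replacements)} replacements)"


def perform_pipeline_stitches(model: str, descriptor: dict) -> str:
    return f"stitched:{model}"


def save_checkpoint(model: str, checkpoint_dir: str, descriptor: dict) -> str:
    return f"{checkpoint_dir}/{model}"


# Same order as the diagram: descriptor, teacher, library, solution, stitch, save.
descriptor = ModelDescriptorFactory.get("llama")
teacher_model = load_and_shard_model("teacher_dir", descriptor)
library = ReplacementLibrary("puzzle_dir", descriptor)
solution_model = library.load_model(layer_replacements=[{"layer": 0}])
stitched_model = perform_pipeline_stitches(solution_model, descriptor)
print(save_checkpoint(stitched_model, "ckpt", descriptor))
```

Note that the descriptor obtained once at the top is threaded through every subsequent call, which is the "descriptor-aware" pattern the walkthrough describes.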
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
⚠️ Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py (1)
Lines 266-277: ⚠️ Potential issue | 🔴 Critical — Hardcoded `trust_remote_code=True` is a security risk.

The coding guidelines explicitly prohibit hardcoding `trust_remote_code=True`, as it "executes arbitrary Python shipped with a checkpoint and creates an RCE vector." The tokenizer loading should expose this as a caller-configurable parameter defaulting to `False`.

🔒 Proposed fix to make trust_remote_code configurable

```diff
 def _load_tokenizer(args: DictConfig) -> PreTrainedTokenizerBase:
     tokenizer = None
+    trust_remote_code = getattr(args, "trust_remote_code", False)
     if (tokenizer_name := getattr(args, "tokenizer_name", None)) is not None:
-        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
+        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=trust_remote_code)
     elif args.teacher_dir is not None:
         try:
-            tokenizer = AutoTokenizer.from_pretrained(args.teacher_dir, trust_remote_code=True)
+            tokenizer = AutoTokenizer.from_pretrained(args.teacher_dir, trust_remote_code=trust_remote_code)
         except:
             pass
     if tokenizer is None:
         warnings.warn("Couldn't find a tokenizer, trying to continue without one")
     return tokenizer
```

As per coding guidelines: "Do not hardcode `trust_remote_code=True` when loading transformers models. Let the caller decide via a parameter with default value `False`, as this flag executes arbitrary Python shipped with a checkpoint and creates an RCE vector."

🤖 Prompt for AI Agents
trust_remote_code=Truewhen loading transformers models. Let the caller decide via a parameter with default valueFalse, as this flag executes arbitrary Python shipped with a checkpoint and creates an RCE vector."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py` around lines 266 - 277, The _load_tokenizer function currently hardcodes trust_remote_code=True; change it to read a boolean flag (default False) from the caller/config and pass that into AutoTokenizer.from_pretrained instead. Specifically, use getattr(args, "trust_remote_code", False) (or accept an explicit parameter) and replace both AutoTokenizer.from_pretrained(..., trust_remote_code=True) calls with trust_remote_code=trust_remote_code so the caller can opt in; keep the existing tokenizer_name and teacher_dir fallbacks and the try/except behavior unchanged.
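Independent of the diff above, the opt-in pattern the reviewer asks for boils down to resolving the flag once with a safe default and passing it through. A minimal sketch of that pattern (`resolve_trust_remote_code` is a hypothetical helper name, not part of the codebase):

```python
from types import SimpleNamespace


def resolve_trust_remote_code(args) -> bool:
    # Default to False: code shipped inside a checkpoint should never
    # execute unless the caller explicitly opts in.
    return bool(getattr(args, "trust_remote_code", False))


print(resolve_trust_remote_code(SimpleNamespace()))                         # False
print(resolve_trust_remote_code(SimpleNamespace(trust_remote_code=True)))   # True
```

The resolved boolean would then be forwarded to every `AutoTokenizer.from_pretrained` (or model-loading) call instead of a hardcoded `True`.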
🧹 Nitpick comments (3)
modelopt/torch/puzzletron/replacement_library/replacement_library.py (1)
Lines 366-369: Consider adding `weights_only=True` for consistency.

While PyTorch >= 2.6 defaults to safe loading, the codebase elsewhere (line 257) explicitly uses `weights_only=True`. Consider adding it here for consistency and defense in depth.

♻️ Suggested change

```diff
 non_block_pth_path = checkpoint_dir / PTH_SUBBLOCKS_DIR_NAME / f"non_block.pth"
 assert non_block_pth_path.exists(), _error_message_ensure_split(checkpoint_dir)
-non_block_state_dict = torch.load(non_block_pth_path)
+non_block_state_dict = torch.load(non_block_pth_path, weights_only=True)
 return non_block_state_dict[param_name]
```
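To illustrate what the suggestion buys you: `weights_only=True` restricts unpickling to tensors and plain containers, so a tampered checkpoint cannot smuggle in arbitrary objects whose unpickling would execute code. A small self-contained round-trip (file name mirrors the one in the diff; the state dict contents are made up):

```python
import os
import tempfile

import torch

state = {"weight": torch.zeros(2, 3), "step": 7}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "non_block.pth")
    torch.save(state, path)
    # Safe load: only tensors and primitive containers are allowed,
    # so arbitrary pickled objects are rejected at load time.
    loaded = torch.load(path, weights_only=True)

print(loaded["step"], tuple(loaded["weight"].shape))  # 7 (2, 3)
```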
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/replacement_library/replacement_library.py` around lines 366-369: the torch.load call that loads non_block.pth (using non_block_pth_path) should use weights_only=True for consistency with the other load at line 257; update the torch.load(non_block_pth_path) invocation to torch.load(non_block_pth_path, weights_only=True) so the returned non_block_state_dict[param_name] is loaded with the same safe, weights-only semantics as elsewhere.

modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py (2)
Lines 189-195: Clarify the no-op symlink save path.

The `if realizable_as_symlinks` block at lines 191-194 contains only a `pass` statement with a comment that the feature is "not supported." If this path is intentionally disabled, consider removing the dead branch or adding a TODO for future implementation.

♻️ Remove dead code or add TODO

```diff
 model_config.dtype = getattr(args, "model_dtype", "torch.bfloat16")
 Converter.copy_checkpoint_files(args.teacher_dir, checkpoint_dir)
-if realizable_as_symlinks:
-    if dist.is_master():
-        # save_checkpoint_as_symlinks is currently not supported
-        pass
+# TODO: Re-enable save_checkpoint_as_symlinks when descriptor-aware symlink saving is implemented
 save_checkpoint(model, checkpoint_dir, descriptor)
```
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py` around lines 189 - 195, The realizable_as_symlinks branch is a dead no-op; either remove the entire if realizable_as_symlinks: ... pass block or replace the pass with an explicit TODO and a descriptive log (e.g., using dist.is_master() to log that save_checkpoint_as_symlinks is not yet implemented) so the intent is clear; ensure save_checkpoint(model, checkpoint_dir, descriptor) and Converter.copy_checkpoint_files(args.teacher_dir, checkpoint_dir) remain unchanged and that any future implementation would call save_checkpoint_as_symlinks when supported.
Lines 176-177: Remove or document the commented-out line.

Line 177 contains a commented-out `realizable_as_symlinks = False`. This appears to be debug/test code. Either remove it or add a comment explaining why it's kept for future reference.

♻️ Remove commented debug code

```diff
 realizable_as_symlinks = can_realize_as_symlinks(layer_replacements)
-# realizable_as_symlinks = False
 model_config = replacement_library.create_model_config(layer_replacements)
```
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py` around lines 176 - 177, The commented-out debug assignment "realizable_as_symlinks = False" should be removed or documented: either delete that dead code line or replace it with a short clarifying comment explaining why a hard override would ever be kept (e.g., for manual testing), referencing the nearby variables and calls (realizable_as_symlinks and can_realize_as_symlinks in validate_puzzle_with_multi_replacements.py) so future readers know whether the value must come from can_realize_as_symlinks or is intentionally forced; make the change in the block where realizable_as_symlinks is assigned.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f1ec14d9-bed5-423f-a9eb-7c8108c79a41
📒 Files selected for processing (4)

- modelopt/torch/puzzletron/puzzletron.py
- modelopt/torch/puzzletron/replacement_library/replacement_library.py
- modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py
- modelopt/torch/puzzletron/tools/validation_utils.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##    feature/puzzletron    #994  +/-  ##
===================================================
  Coverage        72.12%   72.12%
===================================================
  Files              209      209
  Lines            23628    23628
===================================================
  Hits             17042    17042
  Misses            6586     6586
===================================================
```

☔ View full report in Codecov by Sentry.
What does this PR do?
Merging dkorzekwa/any_model_calc_one_block_scores into dkorzekwa/anymodel_build_library_and_stats; this PR is for review only. Ultimately, dkorzekwa/any_model_calc_one_block_scores should be merged into feature/puzzletron once dkorzekwa/anymodel_build_library_and_stats is merged there.
Summary by CodeRabbit
Release Notes
New Features
Improvements