Support megatron tokenization for post training datasets#1018
kevalmorabia97 wants to merge 1 commit into main
Conversation
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough

Enhances megatron preprocessing by adding chat-template rendering for list-typed JSON keys, implementing runtime logging for document and sequence truncation events, reorganizing length accounting logic to reflect post-processing state, and introducing per-key tracking to prevent duplicate template log messages.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (2)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (2)
119-131: Consider handling unexpected value types gracefully.

The code assumes value is either a list (chat format) or a string (raw text). If a JSON key contains an unexpected type (e.g., dict, int, None), line 131 assigns it directly to text, which will fail later at _Encoder.tokenizer.encode(text) with an unclear error. Consider adding explicit type validation:
🛡️ Suggested defensive check
```diff
     if isinstance(value, list):
         if key not in _Encoder._chat_template_logged:
             _Encoder._chat_template_logged.add(key)
             print(f"Applying chat_template to '{key}' key")
         kwargs = {}
         tools = data.get("tools")
         if tools:
             kwargs["tools"] = tools
         text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
-    else:
+    elif isinstance(value, str):
         text = value
+    else:
+        raise TypeError(f"Expected list or str for key '{key}', got {type(value).__name__}")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 119 - 131, The code assumes value is list or string and assigns non-string types directly to text, causing failures later in _Encoder.tokenizer.encode(text); update the handling in _Encoder (in the block around tokenizer.apply_chat_template) to validate value types: if value is list handle as before, if value is str use directly, if value is None/other primitive convert to string (e.g., str(value)) or skip/log a warning (use _Encoder._chat_template_logged or a logger) and ensure text is always a string before calling _Encoder.tokenizer.encode; add a clear warning or raise a controlled TypeError for unexpected dict/complex types to make failures explicit.
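A self-contained sketch of the dispatch this comment recommends. The StubTokenizer below is a hypothetical stand-in for the real Hugging Face tokenizer used by _Encoder; only the list/str/other branching is the point.

```python
# Sketch of the list-vs-str dispatch suggested above. StubTokenizer is a
# hypothetical stand-in for the real tokenizer; its apply_chat_template
# mimics the real interface only loosely.
class StubTokenizer:
    def apply_chat_template(self, messages, tokenize=False, **kwargs):
        # Real tokenizers render a Jinja chat template; this just joins turns.
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def to_text(tokenizer, key, value, data):
    """Normalize one JSON value to a string before tokenization."""
    if isinstance(value, list):  # chat format: list of {role, content} dicts
        kwargs = {}
        tools = data.get("tools")
        if tools:
            kwargs["tools"] = tools
        return tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
    if isinstance(value, str):  # raw pretraining text
        return value
    raise TypeError(f"Expected list or str for key '{key}', got {type(value).__name__}")

tok = StubTokenizer()
print(to_text(tok, "messages", [{"role": "user", "content": "hi"}], {}))  # user: hi
print(to_text(tok, "text", "plain document", {}))  # plain document
```

With this shape, a dict or None value fails fast with an explicit message instead of surfacing later inside tokenizer.encode.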
135-146: Inconsistent truncation logging and potential noise.

Document truncation is logged (lines 137-138), but sequence truncation logging is commented out (line 146). Additionally, the commented-out code references original_length, which would be undefined in that context since it was overwritten at line 135.

For large datasets with many long documents, the per-document truncation logging could produce excessive output. Consider:
- Making truncation logging consistent (either log both or neither)
- Using a counter and logging summary statistics at the end, rather than per-document
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 135 - 146, The code currently prints per-document truncation and has a commented-out, buggy sequence-truncation log that references original_length incorrectly; instead add counters (e.g., num_doc_truncated, num_seq_truncated) and track original character length and original token length (capture original_token_length = len(encoded) before truncating to self.max_sequence_length) inside the block using _Encoder.tokenizer.encode and the document trimming to self.max_document_length, increment the appropriate counter when truncation occurs, remove per-document print calls, and emit a single summary log at the end of preprocessing (including counts and maybe totals like total_docs and total_truncated_tokens) so truncation reporting is consistent and not noisy.
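A minimal sketch of the counter-based summary this comment proposes. The parameter names (max_doc_len, max_seq_len) follow the review's wording and are assumptions about the real encoder, not its actual fields.

```python
class TruncationStats:
    """Accumulate truncation counts and report once, instead of per-document prints."""

    def __init__(self):
        self.total_docs = 0
        self.num_doc_truncated = 0   # character-level document trimming
        self.num_seq_truncated = 0   # token-level sequence trimming

    def record(self, char_len, max_doc_len, token_len, max_seq_len):
        # Capture lengths *before* trimming so comparisons reflect originals.
        self.total_docs += 1
        if max_doc_len is not None and char_len > max_doc_len:
            self.num_doc_truncated += 1
        if max_seq_len is not None and token_len > max_seq_len:
            self.num_seq_truncated += 1

    def summary(self):
        return (
            f"Truncated {self.num_doc_truncated}/{self.total_docs} documents "
            f"and {self.num_seq_truncated}/{self.total_docs} sequences"
        )

stats = TruncationStats()
stats.record(char_len=5000, max_doc_len=4096, token_len=900, max_seq_len=1024)
stats.record(char_len=100, max_doc_len=4096, token_len=2048, max_seq_len=1024)
print(stats.summary())  # Truncated 1/2 documents and 1/2 sequences
```

Emitting summary() once at the end of preprocessing keeps the report consistent for both truncation kinds without flooding stdout on large corpora.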
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Line 90: The class-level _chat_template_logged set cannot deduplicate logs
across worker processes because each multiprocessing.Pool worker has its own
copy; either move the "Applying chat_template..." logging out of worker code
into the parent process before dispatching tasks (so logging/deduplication is
done once), or replace the in-process set with a process-shared structure
created from multiprocessing.Manager() (e.g., a Manager().dict or list used as a
set) and use that shared object where _chat_template_logged is referenced so all
Pool workers see the same dedupe state; update references to
_chat_template_logged and the code that writes the log accordingly.
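A sketch of the Manager-backed alternative described above, assuming the dedupe is best-effort: the check-then-set below is not atomic across workers, so a rare duplicate log line is possible but harmless.

```python
import multiprocessing as mp

def log_template_once(logged, key):
    """Log 'Applying chat_template' at most ~once per key across Pool workers.

    `logged` is a Manager().dict() shared by all workers, replacing the
    per-process class-level set that cannot deduplicate across processes.
    """
    if key not in logged:
        logged[key] = True
        print(f"Applying chat_template to '{key}' key")
        return True
    return False

if __name__ == "__main__":
    with mp.Manager() as manager:
        logged = manager.dict()
        log_template_once(logged, "messages")  # logs
        log_template_once(logged, "messages")  # deduped, even from other workers
```

The Manager dict would be created in the parent before the Pool is spawned and passed to workers (or via an initializer), so every worker consults the same state.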
📒 Files selected for processing (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1018      +/-   ##
==========================================
- Coverage   70.11%   70.09%   -0.03%
==========================================
  Files         220      220
  Lines       25240    25240
==========================================
- Hits        17698    17692       -6
- Misses       7542     7548       +6
```

☔ View full report in Codecov by Sentry.
Can we add a test in
Do you know any small dataset we can use for a test here?
The test_sft split of this dataset is very small: https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/viewer/default/test_sft
What does this PR do?
Update megatron_preprocess_data.py to support applying a chat template when tokenizing chat-based post-training datasets.
Usage
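An illustrative sketch of the two record shapes the updated preprocessor distinguishes. The key names "messages", "text", and "tools" are assumptions drawn from the review comments, not the confirmed dataset schema.

```python
import json

# Chat-format record: a list-typed key triggers tokenizer.apply_chat_template.
# Key names here ("messages", "tools") are assumed, not the confirmed schema.
chat_record = {
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ],
    "tools": [],  # optional; forwarded to apply_chat_template when present
}
# Raw-text record: string values are tokenized directly, as before this PR.
raw_record = {"text": "Plain pretraining text."}

json_keys = ["messages", "text"]  # keys selected for tokenization (assumed)
for line in (json.dumps(chat_record), json.dumps(raw_record)):
    data = json.loads(line)
    for key in json_keys:
        if key not in data:
            continue
        mode = "chat template" if isinstance(data[key], list) else "raw text"
        print(f"{key} -> {mode}")
```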
# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
- Make sure you read and follow the Contributor guidelines and your commits are signed (git commit -s -S).
- Make sure you read and follow the Security Best Practices (e.g., avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
- CONTRIBUTING.md: N/A