
[DRAFT] feat(zero2): add CPU offload support for Muon optimizer#7939

Draft
delock wants to merge 2 commits into deepspeedai:master from delock:gma/muon_cpuoffload

Conversation

@delock
Collaborator

@delock delock commented Mar 31, 2026

Add Muon optimizer support to the ZeRO Stage 1 & 2 CPU offload path by:

  1. Copying the full parameter gradient (instead of the partial gradient) to CPU when a parameter crosses a partition boundary.
  2. Storing the Muon momentum buffer in CPU memory to save GPU memory.
  3. Running the Muon update on CPU, utilizing CPU matmul.
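Point 1 hinges on detecting whether a parameter straddles an equal-split partition boundary. A minimal illustration of that check with hypothetical names (ZeRO's real bookkeeping lives in structures like `grad_position`, not shown in this PR excerpt):

```python
def crosses_partition_boundary(param_offset, param_numel, partition_size):
    """True if the flat range [offset, offset + numel) spans more than one
    equally sized partition, i.e. the gradient is split across ranks."""
    first = param_offset // partition_size
    last = (param_offset + param_numel - 1) // partition_size
    return first != last
```

For example, with `partition_size=4`, a parameter at offset 3 with 3 elements spans ranks 0 and 1, so its full gradient would be copied to CPU; a parameter at offset 4 with 4 elements lies entirely within rank 1's partition.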

@delock delock marked this pull request as draft March 31, 2026 07:02

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54364fbe9a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

pad_tensor = torch.zeros(padded_size - self.bit16_groups_flat[i].numel(),
                         dtype=self.bit16_groups_flat[i].dtype,
                         device=self.bit16_groups_flat[i].device)
self.bit16_groups_flat[i] = torch.cat([self.bit16_groups_flat[i], pad_tensor])


P1: Insert per-partition padding before the Muon equal split

Appending a single padding block at the tail does not guarantee parameter-boundary partitioning: when an earlier partition is smaller than max_partition_size (e.g., sizes [4,5,1] for dp=3), get_data_parallel_partitions() still cuts at fixed max_partition_size offsets and splits a parameter across ranks. That breaks the new CPU-offload Muon path, which assumes unsplit parameters and writes a full update.view(-1) into a partition slice computed from grad_position, leading to shape mismatch or incorrect updates when source_offset != 0.
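The fix this review suggests amounts to padding each rank's partition individually so every cut lands on a parameter boundary. A toy sketch of that idea using the reviewer's `[4, 5, 1]` example with `dp=3` (greedy packing; names are illustrative, not DeepSpeed's actual helpers):

```python
def partition_by_param_boundary(param_numels, dp_size):
    """Greedily pack whole params into dp_size partitions, left to right,
    never splitting a param; return partitions, the uniform (max) partition
    size, and the per-partition padding needed to reach it."""
    total = sum(param_numels)
    target = (total + dp_size - 1) // dp_size  # ideal per-rank size
    partitions, current = [[]], 0
    for numel in param_numels:
        if current + numel > target and len(partitions) < dp_size and current > 0:
            partitions.append([])  # start a new partition at a param boundary
            current = 0
        partitions[-1].append(numel)
        current += numel
    while len(partitions) < dp_size:
        partitions.append([])  # ranks that hold no params, padding only
    max_size = max(sum(p) for p in partitions)
    padding = [max_size - sum(p) for p in partitions]
    return partitions, max_size, padding
```

With sizes `[4, 5, 1]` and `dp=3` this yields partitions `[[4], [5], [1]]` padded per-rank to a uniform size of 5, whereas a single tail pad leaves the fixed `max_partition_size` cuts splitting the 5-element parameter across ranks.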


if self._is_muon_param_group(i):
    dp_size = dist.get_world_size(group=self.real_dp_process_group[i])
    max_ps = self._get_muon_max_partition_size(self.round_robin_bit16_groups[i], dp_size, orig_group_numel)
    padded_size = max_ps * dp_size


P1: Keep the Muon partition size aligned for NCCL boundaries

max_partition_size is used directly to set padded_size, but it is not rounded to the existing NCCL start-alignment factor. If max_partition_size is odd with fp16/bf16 tensors, partition starts after rank 0 become 2-byte shifted and fail the existing 4-byte alignment assertion in the same initialization flow. This makes valid Muon configurations crash depending on parameter shapes.
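A sketch of the rounding this review asks for, assuming the 4-byte NCCL start alignment it mentions (the constant name here is illustrative):

```python
NCCL_START_ALIGNMENT_BYTES = 4  # assumption: matches ZeRO's start-alignment check

def aligned_partition_size(max_partition_size, element_size_bytes):
    """Round the per-rank partition size (in elements) up so every rank's
    start offset lands on a NCCL_START_ALIGNMENT_BYTES byte boundary."""
    align_elems = max(1, NCCL_START_ALIGNMENT_BYTES // element_size_bytes)
    return ((max_partition_size + align_elems - 1) // align_elems) * align_elems
```

For 2-byte fp16/bf16 elements an odd `max_partition_size` of 5 rounds up to 6, so rank 1's partition starts at a 12-byte offset instead of a misaligned 10-byte one; 4-byte fp32 partitions need no rounding.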


@delock delock force-pushed the gma/muon_cpuoffload branch 2 times, most recently from d802f0e to c058864 on March 31, 2026 10:07
delock added 2 commits March 31, 2026 03:07
Add Muon optimizer support in ZeRO Stage 1&2 CPU offload path by:

1. Partition strategy: Muon param groups now partition by parameter
   boundaries (never split a param across ranks), padding to uniform
   max size for all-gather compatibility. Logs padding overhead ratio.

2. CPU Newton-Schulz: Add muon_update_cpu() and
   zeropower_via_newtonschulz5_cpu() using PyTorch CPU bf16 matmul
   as baseline. Architecture allows future replacement with AMX C++ kernel.

3. CPU offload integration: _apply_muon_update_for_cpu_offload() copies
   complete gradients to CPU, runs muon_update on CPU (momentum buffer
   stays on CPU), writes result to FP32 grad buffer. No extra PCIe transfers.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Some CPUs lack hardware bf16 matmul support (AMX/AVX-512-BF16), causing
NS iterations to be ~800x slower than fp32 via MKL. This change uses
fp32 if CPU does not support bf16, reducing CPU offload NS time from
~18s to ~24ms for 512x2048 matrices.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
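The CPU Newton-Schulz routine added in the first commit (and switched to fp32 on non-bf16 CPUs in the second) can be sketched as the quintic Newton-Schulz-5 iteration used by the Muon optimizer, with coefficients (3.4445, -4.7750, 2.0315). This pure-Python version uses list-based matrices in place of torch tensors and assumes a square or wide input (Muon transposes tall matrices first); the actual `zeropower_via_newtonschulz5_cpu` signature is not shown in this PR, so names are illustrative.

```python
import math

def matmul(A, B):
    """Naive matrix product; stands in for torch matmul on CPU."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def add_scaled(A, B, alpha):
    """Elementwise A + alpha * B."""
    return [[a + alpha * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(A, s):
    return [[s * a for a in row] for row in A]

def zeropower_via_newtonschulz5(G, steps=5):
    """Approximately orthogonalize G: drive its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315  # Muon's quintic coefficients
    # Normalize so all singular values are <= 1 before iterating.
    fro = math.sqrt(sum(x * x for row in G for x in row)) + 1e-7
    X = scale(G, 1.0 / fro)
    for _ in range(steps):
        A = matmul(X, transpose(X))                    # X X^T
        B = add_scaled(scale(A, b), matmul(A, A), c)   # b*A + c*A^2
        X = add_scaled(scale(X, a), matmul(B, X), 1.0)  # a*X + B X
    return X
```

The iteration converges quickly but not exactly: after five steps the singular values oscillate in a band around 1, which is sufficient for Muon's update. Running it in bf16 relies on hardware matmul support (AMX/AVX-512-BF16), which is why the second commit falls back to fp32 where that is absent.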
