Fix/fix autotp universal checkpoint ci #7937

Merged
sfc-gh-truwase merged 2 commits into deepspeedai:master from tohtana:fix/fix-autotp-universal-checkpoint-ci
Mar 31, 2026

tohtana (Collaborator) commented Mar 31, 2026

The full CI run fails with "RuntimeError: Cannot re-initialize CUDA" because of the universal checkpoint and AutoTP tests.

The failure occurs because these tests call torch.cuda.current_device() under pytest --forked. Since the tests only touch universal checkpoint metadata, that call is unnecessary. This PR skips constructor-time AutoTP materialization when mp_group is None.
Partitioning still happens in real AutoTP usage, where an actual model-parallel group is provided.
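
The guard described above can be illustrated with a minimal, framework-free sketch. It is not the actual DeepSpeed change: the class `ShardedLinearSketch`, the `device_fn` hook (standing in for torch.cuda.current_device()), and the `(rank, world_size)` tuple used as `mp_group` are all hypothetical names chosen for illustration. The only point is that the device query and the partitioning are skipped when no model-parallel group is given.

```python
# Hypothetical stand-in for a tensor-parallel layer; NOT DeepSpeed's real class.
class ShardedLinearSketch:
    def __init__(self, weight, mp_group=None, device_fn=None):
        self.weight = weight
        self.mp_group = mp_group      # assumed here to be a (rank, world_size) tuple
        self.device = None
        self.materialized = False
        self._device_fn = device_fn   # stand-in for torch.cuda.current_device()
        # Guard: only touch the device and partition when a real group is given.
        if mp_group is not None:
            self._materialize()

    def _materialize(self):
        # The device query lives here, so metadata-only construction never runs it.
        self.device = self._device_fn()
        rank, world_size = self.mp_group
        shard = len(self.weight) // world_size
        self.weight = self.weight[rank * shard:(rank + 1) * shard]
        self.materialized = True


calls = []

def fake_current_device():
    # Records each query so the skipped call is observable.
    calls.append("queried")
    return 0

# Metadata-only construction: no group, so no device query and no partitioning.
meta_only = ShardedLinearSketch(list(range(8)), mp_group=None,
                                device_fn=fake_current_device)
calls_after_meta_only = list(calls)

# "Real" usage: a group is given, so the weight is partitioned as before.
sharded = ShardedLinearSketch(list(range(8)), mp_group=(1, 2),
                              device_fn=fake_current_device)
```

In this sketch the metadata-only object keeps its full weight and never invokes the device hook, while the object built with a group receives its shard. A forked test process that only inspects metadata would take the first path.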

tohtana added 2 commits March 30, 2026 13:24
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
sfc-gh-truwase merged commit 3bdebc0 into deepspeedai:master on Mar 31, 2026
10 of 13 checks passed