Test code:
import bitsandbytes
uut = bitsandbytes.nn.modules.Linear4bit(100, 100)  # unit under test: expect 100*100 weights and 100 biases
print([i.numel() for i in uut.parameters()])  # [10000, 100], correct
uut.cuda()
print([i.numel() for i in uut.parameters()])  # [5000, 100], lost 50% of the weights after copy to gpu
uut.cpu()
print([i.numel() for i in uut.parameters()])  # [5000, 100], count stays halved after copy back to cpu
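
One possible reading of the halved count, offered as an assumption and not a confirmed diagnosis: a 4-bit value takes half a byte, so two of them fit in one 8-bit element, and 10000 4-bit weights would then occupy 5000 elements. The sketch below only illustrates that arithmetic in plain Python. It is not bitsandbytes code, and pack_nibbles / unpack_nibbles are hypothetical helper names made up for this example.

```python
def pack_nibbles(values):
    """Pack a list of 4-bit integers (0..15) two per byte."""
    if len(values) % 2:
        raise ValueError("expected an even number of 4-bit values")
    packed = bytearray()
    # First value of each pair goes in the high nibble, second in the low nibble.
    for high, low in zip(values[0::2], values[1::2]):
        packed.append(((high & 0x0F) << 4) | (low & 0x0F))
    return bytes(packed)


def unpack_nibbles(packed):
    """Recover the original 4-bit integers from the packed bytes."""
    values = []
    for byte in packed:
        values.append(byte >> 4)    # high nibble
        values.append(byte & 0x0F)  # low nibble
    return values


# Hypothetical stand-in for 100*100 quantized weight codes.
codes = [i % 16 for i in range(100 * 100)]
packed = pack_nibbles(codes)

print(len(codes), len(packed))  # 10000 5000
assert unpack_nibbles(packed) == codes  # nothing is lost by packing
```

If something like this is what happens on the move to the GPU, numel() would be counting packed storage elements, not logical weights. Checking i.dtype and i.shape for each parameter before and after uut.cuda() should show whether that is the case.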