For the same input (the first sample), the output of self.model() differs between two configurations: DeepSpeed ZeRO-3 and no DeepSpeed. In addition, after self.attn followed by the mlp, the result is NaN, which causes the loss to be 0. The environment is an A100.
The output is as follows:

