
We also reduce VRAM usage by 500MB or more by fusing the softcapping mechanism into the cross entropy loss calculation. The derivatives of the softcapping step are needed here as well, so we fuse those in too. With both fused, we do not have to keep a copy of the logits before the softcapping operation, which is where the VRAM saving comes from. We verified the accuracy of our gradients by confirming that the losses match up.
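To make the idea concrete, here is a minimal pure-Python sketch of a softcapped cross entropy for a single row of logits. It is not our actual kernel. It assumes the common softcap form `cap * tanh(x / cap)`, and the function names, the `cap` parameter, and the finite-difference check are all hypothetical choices made for illustration. The point is that the forward and backward passes each work from one logits buffer and recompute the `tanh` on the fly, so a second logits tensor never has to be materialized.

```python
import math

def softcap_ce_forward(logits, target, cap):
    """Cross entropy loss for one row of logits, with softcapping applied inline."""
    # Assumed softcap form: cap * tanh(x / cap). Computed on the fly, not stored.
    capped = [cap * math.tanh(x / cap) for x in logits]
    m = max(capped)  # subtract the max for a numerically stable log-sum-exp
    lse = m + math.log(sum(math.exp(z - m) for z in capped))
    return lse - capped[target]

def softcap_ce_backward(logits, target, cap):
    """Gradient of the loss w.r.t. the uncapped logits, recomputing tanh inline."""
    t = [math.tanh(x / cap) for x in logits]
    capped = [cap * ti for ti in t]
    m = max(capped)
    exps = [math.exp(z - m) for z in capped]
    total = sum(exps)
    grads = []
    for i, (e, ti) in enumerate(zip(exps, t)):
        dz = e / total - (1.0 if i == target else 0.0)  # softmax minus one-hot
        grads.append(dz * (1.0 - ti * ti))  # chain rule: d/dx cap*tanh(x/cap) = 1 - tanh^2
    return grads

def finite_difference_grad(logits, target, cap, eps=1e-6):
    """Central-difference estimate of the same gradient, used only as a sanity check."""
    grads = []
    for i in range(len(logits)):
        plus = list(logits)
        minus = list(logits)
        plus[i] += eps
        minus[i] -= eps
        diff = softcap_ce_forward(plus, target, cap) - softcap_ce_forward(minus, target, cap)
        grads.append(diff / (2.0 * eps))
    return grads

if __name__ == "__main__":
    row = [2.0, -1.0, 0.5, 40.0]  # made-up logits for illustration
    analytic = softcap_ce_backward(row, target=2, cap=30.0)
    numeric = finite_difference_grad(row, target=2, cap=30.0)
    print(max(abs(a - b) for a, b in zip(analytic, numeric)))
```

The finite-difference comparison here is a different check from the loss comparison described above. It is included only because it is easy to run on a toy example.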


