How can I maximize speed on Lambda GPU instances?

I am trying to train a reinforcement learning model in PyTorch. I tried a GH200 instance earlier, and now I am trying an 8xH100 instance, and I am finding it unexpectedly slow. Are there any techniques I can use to take better advantage of the GPUs' capabilities? In particular, I am wondering whether there are environment variables I should change, or special PyTorch techniques I should be using.
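
To make the question more concrete, here is a rough sketch of the sort of knobs I am asking about; the model, shapes, and specific values below are placeholders rather than my actual RL code, so please treat it only as an illustration of what I mean by environment variables and PyTorch-level settings (TF32, bf16 autocast, torch.compile):

```python
import os
import torch

# Environment variables of the kind I am asking about (values here are guesses, not tuned):
os.environ.setdefault("OMP_NUM_THREADS", "8")   # limit CPU thread oversubscription
os.environ.setdefault("NCCL_DEBUG", "WARN")     # control NCCL log verbosity across the 8 GPUs

# Allow TF32 matmuls/convolutions on Hopper GPUs (small precision trade-off for speed).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Placeholder model and data standing in for my actual RL policy network.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# torch.compile plus bf16 autocast, which I understand are the usual PyTorch 2.x levers.
model = torch.compile(model)

x = torch.randn(256, 1024, device="cuda")
target = torch.randn(256, 1024, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()
```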

Could you specify in greater detail what you mean when you say the training is slow? If you have log files, those might help with troubleshooting the issue.
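
If you do not have logs handy, even a short torch.profiler trace would show where the time is going. A minimal sketch along these lines would work; `train_step` below is just a stand-in for one iteration of your actual training loop:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def train_step():
    # Placeholder for one real training iteration of your model.
    x = torch.randn(256, 1024, device="cuda")
    w = torch.randn(1024, 1024, device="cuda", requires_grad=True)
    (x @ w).sum().backward()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    for _ in range(10):
        train_step()

# Summarize the ops where the most GPU time is spent; a Chrome-viewable trace
# can also be saved with prof.export_chrome_trace("trace.json").
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```

The resulting table (or the exported trace) makes it much easier to tell whether the run is compute-bound, stalled on data loading, or waiting on communication between the GPUs.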