A100 seems to be strangely slow

This may turn out to be a model-debugging question, but: I have a modest-sized convolutional network I'm training. After debugging it on my local MacBook Air M1 (2020), I launched an A100 instance to do the real training run. I was somewhat shocked to find that the model trains more slowly on the A100 than on the M1 using MPS, and in fact more slowly than the M1 using the CPU.

These are the approximate numbers per epoch of training:

A100: 90 seconds
M1 CPU: 80 seconds
M1 MPS: 42 seconds
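Those are wall-clock times for a full epoch. In case my measurement itself is suspect: here is roughly how I'd time a single training step on CUDA (this is a minimal sketch with a toy stand-in model, not my actual network; CUDA kernels launch asynchronously, so a synchronize is needed before reading the clock):

```python
import time
import torch
import torch.nn as nn

# Toy stand-in for the conv net; falls back to CPU if no GPU is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 32 * 32, 10),
).to(device)

x = torch.randn(64, 3, 32, 32, device=device)
y = torch.randint(0, 10, (64,), device=device)
loss_fn = nn.CrossEntropyLoss()

# Warm-up step so one-time CUDA initialization is not counted
loss_fn(model(x), y).backward()
model.zero_grad()

if device.type == "cuda":
    torch.cuda.synchronize()  # wait for queued kernels before timing
start = time.perf_counter()
loss = loss_fn(model(x), y)
loss.backward()
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for the backward kernels to finish
elapsed = time.perf_counter() - start
print(f"one step on {device.type}: {elapsed * 1000:.1f} ms")
```

Without the synchronize calls, the timer only measures kernel launch overhead, which can make a GPU look either much faster or much slower than it really is.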

I’ve been over the model to ensure that everything is getting loaded onto the GPU properly, but my guess is that I have some data type that isn’t right for CUDA.
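For the dtype theory, this is the kind of check I've been running (again a sketch with an illustrative stand-in model): every parameter should report the CUDA device and float32, since a stray float64 tensor silently forces the GPU down much slower FP64 code paths.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for my network; falls back to CPU without a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to(device)

# Print where each parameter lives and what dtype it carries
for name, p in model.named_parameters():
    print(name, p.device, p.dtype)

# Collect anything that is not float32 (e.g. an accidental float64 buffer)
bad = [n for n, p in model.named_parameters() if p.dtype != torch.float32]
print("non-float32 params:", bad)
```

The same loop over the batches coming out of the DataLoader (checking `batch.device` and `batch.dtype`) would catch inputs that are still float64 NumPy arrays converted on the fly.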

Any insight would be welcome.