A100 seems to be strangely slow

This may turn out to be a model-debugging question, but: I have a modest-sized convolutional network I'm training. After debugging it on my local MacBook Air M1 (2020), I launched an A100 instance to do the real training run. I was somewhat shocked to find that the model trains more slowly on the A100 than on the M1 using MPS, and in fact more slowly than the M1 using the CPU.

These are the approximate numbers per epoch of training:

A100: 90 seconds
M1 CPU: 80 seconds
M1 MPS: 42 seconds
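Those are wall-clock times for a full epoch. In case my measurement itself is suspect: here is roughly how I'd time a single training step on CUDA (this is a minimal sketch with a toy stand-in model, not my actual network; CUDA kernels launch asynchronously, so a synchronize is needed before reading the clock):

```python
import time
import torch
import torch.nn as nn

# Toy stand-in for the conv net; falls back to CPU if no GPU is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 32 * 32, 10),
).to(device)

x = torch.randn(64, 3, 32, 32, device=device)
y = torch.randint(0, 10, (64,), device=device)
loss_fn = nn.CrossEntropyLoss()

# Warm-up step so one-time CUDA initialization is not counted
loss_fn(model(x), y).backward()
model.zero_grad()

if device.type == "cuda":
    torch.cuda.synchronize()  # wait for queued kernels before timing
start = time.perf_counter()
loss = loss_fn(model(x), y)
loss.backward()
if device.type == "cuda":
    torch.cuda.synchronize()  # wait for the backward kernels to finish
elapsed = time.perf_counter() - start
print(f"one step on {device.type}: {elapsed * 1000:.1f} ms")
```

Without the synchronize calls, the timer only measures kernel launch overhead, which can make a GPU look either much faster or much slower than it really is.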

I’ve been over the model to ensure that everything is getting loaded onto the GPU properly, but my guess is that I have some data type that isn’t right for CUDA.
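For the dtype theory, this is the kind of check I've been running (again a sketch with an illustrative stand-in model): every parameter should report the CUDA device and float32, since a stray float64 tensor silently forces the GPU down much slower FP64 code paths.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for my network; falls back to CPU without a GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to(device)

# Print where each parameter lives and what dtype it carries
for name, p in model.named_parameters():
    print(name, p.device, p.dtype)

# Collect anything that is not float32 (e.g. an accidental float64 buffer)
bad = [n for n, p in model.named_parameters() if p.dtype != torch.float32]
print("non-float32 params:", bad)
```

The same loop over the batches coming out of the DataLoader (checking `batch.device` and `batch.dtype`) would catch inputs that are still float64 NumPy arrays converted on the fly.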

Any insight would be welcome.