I have a gpu_8x_a100 instance in the Virginia, USA region.
I am currently running multiple single-GPU trainings of small vision models in parallel (24 in total, 3 per GPU).
I am overall happy with the performance of the 8x A100 GPUs; however, I noticed that it is quite hard to reach 100% GPU utilization.
I noticed that one of the reasons is that the CPU shows almost 50% "VM steal time". In the screenshot below, you can see the yellow bar, which indicates time the vCPUs spend ready to run but waiting for the VM host to schedule them; it takes up almost half of the total CPU time.
I figured out that this server runs as a QEMU guest. So what causes this high CPU steal time in Lambda Cloud? Maybe heavy I/O?
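For reference, besides the dashboard I also measured it directly from the guest. This is a minimal sketch (assuming a Linux guest) that samples the `steal` field from `/proc/stat` over a short interval and reports it as a share of total CPU time:

```python
import time

def cpu_times():
    """Read the aggregate 'cpu' line from /proc/stat as a list of ints.

    Field order: user nice system idle iowait irq softirq steal guest guest_nice
    """
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

def steal_percent(interval: float = 1.0) -> float:
    """Percentage of CPU time stolen by the hypervisor over `interval` seconds."""
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    # Index 7 is the 'steal' counter (ticks the vCPU was runnable
    # but the host scheduled something else).
    return 100.0 * delta[7] / total if total else 0.0

if __name__ == "__main__":
    print(f"steal: {steal_percent():.1f}%")
```

On my instance this hovers around the same ~50% the dashboard shows, so it doesn't look like a monitoring artifact.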
Thanks in advance for any answers.