High CPU "steal time" on 8x A100 machine


I have a gpu_8x_a100 instance in the Virginia, USA region.

I am currently running many separate training jobs for small vision models: about 24 single-GPU runs in total, 3 per GPU.

Overall I am happy with the performance of the 8x A100 GPUs. However, I noticed that it is quite hard to reach 100% GPU utilization.

One of the reasons seems to be that the CPU shows almost 50% “VM steal time”. In the screenshot below, the “yellow” bar, which indicates time the CPU spends waiting on the VM host, takes up almost half of the CPU time.
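To put a number on this without relying on the screenshot, here is a minimal sketch that samples the aggregate "cpu" line of /proc/stat twice and reports the share of time counted as steal. It assumes a Linux guest whose kernel exposes the steal column (the 8th value on that line); the function names are my own and purely illustrative.

```python
import time

# First eight counters on the aggregate "cpu" line of /proc/stat.
# The later guest/guest_nice columns are skipped on purpose: the kernel
# already counts them inside user/nice, so adding them would double count.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn a 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Pad with zeros if the kernel reports fewer columns than expected.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def steal_percent(before, after):
    """Percentage of elapsed CPU ticks between two samples spent as steal."""
    total = sum(after[f] - before[f] for f in FIELDS)
    if total <= 0:
        return 0.0
    return 100.0 * (after["steal"] - before["steal"]) / total


def read_cpu_sample(path="/proc/stat"):
    """Read the aggregate CPU counters (first line of /proc/stat)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


if __name__ == "__main__":
    first = read_cpu_sample()
    time.sleep(1.0)  # sampling interval; longer intervals smooth out noise
    second = read_cpu_sample()
    print(f"steal: {steal_percent(first, second):.1f}%")
```

Running this while the trainings are active should give roughly the same figure as the yellow bar, averaged over all cores.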

I figured out that this server runs as a VM under QEMU. So what causes such high VM steal time on Lambda Cloud? Could it be high I/O?

Thank you in advance for any answers.

