High CPU "steal time" on 8x A100 machine

kbumsik · June 22, 2023, 11:30am

Hi,

I have a gpu_8x_a100 instance in the Virginia, USA region.

I am currently running 24 single-GPU training jobs (3 per GPU), each training a different small vision model.

Overall I am happy with the performance of the 8x A100 GPUs. However, I noticed that it is quite hard to reach 100% GPU utilization.

I noticed that one of the reasons is that the CPU shows almost 50% "VM steal time". In the screenshot below, the yellow bar, which indicates time the CPU spends waiting on the VM host, takes up almost half of the CPU time.
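For anyone who wants to check the same number without relying on a screenshot, here is a minimal sketch that computes the steal percentage from two snapshots of /proc/stat. It assumes a Linux guest where the aggregate "cpu" line has the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice). The function names are my own and purely illustrative.

```python
# Field order on the aggregate "cpu" line of /proc/stat (Linux):
# user nice system idle iowait irq softirq steal guest guest_nice
STEAL_INDEX = 7


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line as a list of ints."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(x) for x in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before_text, after_text):
    """Percentage of CPU time reported as steal between two snapshots."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [b - a for a, b in zip(before, after)]
    if len(deltas) <= STEAL_INDEX:
        # Very old kernels do not report a steal column at all.
        return 0.0
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight fields are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[STEAL_INDEX] / total


def read_proc_stat(path="/proc/stat"):
    """Read the current contents of /proc/stat (hypothetical helper)."""
    with open(path) as f:
        return f.read()
```

Calling read_proc_stat() twice a few seconds apart and passing both results to steal_percent() should give roughly the same figure that top or htop shows as "st" over that interval.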

I figured out that this server runs under QEMU. So what causes such high CPU steal time on Lambda Cloud? Could it be high I/O?

Thank you in advance for any answers.

[Screenshot: CPU usage bars, with the yellow steal-time segment taking almost half of the CPU time]
