I have a Lambda server with 2 RTX 3090 GPUs. However, one of the GPUs keeps getting an unknown error during training.
Even though rebooting temporarily recovers this GPU, how can I solve this problem permanently? Does anyone else have this problem?
Running ‘sudo nvidia-bug-report.sh’ will produce an ‘nvidia-bug-report.log.gz’ file, which should be sufficient to see what is failing.
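For example (assuming a standard driver install; the log is written to the current directory):

sudo nvidia-bug-report.sh
# produces nvidia-bug-report.log.gz in the current directory
zgrep -i "xid" nvidia-bug-report.log.gz
# Xid entries in the kernel log usually identify which GPU faulted and the error type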
Then, ideally, physically swap the two GPUs. This is to confirm whether the error follows the card or stays with the PCIe slot/riser.
Normally the GPU itself is at fault, but it is best to confirm before spending time/money on a replacement.
A quick look with ‘nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv’ will show each GPU's UUID (unique identifier) alongside its PCI bus address.
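For example, record the mapping before and after the swap (the bus IDs and UUIDs below are only illustrative placeholders):

nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv
# index, pci.bus_id, uuid
# 0, 00000000:01:00.0, GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
# 1, 00000000:21:00.0, GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy

If the error follows the UUID to the other slot, the card itself is the problem; if it stays with the same PCI bus address, suspect the slot, riser, or power cabling instead.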