Unable to determine the device handle for GPU0000:21:00.0: Unknown Error

I have a Lambda server with 2 RTX 3090 GPUs. However, one of the GPUs keeps getting an unknown error during training.

Rebooting temporarily recovers the GPU, but how can I fix this permanently? Does anyone else have this problem?

Running `sudo nvidia-bug-report.sh` will produce an `nvidia-bug-report.log.gz`, which should be sufficient to see what is failing.
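The bug report bundles the kernel log, and driver-level GPU faults usually show up there as NVRM "Xid" lines. A minimal sketch for pulling those out of a saved log dump (the sample log lines below are illustrative, not from your machine; Xid 79, "GPU has fallen off the bus", is a common one for this symptom):

```python
import re

# Hypothetical kernel-log excerpt; real Xid messages vary by driver version.
SAMPLE_DMESG = """\
[ 1234.567890] NVRM: Xid (PCI:0000:21:00): 79, pid=4321, GPU has fallen off the bus.
[ 1234.568001] usb 1-1: new high-speed USB device number 4
"""

# Capture the PCI address and the numeric Xid code from each NVRM line.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")

def find_xid_errors(log_text):
    """Return (pci_address, xid_code) pairs found in a kernel log dump."""
    return XID_RE.findall(log_text)

print(find_xid_errors(SAMPLE_DMESG))  # e.g. [('PCI:0000:21:00', '79')]
```

The Xid code plus the PCI address tells you which card is faulting and roughly why, before you open the case.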

Then swap the two GPUs. This confirms three things at once:

  1. Both GPUs get reseated.
  2. Whether the failure follows the GPU or stays with the PCI slot.
  3. If there is an NVLink bridge, removing it rules it out as the cause.

It is usually the GPU itself, but it is best to confirm before spending time/money on a replacement.
A quick `nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv` will show each GPU's UUID (unique identifier) alongside its PCI bus address.