Unable to determine the device handle for GPU, GPU is lost. Reboot the system to recover this GPU

I have a Lambda Blade server with 8 x RTX 3090s. Today I lost one of the GPUs.
Here is the message that I get from
$ nvidia-smi -i 4

Unable to determine the device handle for GPU 0000:81:00.0: GPU is lost. Reboot the system to recover this GPU

Has anyone else had this issue? Will a reboot fix it, or is it a hardware issue?

-Karun

As per the tech support, reboot did the trick.
I hope this is not a persistent vibration-related issue.
-Karun

Karun,

Yes:
1. Run ‘sudo nvidia-bug-report.sh’ and collect the nvidia-bug-report.log.gz
2. Swap the GPU with another GPU. On servers (depending on the system), make sure it is swapped with
a GPU on the other PLX bus. (We are happy to work with our customers to isolate which PCI slot.)
3. Then run the system again. If a GPU fails again after swapping, run ‘sudo nvidia-bug-report.sh’ again to get a second report.
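
To compare the failure before and after the swap, it helps to note the PCI address from the "GPU is lost" message each time. Below is a minimal Python sketch of that, not an official tool. The function name lost_gpu_bus_id is made up for illustration, and the regular expression assumes the address is printed as domain:bus:device.function, as in the message quoted earlier in this thread.

```python
import re
from typing import Optional

# Matches a PCI address such as 0000:81:00.0 (domain:bus:device.function).
# Assumption: the domain may be printed with 4 to 8 hex digits.
PCI_ADDR = re.compile(
    r"\b[0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-7]\b"
)


def lost_gpu_bus_id(message: str) -> Optional[str]:
    """Return the PCI address from a 'GPU is lost' message, or None."""
    if "GPU is lost" not in message:
        return None
    match = PCI_ADDR.search(message)
    return match.group(0) if match else None


# The message quoted at the top of this thread.
msg = (
    "Unable to determine the device handle for GPU 0000:81:00.0: "
    "GPU is lost. Reboot the system to recover this GPU"
)
print(lost_gpu_bus_id(msg))
```

Writing this address down next to each nvidia-bug-report.log.gz makes it easier to tell later whether the failure moved with the card or stayed in the slot.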

The usual causes vary, depending on the specific hardware and the reason for the GPU failure.
The first ‘nvidia-bug-report.log.gz’ may tell me why it failed.
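
If you want a first look at the report yourself, one thing to search for is the driver's Xid lines in the kernel log portion. Here is a rough Python sketch, assuming the report contains kernel log lines in the usual "NVRM: Xid (PCI:...): <code>, ..." form. The function names are made up for illustration, and the counts can be inflated if the same log appears more than once in the report.

```python
import gzip
import re
from collections import Counter

# Assumed shape of a driver Xid line in the kernel log, for example:
#   NVRM: Xid (PCI:0000:81:00): 79, pid=1234, GPU has fallen off the bus.
XID_LINE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def count_xids(lines):
    """Count (pci_address, xid_code) pairs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XID_LINE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def count_xids_in_report(path):
    """Scan a gzipped bug report; path is wherever the script wrote it."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return count_xids(handle)


# Small in-memory example, no GPU or report file needed.
sample = [
    "kernel: NVRM: Xid (PCI:0000:81:00): 79, pid=1234, GPU has fallen off the bus.",
    "kernel: some unrelated line",
]
for (address, code), seen in count_xids(sample).items():
    print(address, code, seen)
```

Xid 79 is the code the driver logs when a GPU has fallen off the bus, which usually goes together with the "GPU is lost" message. This only summarizes what is in the log; it does not replace sending the full report to support.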

The reasons for swapping the GPU with another GPU are:
1. To make sure the GPU is freshly and fully reseated, and the power cable is not loose.
- If the failure follows the GPU, it is normally the GPU that failed.
2. To put it on a different NVLink (where applicable) and check that the NVLink is properly connected.
3. To check whether the problem is the PCI bus on the motherboard or daughter board.
- If it fails in the same slot, swap the NVLink (if applicable).
- The last resort is replacing the motherboard/PCI slot (if applicable on your system).
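
The swap reasoning above can be written down as a small decision helper. This is only a sketch of the logic described in this post, under the assumption that you recorded which card (for example its GPU UUID) and which slot (its PCI address) were involved in each failure. The names Failure and interpret_swap are hypothetical.

```python
from typing import NamedTuple


class Failure(NamedTuple):
    gpu_uuid: str  # identifies the physical card (recorded while it was healthy)
    bus_id: str    # identifies the slot the card was in when it failed


def interpret_swap(before: Failure, after: Failure) -> str:
    """Compare the failure before the swap with the failure after it."""
    same_card = before.gpu_uuid == after.gpu_uuid
    same_slot = before.bus_id == after.bus_id
    if same_card and not same_slot:
        return "failure followed the card: suspect the GPU"
    if same_slot and not same_card:
        return "failure stayed in the slot: suspect NVLink, then PCI slot/motherboard"
    if same_card and same_slot:
        return "card was not moved: swap it and test again"
    return "different card and different slot: inconclusive"


# Example with made-up identifiers: the same card fails in a new slot.
first = Failure(gpu_uuid="GPU-aaaa", bus_id="0000:81:00.0")
second = Failure(gpu_uuid="GPU-aaaa", bus_id="0000:41:00.0")
print(interpret_swap(first, second))
```

The identifiers in the example are invented. On a real system they would come from your own notes taken before and after the swap.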

On some chassis we look for GPU memory errors (for example, A100s are capable of handling memory remapping, but that requires a reboot). On other chassis there can sometimes be an issue with the GPU cage connection, or the PCI slots are at times replaceable.

So generally it depends on which GPU and which motherboard/chassis you have.

That is why we ask for the nvidia-bug-report after the failure, and again after the swap and a repeat failure.
It is the only way to isolate the problem, unless there are other clues (like the memory errors on A100s).

All the best to you! I think I was talking with you today, but I wanted to let others know as well.

Mark

Hi, I am having the same issue, where the GPU is lost during training. I have rebooted many times and the issue still happens. My server has 8 x Quadro RTX 8000. The link below is the nvidia-bug-report.log.gz from before swapping the GPU: nvidia-bug-report.log.gz - Google Drive. Please assist. Thanks.