Yes:
1. Run ‘sudo nvidia-bug-report.sh’ and collect the nvidia-bug-report.log.gz.
2. Swap the GPU (depending on the system). On servers, make sure it is swapped with
a GPU on the other PLX bus. (We are happy to work with our customers to isolate which PCI slot.)
3. Then run again. If a GPU fails again after the swap, collect another report with ‘sudo nvidia-bug-report.sh’.
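Since the script writes to the same file name each time, it helps to label the report from each stage so the two do not overwrite each other. Below is a minimal sketch of that, assuming the script leaves nvidia-bug-report.log.gz in the directory it is run from. The helper names (labelled_name, collect_report) and the stage labels are made up for illustration and are not part of any NVIDIA tool.

```python
import shutil
import subprocess
from datetime import datetime
from pathlib import Path

# File name the bug-report script is assumed to write into the working directory.
DEFAULT_REPORT = "nvidia-bug-report.log.gz"


def labelled_name(stage: str, when: datetime) -> str:
    """Build a per-stage file name from a stage label and a timestamp."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    return f"nvidia-bug-report.{stage}.{stamp}.log.gz"


def collect_report(stage: str, workdir: Path = Path(".")) -> Path:
    """Run the bug-report script in workdir and rename its output for this stage."""
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(["sudo", "nvidia-bug-report.sh"], cwd=workdir, check=True)
    target = workdir / labelled_name(stage, datetime.now())
    shutil.move(str(workdir / DEFAULT_REPORT), str(target))
    return target
```

You would call collect_report("before-swap") right after the first failure and collect_report("after-swap") after the repeat failure, then attach both files.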
The usual problems vary, depending on the specific hardware and the reason for the GPU failure.
The first ‘nvidia-bug-report.log.gz’ may tell me why it failed.
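If you want a quick look yourself before sending the report, one clue is the driver's Xid messages in the kernel log sections of the report. The sketch below counts them per PCI bus ID. The regular expression is my assumption about how those lines usually look ("NVRM: Xid (PCI:...): <code>, ..."), so adjust it if your driver prints them differently. The function names are hypothetical.

```python
import gzip
import re
from collections import Counter

# Assumed shape of a driver Xid line: "NVRM: Xid (PCI:<bus id>): <code>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def count_xids(lines):
    """Count (pci_bus_id, xid_code) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def count_xids_in_report(path):
    """Open a gzip-compressed bug report and count the Xid events in it."""
    # errors="replace" keeps going if the report contains undecodable bytes.
    with gzip.open(path, "rt", errors="replace") as handle:
        return count_xids(handle)
```

The same kernel message can appear in more than one section of the report, so treat the counts as a hint about which bus ID and Xid code to look at, not as an exact event count.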
The reasons for swapping the GPU with another GPU are:
1. To make sure the GPU is freshly and fully reseated, and that the power cord is not loose.
- If the failure follows the GPU, it is normally the GPU that has failed.
2. To give it a different NVLink (where applicable) and confirm that the NVLink is properly connected.
3. To find out whether the problem is the PCI bus on the motherboard or daughter board.
- If it fails on the same slot, swap the NVLink (if applicable).
- The last resort is a motherboard/PCI slot replacement (if applicable on your system).
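The isolation logic in the list above can be written down as a small decision function. This is only a sketch of the reasoning in this post. The function name, its parameters, and the returned strings are made up for illustration, and real cases often need the bug reports to decide.

```python
def next_step(follows_gpu: bool, has_nvlink: bool, nvlink_swapped: bool) -> str:
    """Suggest the next isolation step after a repeat failure following a GPU swap.

    follows_gpu: the failure moved with the GPU to its new slot.
    has_nvlink: an NVLink bridge is attached to the GPU in the failing slot.
    nvlink_swapped: that NVLink bridge has already been swapped once.
    """
    if follows_gpu:
        # The problem travelled with the card, so the card is the likely cause.
        return "replace GPU"
    if has_nvlink and not nvlink_swapped:
        # Same slot fails with a different GPU: rule out the NVLink bridge next.
        return "swap NVLink"
    # Same slot still fails and NVLink is ruled out or absent: last resort.
    return "inspect or replace motherboard/PCI slot"
```

For example, next_step(False, True, False) suggests swapping the NVLink before anyone considers a motherboard replacement.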
In some chassis we look for GPU memory errors (for example, A100s are capable of memory remapping, but that requires a reboot). In other chassis there can sometimes be an issue with the GPU cage connection, or the PCI slots are at times replaceable.
So generally it depends on which GPU, and which motherboard/chassis you have.
That is why we ask for the nvidia-bug-report after the failure, and again after the swap and a repeat failure.
It is the only way to isolate the problem, unless there are other clues (like the memory errors on A100s).
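For the A100 memory-error clue, a rough way to check for a pending remap is to query nvidia-smi for the remapped-rows fields. The sketch below assumes the --query-remapped-rows option and the remapped_rows.pending and remapped_rows.failure fields are available with your driver. How the values are printed is also an assumption, so both "1" and "Yes" are treated as set. The function names and reason strings are hypothetical.

```python
import csv
import subprocess

FIELDS = "gpu_bus_id,remapped_rows.pending,remapped_rows.failure"
# Assumption: a set flag is printed as one of these (compared case-insensitively).
TRUTHY = {"1", "yes", "true"}


def gpus_needing_attention(csv_text):
    """Map bus ID to a list of reasons for GPUs with a pending or failed remap."""
    flagged = {}
    for row in csv.reader(csv_text.splitlines(), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        bus_id, pending, failure = (cell.strip() for cell in row)
        reasons = []
        if pending.lower() in TRUTHY:
            reasons.append("remap pending, reboot required")
        if failure.lower() in TRUTHY:
            reasons.append("remapping failure reported")
        if reasons:
            flagged[bus_id] = reasons
    return flagged


def query_remapped_rows():
    """Run nvidia-smi and return the flagged GPUs (needs a supported GPU and driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-remapped-rows={FIELDS}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return gpus_needing_attention(result.stdout)
```

GPUs that do not support row remapping will not report usable values here, so an empty result on those cards does not rule out a memory problem.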
All the best to you! I think I was talking with you today, but I wanted to let others know as well.
Hi, I am having the same issue where the GPU is lost during training. I have rebooted many times and the issue still happens. My server has 8 x Quadro RTX 8000. The link below is the nvidia-bug-report.log.gz from before swapping the GPU: nvidia-bug-report.log.gz - Google Drive. Please assist. Thanks.