I have a Lambda server with 2 RTX 3090 GPUs. However, one of the GPUs keeps getting an unknown error during training.
Even though rebooting temporarily recovers this GPU, how can I solve this problem permanently? Does anyone else have this problem?
Running ‘sudo nvidia-bug-report.sh’ will produce an ‘nvidia-bug-report.log.gz’ file, which should be sufficient to see what is failing.
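For example (assuming a standard driver install; the log is written to the current directory):

sudo nvidia-bug-report.sh
# produces nvidia-bug-report.log.gz in the current directory
zgrep -i "xid" nvidia-bug-report.log.gz
# Xid entries in the kernel log usually identify which GPU faulted and the error type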
Then, ideally, physically swap the two GPUs. This is to confirm whether the error follows the card or stays with the PCIe slot/riser.
Normally the GPU itself is at fault, but it is best to confirm before spending time/money on a replacement.
A quick look with ‘nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv’ will show each GPU's UUID (unique identifier) alongside its PCI bus address.
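For example, record the mapping before and after the swap (the bus IDs and UUIDs below are only illustrative placeholders):

nvidia-smi --query-gpu=index,pci.bus_id,uuid --format=csv
# index, pci.bus_id, uuid
# 0, 00000000:01:00.0, GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
# 1, 00000000:21:00.0, GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy

If the error follows the UUID to the other slot, the card itself is the problem; if it stays with the same PCI bus address, suspect the slot, riser, or power cabling instead.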