Tools and libraries (e.g. nvidia-smi, tensorflow, keras, torch) will sometimes detect fewer GPUs than expected (for example, detecting only 3 GPUs when the machine has 4). This occurs if one or more GPUs were above a temperature threshold when the machine was booted. The fix is usually to reboot the machine. If a reboot doesn’t work, shut the machine down for 5 minutes to let all GPUs cool, and then turn it back on.
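To confirm that a GPU is actually missing, you can compare the number of GPUs the driver reports against the number you expect. The following is a minimal sketch, not an official tool: it assumes nvidia-smi is on the PATH and counts the "GPU <n>: ..." lines printed by `nvidia-smi -L`. The EXPECTED_GPUS value and the function names are hypothetical and should be adapted to your machine.

```python
import shutil
import subprocess

# Assumption: the number of GPUs physically installed in this machine.
EXPECTED_GPUS = 4


def count_gpus(listing: str) -> int:
    """Count the 'GPU <n>: ...' lines in the output of `nvidia-smi -L`."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def detected_gpu_count() -> int:
    """Return how many GPUs nvidia-smi lists, or 0 if nvidia-smi is not found."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return count_gpus(result.stdout)


def missing_gpus(expected: int = EXPECTED_GPUS) -> int:
    """Return how many GPUs are missing relative to the expected count."""
    return max(expected - detected_gpu_count(), 0)
```

Calling missing_gpus() after boot returns 0 when every expected GPU is visible. A positive value suggests trying the reboot or cool-down steps described above.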
Why does this happen?
The GPU firmware interprets high temperatures at boot time as a potential cooling issue. To prevent potential heat damage, it will refuse to reveal a GPU to the operating system / motherboard if that GPU is above a certain temperature. Unfortunately, the firmware is quite strict: simply browsing the web and then rebooting the machine is sometimes enough to trigger the issue, let alone rebooting right after a heavy GPU training job.
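Since the problem is triggered by warm GPUs at boot time, it may help to check GPU temperatures before rebooting, especially right after a training job. The sketch below reads temperatures with `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits` and flags GPUs at or above a limit. The firmware's actual threshold is not documented here, so ASSUMED_LIMIT_C is a purely hypothetical placeholder, and the function names are illustrative. Note that this can only report on GPUs that are currently visible to the driver.

```python
import subprocess
from typing import Dict, List

# Hypothetical limit in degrees Celsius. The real firmware threshold is
# not known here; pick a conservative value for your own hardware.
ASSUMED_LIMIT_C = 60


def parse_temperatures(csv_text: str) -> Dict[int, int]:
    """Parse 'index, temperature' CSV lines into {gpu_index: temp_celsius}."""
    temps: Dict[int, int] = {}
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Skip blank lines and rows whose values are not plain integers
        # (for example, a GPU that reports "[N/A]").
        if len(fields) != 2 or not all(f.isdigit() for f in fields):
            continue
        temps[int(fields[0])] = int(fields[1])
    return temps


def hot_gpus(temps: Dict[int, int], limit_c: int = ASSUMED_LIMIT_C) -> List[int]:
    """Return the indices of GPUs at or above the limit, in ascending order."""
    return [index for index, temp in sorted(temps.items()) if temp >= limit_c]


def read_temperatures() -> Dict[int, int]:
    """Query current GPU temperatures via nvidia-smi (must be on the PATH)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_temperatures(result.stdout)
```

If hot_gpus(read_temperatures()) returns a non-empty list, waiting a few minutes before rebooting gives the GPUs time to cool.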