One GPU is not detected on the motherboard after installing the Lambda Stack. I have not changed the BIOS settings, and I am wondering what could be wrong. The other 3 GPUs are working fine.
Problem solved: I shut the system down, left it off for a while, and then booted it again. I’m not sure why this happens, but rebooting directly doesn’t work.
Software (e.g. UEFI, nvidia-smi, TensorFlow, Keras, PyTorch) will sometimes detect fewer GPUs than expected (for example, detecting only 3 GPUs when the machine has 4). This occurs if one or more GPUs were above a specified temperature threshold when the machine was booted.
A simple reboot is often enough to fix this. If a reboot doesn’t work, as in your case, shut the machine down for 5 minutes to let all GPUs cool, and then turn it back on.
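To confirm that every GPU is visible again after the machine comes back up, you can compare the number of GPUs that nvidia-smi reports against the number you expect. The following is a minimal sketch, not an official Lambda tool: the expected count of 4 comes from the example above, and the function names (parse_gpu_rows, missing_gpu_count, report) are made up for illustration. It assumes nvidia-smi is on the PATH.

```python
import subprocess

# Assumption: the machine should expose 4 GPUs, as in the example above.
EXPECTED_GPUS = 4

# Standard nvidia-smi query flags: one CSV row per detected GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Turn nvidia-smi CSV rows into a list of dicts, one per detected GPU."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split off the index first and the temperature last, so a comma
        # inside the GPU name would not break the parse.
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        try:
            temp_c = int(temp.strip())
        except ValueError:
            temp_c = None  # e.g. "[N/A]" when the sensor is unreadable
        gpus.append({"index": int(index), "name": name.strip(), "temp_c": temp_c})
    return gpus


def missing_gpu_count(gpus, expected=EXPECTED_GPUS):
    """Return how many GPUs are missing relative to the expected count."""
    return max(expected - len(gpus), 0)


def report(expected=EXPECTED_GPUS):
    """Run nvidia-smi and print each detected GPU plus any shortfall."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    gpus = parse_gpu_rows(result.stdout)
    for gpu in gpus:
        print(f"GPU {gpu['index']}: {gpu['name']}, {gpu['temp_c']} C")
    missing = missing_gpu_count(gpus, expected)
    if missing:
        print(f"{missing} GPU(s) not detected; shut down, let them cool, then boot again.")
    return gpus
```

Calling report() right after boot lists the detected GPUs with their current temperatures. If it reports a shortfall, the shutdown-and-wait procedure above is the first thing to try.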
Why does this happen?
The GPU firmware interprets high temperatures at boot time as a potential cooling issue. To prevent potential heat damage, it will refuse to reveal a GPU to the operating system and motherboard if that GPU is above a certain temperature. Unfortunately, the firmware is quite strict: simply browsing the web and then rebooting the machine is sometimes enough to trigger the issue, let alone rebooting after a heavy GPU training job.