New Quad: only 3 GPUs detected

We received our new Quad yesterday, powered it up, and set up networking.
However, only 3 of the 4 GPUs are detected.

# nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-401c4e35-ae2c-23ce-06a5-21467cfdfb05)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-e5af8408-f629-455a-df02-c31c1de0ecc6)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-8308c5d5-b95e-aefa-e2e3-da73d354f4d2)

# lspci | grep NVIDIA
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
06:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)
09:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
09:00.1 Audio device: NVIDIA Corporation GP102 HDMI Audio Controller (rev a1)

Any ideas?

  • Jerry

Never mind! :)

Added nouveau.modeset=0 to the kernel command line in GRUB, and now I see all four:
$ nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-401c4e35-ae2c-23ce-06a5-21467cfdfb05)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-e5af8408-f629-455a-df02-c31c1de0ecc6)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-8308c5d5-b95e-aefa-e2e3-da73d354f4d2)
GPU 3: GeForce GTX 1080 Ti (UUID: GPU-e65a265e-ee69-a13e-8b7a-c49e1ba4d963)
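
For anyone repeating this fix, here is a minimal sketch of how the change could be applied and verified. It assumes an Ubuntu-style GRUB setup (/etc/default/grub plus update-grub), which the post above does not confirm; the helper name has_nouveau_modeset_off is made up for illustration.

```shell
# Return success if the given kernel command line contains nouveau.modeset=0.
# (has_nouveau_modeset_off is a hypothetical helper name.)
has_nouveau_modeset_off() {
    case " $1 " in
        *" nouveau.modeset=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumed Ubuntu-style workflow (not confirmed by the post above):
#   1. In /etc/default/grub, append nouveau.modeset=0 to
#      GRUB_CMDLINE_LINUX_DEFAULT
#   2. sudo update-grub
#   3. reboot
# After the reboot, check that the parameter actually reached the kernel:
if [ -r /proc/cmdline ]; then
    if has_nouveau_modeset_off "$(cat /proc/cmdline)"; then
        echo "nouveau.modeset=0 is active"
    else
        echo "nouveau.modeset=0 is NOT on the kernel command line"
    fi
fi
```

If the check reports the parameter as missing after a reboot, the GRUB configuration was probably not regenerated.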


Libraries and tools (e.g. nvidia-smi, tensorflow, keras, torch) will sometimes detect fewer GPUs than expected (for example, only 3 GPUs when the machine has 4). This occurs if one or more GPUs were above a specified temperature threshold when the machine was booted. The fix is simply to reboot the machine. If a reboot doesn't work, shut the machine down for 5 minutes to let all GPUs cool, then turn it back on.
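
To confirm that a GPU is actually missing, it can help to compare how many GPUs appear on the PCI bus with how many the driver reports. The sketch below counts both from the same commands used earlier in this thread; the function names are illustrative, and reading a mismatch as "driver-level" rather than "hardware-level" is an assumption, not a documented diagnostic.

```shell
# Count NVIDIA display controllers in lspci-style output read from stdin.
# Audio functions of the same card are ignored.
count_pci_gpus() {
    grep -i 'NVIDIA' | grep -c -E 'VGA compatible controller|3D controller' || true
}

# Count GPUs in `nvidia-smi -L`-style output read from stdin.
count_driver_gpus() {
    grep -c '^GPU [0-9]' || true
}

# Only run the live comparison when both tools are installed.
if command -v lspci >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    pci=$(lspci | count_pci_gpus)
    drv=$(nvidia-smi -L | count_driver_gpus)
    echo "PCI bus: $pci GPU(s), driver: $drv GPU(s)"
    if [ "$pci" -ne "$drv" ]; then
        echo "Mismatch: the driver sees a different number of GPUs than the PCI bus"
    fi
fi
```

If both counts are lower than the number of cards physically installed, the GPU is not being exposed to the system at all, which is consistent with the boot-time behavior described here.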

Why does this happen?
The GPU firmware interprets high temperatures at boot time as a potential cooling issue. The GPU firmware will refuse to reveal a GPU to the operating system / motherboard if it is above a certain temperature. This is an effort to prevent potential heat damage. Unfortunately the firmware is quite strict, and simply browsing the web and then rebooting the machine is sometimes enough to trigger the issue, let alone rebooting after a heavy GPU training job.
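
Since the temperature at boot is what matters, one option is to check GPU temperatures before rebooting. The sketch below lists GPUs that are hotter than a chosen limit. The limit of 50 C is a placeholder picked for illustration, not the firmware's actual threshold, which is not stated here; hot_gpus is likewise a hypothetical helper name.

```shell
# Print the index of every GPU whose temperature (in C) exceeds the limit.
# Reads "index, temperature" lines from stdin; the limit is the first argument.
hot_gpus() {
    awk -F', *' -v limit="$1" '$2 + 0 > limit + 0 { print $1 }'
}

# Placeholder limit for illustration only; the real firmware threshold is unknown.
LIMIT_C=50

if command -v nvidia-smi >/dev/null 2>&1; then
    hot=$(nvidia-smi --query-gpu=index,temperature.gpu \
                     --format=csv,noheader,nounits | hot_gpus "$LIMIT_C")
    if [ -n "$hot" ]; then
        echo "GPUs above ${LIMIT_C} C, consider waiting before rebooting:"
        echo "$hot"
    else
        echo "All visible GPUs are at or below ${LIMIT_C} C"
    fi
fi
```

Note that this only covers GPUs the driver can currently see; a GPU that is already hidden cannot be queried this way.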