Lambda Vector with 4x Quadro RTX 6000 only sees 3 GPUs

Looks like we may be seeing an IRQ conflict? How would we go about resolving this?

Thanks!

From syslog:

Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881151] genirq: Flags mismatch irq 263. 00000080 (nvidia) vs. 00000000 (nvidia)
Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881186] NVRM: GPU 0000:61:00.0: Tried to get IRQ 263, but another driver
Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881188] NVRM: has it and is not sharing it.
Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881189] NVRM: You may want to verify that no audio driver is using the IRQ.
Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881191] NVRM: GPU 0000:61:00.0: request_irq() failed (-16)
Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881253] genirq: Flags mismatch irq 263. 00000080 (nvidia) vs. 00000000 (nvidia)
Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881267] NVRM: GPU 0000:61:00.0: Tried to get IRQ 263, but another driver
Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881269] NVRM: has it and is not sharing it.
Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881269] NVRM: You may want to verify that no audio driver is using the IRQ.
Aug 24 17:23:03 dc3dsgpu-app001 kernel: [  926.881271] NVRM: GPU 0000:61:00.0: request_irq() failed (-16)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 6000     On   | 00000000:01:00.0 Off |                  Off |
| 33%   28C    P8    13W / 260W |      5MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 6000     On   | 00000000:2E:00.0 Off |                  Off |
| 33%   35C    P8    13W / 260W |      5MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 6000     On   | 00000000:41:00.0 Off |                  Off |
| 34%   38C    P8    11W / 260W |     73MiB / 24217MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2375      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      2375      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      2375      G   /usr/lib/xorg/Xorg                 64MiB |
|    2   N/A  N/A      2494      G   /usr/bin/gnome-shell                6MiB |
+-----------------------------------------------------------------------------+

lspci:

01:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
2e:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
61:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev ff)
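
Note that the card at 61:00.0 reports (rev ff) while the other three report (rev a1). As far as I know, a revision of ff usually means reads from the device's PCI config space are coming back as all ones, i.e. the device is not responding on the bus. In case it helps anyone checking several machines, here is a minimal Python sketch that flags such devices in plain lspci output. The function name find_unresponsive_gpus and the SAMPLE text are made up for illustration, and the pattern assumes the default lspci format without a domain prefix.

```python
import re

# Matches default lspci lines for NVIDIA devices, for example:
# "61:00.0 VGA compatible controller: NVIDIA Corporation ... (rev ff)"
# Assumption: no PCI domain prefix (i.e. lspci was run without -D).
LSPCI_LINE = re.compile(
    r"^(?P<addr>[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+.*NVIDIA.*"
    r"\(rev (?P<rev>[0-9a-fA-F]{2})\)\s*$"
)


def find_unresponsive_gpus(lspci_text):
    """Return the PCI addresses of NVIDIA devices that report revision ff."""
    flagged = []
    for line in lspci_text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match and match.group("rev").lower() == "ff":
            flagged.append(match.group("addr"))
    return flagged


# Sample copied from the lspci output in this post.
SAMPLE = """\
01:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
2e:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev a1)
61:00.0 VGA compatible controller: NVIDIA Corporation TU102GL [Quadro RTX 6000/8000] (rev ff)
"""

if __name__ == "__main__":
    print(find_unresponsive_gpus(SAMPLE))  # prints ['61:00.0']
```

On a live system the same function could be fed the text captured from running lspci, rather than the pasted sample.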

Rebooted and disabled the Bluetooth controller, Wi-Fi controller, and audio controller, which we don’t really need. It looks like the fourth GPU is still not coming online. While in the BIOS, I also checked and saw that all the PCI slots were set to x16.

Aug 25 16:38:06 dc3dsgpu-app001 kernel: [    6.671532] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.63.01  Tue Aug  3 20:44:16 UTC 2021
Aug 25 16:38:07 dc3dsgpu-app001 kernel: [    8.004153] NVRM: GPU at PCI:0000:61:00: GPU-f14244ba-2a85-4182-2410-4f8477f57d89
Aug 25 16:38:07 dc3dsgpu-app001 kernel: [    8.004156] NVRM: Xid (PCI:0000:61:00): 79, pid=2166, GPU has fallen off the bus.
Aug 25 16:38:07 dc3dsgpu-app001 kernel: [    8.004158] NVRM: GPU 0000:61:00.0: GPU has fallen off the bus.
Aug 25 16:38:07 dc3dsgpu-app001 kernel: [    8.004174] NVRM: GPU 0000:61:00.0: GPU serial number is 1322820062618.
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283638] NVRM: GPU 0000:61:00.0: Tried to get IRQ 263, but another driver
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283639] NVRM: has it and is not sharing it.
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283640] NVRM: You may want to verify that no audio driver is using the IRQ.
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283640] NVRM: GPU 0000:61:00.0: request_irq() failed (-16)
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283741] NVRM: GPU 0000:61:00.0: Tried to get IRQ 263, but another driver
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283742] NVRM: has it and is not sharing it.
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283743] NVRM: You may want to verify that no audio driver is using the IRQ.
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283743] NVRM: GPU 0000:61:00.0: request_irq() failed (-16)
Aug 25 16:38:28 dc3dsgpu-app001 kernel: [   29.305870] NVRM: GPU 0000:61:00.0: Tried to get IRQ 263, but another driver
Aug 25 16:38:28 dc3dsgpu-app001 kernel: [   29.305873] NVRM: has it and is not sharing it.
Aug 25 16:38:28 dc3dsgpu-app001 kernel: [   29.305874] NVRM: You may want to verify that no audio driver is using the IRQ.
Aug 25 16:38:28 dc3dsgpu-app001 kernel: [   29.305876] NVRM: GPU 0000:61:00.0: request_irq() failed (-16)
Aug 25 16:38:28 dc3dsgpu-app001 kernel: [   29.306009] NVRM: GPU 0000:61:00.0: Tried to get IRQ 263, but another driver
Aug 25 16:38:28 dc3dsgpu-app001 kernel: [   29.306011] NVRM: has it and is not sharing it.
Aug 25 16:38:28 dc3dsgpu-app001 kernel: [   29.306012] NVRM: You may want to verify that no audio driver is using the IRQ.
Aug 25 16:38:28 dc3dsgpu-app001 kernel: [   29.306014] NVRM: GPU 0000:61:00.0: request_irq() failed (-16)
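
The second boot log shows two different symptoms for the same card: an Xid 79 ("GPU has fallen off the bus") at about 8 seconds, followed by repeated request_irq() failures with -16, which is EBUSY on Linux. Since the Xid comes first, the IRQ messages may be a follow-on symptom rather than the root cause, though that is a guess on my part. Below is a rough Python sketch that groups these NVRM messages per PCI device so the ordering and counts are easier to see. The names summarize_nvrm and SAMPLE_LOG are hypothetical, and the patterns only cover the two message shapes that appear in this post.

```python
import errno
import re

# Only the two NVRM message shapes seen in this post are handled.
XID = re.compile(r"NVRM: Xid \(PCI:(?P<addr>[0-9a-fA-F:]+)\): (?P<xid>\d+),")
IRQ_FAIL = re.compile(
    r"NVRM: GPU (?P<addr>[0-9a-fA-F:.]+): request_irq\(\) failed \((?P<err>-?\d+)\)"
)


def device_key(addr):
    """Drop the PCI function suffix so 0000:61:00.0 and 0000:61:00 match."""
    return addr.split(".")[0]


def summarize_nvrm(log_text):
    """Collect Xid codes and request_irq() failures per PCI device."""
    summary = {}
    for line in log_text.splitlines():
        xid = XID.search(line)
        irq = IRQ_FAIL.search(line)
        found = xid or irq
        if not found:
            continue
        entry = summary.setdefault(
            device_key(found.group("addr")),
            {"xids": [], "irq_failures": 0, "irq_errors": []},
        )
        if xid:
            entry["xids"].append(int(xid.group("xid")))
        else:
            entry["irq_failures"] += 1
            # Kernel return codes are negative errno values, so -16 maps to 16.
            name = errno.errorcode.get(abs(int(irq.group("err"))), "unknown")
            if name not in entry["irq_errors"]:
                entry["irq_errors"].append(name)
    return summary


# Three lines copied from the Aug 25 log in this post.
SAMPLE_LOG = """\
Aug 25 16:38:07 dc3dsgpu-app001 kernel: [    8.004156] NVRM: Xid (PCI:0000:61:00): 79, pid=2166, GPU has fallen off the bus.
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283640] NVRM: GPU 0000:61:00.0: request_irq() failed (-16)
Aug 25 16:38:27 dc3dsgpu-app001 kernel: [   28.283743] NVRM: GPU 0000:61:00.0: request_irq() failed (-16)
"""

if __name__ == "__main__":
    for device, info in summarize_nvrm(SAMPLE_LOG).items():
        print(device, info)
```

Feeding it the full syslog instead of the three sample lines would show whether any of the other three cards ever logs an Xid or an IRQ failure.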