PyTorch Data Parallel: Unexplained System Crash on Lambda Workstation

I am seeing an issue with PyTorch on one of our relatively new 4-GPU workstations (RTX 2080 Tis): using torch.nn.DataParallel on more than 2 GPUs causes the machine to restart at a random point during training.
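To check whether the restart depends on which GPUs are combined, something like the following sketch could run a short DataParallel loop on every subset of devices. The model, tensor sizes, and the helper names `device_subsets`, `stress`, and `run_all` are made up for illustration; this is not the training code that actually crashes.

```python
import itertools


def device_subsets(num_gpus, min_size=2):
    """Return every combination of GPU indices with at least min_size members."""
    subsets = []
    for size in range(min_size, num_gpus + 1):
        subsets.extend(itertools.combinations(range(num_gpus), size))
    return subsets


def stress(device_ids, steps=200, batch=256):
    """Run a small forward/backward loop under DataParallel on the given GPUs."""
    import torch  # imported lazily so device_subsets works without PyTorch
    import torch.nn as nn

    primary = torch.device(f"cuda:{device_ids[0]}")
    # Toy model (an assumption); DataParallel expects its parameters on device_ids[0].
    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10))
    model = nn.DataParallel(model.to(primary), device_ids=list(device_ids))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(steps):
        x = torch.randn(batch, 4096, device=primary)
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize(primary)


def run_all():
    """Stress every GPU subset in turn, announcing each one before it starts."""
    import torch

    for subset in device_subsets(torch.cuda.device_count()):
        # Print first so the last line on screen names the subset in use at the crash.
        print("testing GPUs", subset, flush=True)
        stress(subset)
```

Calling `run_all()` on the workstation walks through the pairs first, then the triples, then all four GPUs.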

The closest thread I could find was this, which leads me to believe it might be a hardware issue?

gpuburn runs fine; however, running the ryujaehun/pytorch-gpu-benchmark suite from GitHub leads to a crash.

There is nothing in the kernel logs, no sign that the OOM killer has terminated any task, GPU memory usage is barely over 40%, and monitoring power draw showed nothing unusual.
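For reference, monitoring of this kind can be sketched as below: poll nvidia-smi and sync every sample to disk so the last reading survives a hard reset. The file name, the 0.5 s interval, and the helper names are assumptions, and polling at this rate may well miss short power spikes.

```python
import os
import subprocess
import time

QUERY = "index,power.draw,temperature.gpu,memory.used,memory.total"


def parse_samples(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Assumes every field is numeric; a GPU reporting [N/A] would need extra handling.
        index, power, temp, used, total = [f.strip() for f in line.split(",")]
        rows.append({
            "gpu": int(index),
            "power_w": float(power),
            "temp_c": float(temp),
            "mem_frac": float(used) / float(total),
        })
    return rows


def log_forever(path="gpu_log.csv", interval=0.5):
    """Append one line per GPU every `interval` seconds, syncing to disk each time."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    with open(path, "a") as log:
        while True:
            out = subprocess.run(cmd, stdout=subprocess.PIPE,
                                 universal_newlines=True, check=True).stdout
            now = time.time()
            for row in parse_samples(out):
                log.write(f"{now:.2f},{row['gpu']},{row['power_w']},"
                          f"{row['temp_c']},{row['mem_frac']:.3f}\n")
            log.flush()
            os.fsync(log.fileno())  # force the data to disk before the next sample
            time.sleep(interval)
```

Running `log_forever()` in a second terminal while training leaves a CSV whose final lines show the state of each GPU shortly before the restart.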

I tested PyTorch 1.6 and 1.7, CUDA 10.0, 10.2, and 11.2 (with CUDA_HOME set), and Python 3.6, 3.7, and 3.9. Upgrading the NVIDIA driver did not help.

Any idea what could be the issue?

Thank you!

Here is the output of lspci:

00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 04)
00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.3 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.4 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.5 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.6 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:04.7 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 04)
00:05.0 System peripheral: Intel Corporation Sky Lake-E MM/Vt-d Configuration Registers (rev 04)
00:05.2 System peripheral: Intel Corporation Device 2025 (rev 04)
00:05.4 PIC: Intel Corporation Device 2026 (rev 04)
00:08.0 System peripheral: Intel Corporation Sky Lake-E Ubox Registers (rev 04)
00:08.1 Performance counters: Intel Corporation Sky Lake-E Ubox Registers (rev 04)
00:08.2 System peripheral: Intel Corporation Sky Lake-E Ubox Registers (rev 04)
00:14.0 USB controller: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller
00:14.2 Signal processing controller: Intel Corporation 200 Series PCH Thermal Subsystem
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME HECI #1
00:17.0 SATA controller: Intel Corporation 200 Series PCH SATA controller [AHCI mode]
00:1b.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #17 (rev f0)
00:1b.4 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #21 (rev f0)
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #9 (rev f0)
00:1d.2 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #11 (rev f0)
00:1f.0 ISA bridge: Intel Corporation X299 Chipset LPC/eSPI Controller
00:1f.2 Memory controller: Intel Corporation 200 Series/Z370 Chipset Family Power Management Controller
00:1f.3 Audio device: Intel Corporation 200 Series PCH HD Audio
00:1f.4 SMBus: Intel Corporation 200 Series/Z370 Chipset Family SMBus Controller
16:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
16:05.0 System peripheral: Intel Corporation Device 2034 (rev 04)
16:05.2 System peripheral: Intel Corporation Sky Lake-E RAS Configuration Registers (rev 04)
16:05.4 PIC: Intel Corporation Device 2036 (rev 04)
16:08.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:08.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:09.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0a.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0a.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0e.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.4 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.5 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.6 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:0f.7 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:10.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:10.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1d.0 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1d.1 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1d.2 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1d.3 System peripheral: Intel Corporation Sky Lake-E CHA Registers (rev 04)
16:1e.0 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.1 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.2 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.3 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.4 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.5 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
16:1e.6 System peripheral: Intel Corporation Sky Lake-E PCU Registers (rev 04)
17:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
18:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
18:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
19:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
19:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
19:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
19:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
1a:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
1a:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
1a:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
1a:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
64:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
64:05.0 System peripheral: Intel Corporation Device 2034 (rev 04)
64:05.2 System peripheral: Intel Corporation Sky Lake-E RAS Configuration Registers (rev 04)
64:05.4 PIC: Intel Corporation Device 2036 (rev 04)
64:08.0 System peripheral: Intel Corporation Device 2066 (rev 04)
64:09.0 System peripheral: Intel Corporation Device 2066 (rev 04)
64:0a.0 System peripheral: Intel Corporation Device 2040 (rev 04)
64:0a.1 System peripheral: Intel Corporation Device 2041 (rev 04)
64:0a.2 System peripheral: Intel Corporation Device 2042 (rev 04)
64:0a.3 System peripheral: Intel Corporation Device 2043 (rev 04)
64:0a.4 System peripheral: Intel Corporation Device 2044 (rev 04)
64:0a.5 System peripheral: Intel Corporation Device 2045 (rev 04)
64:0a.6 System peripheral: Intel Corporation Device 2046 (rev 04)
64:0a.7 System peripheral: Intel Corporation Device 2047 (rev 04)
64:0b.0 System peripheral: Intel Corporation Device 2048 (rev 04)
64:0b.1 System peripheral: Intel Corporation Device 2049 (rev 04)
64:0b.2 System peripheral: Intel Corporation Device 204a (rev 04)
64:0b.3 System peripheral: Intel Corporation Device 204b (rev 04)
64:0c.0 System peripheral: Intel Corporation Device 2040 (rev 04)
64:0c.1 System peripheral: Intel Corporation Device 2041 (rev 04)
64:0c.2 System peripheral: Intel Corporation Device 2042 (rev 04)
64:0c.3 System peripheral: Intel Corporation Device 2043 (rev 04)
64:0c.4 System peripheral: Intel Corporation Device 2044 (rev 04)
64:0c.5 System peripheral: Intel Corporation Device 2045 (rev 04)
64:0c.6 System peripheral: Intel Corporation Device 2046 (rev 04)
64:0c.7 System peripheral: Intel Corporation Device 2047 (rev 04)
64:0d.0 System peripheral: Intel Corporation Device 2048 (rev 04)
64:0d.1 System peripheral: Intel Corporation Device 2049 (rev 04)
64:0d.2 System peripheral: Intel Corporation Device 204a (rev 04)
64:0d.3 System peripheral: Intel Corporation Device 204b (rev 04)
65:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
66:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
66:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
67:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
67:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
67:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
67:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
68:00.0 VGA compatible controller: NVIDIA Corporation Device 1e04 (rev a1)
68:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
68:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
68:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
b2:03.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port D (rev 04)
b2:05.0 System peripheral: Intel Corporation Device 2034 (rev 04)
b2:05.2 System peripheral: Intel Corporation Sky Lake-E RAS Configuration Registers (rev 04)
b2:05.4 PIC: Intel Corporation Device 2036 (rev 04)
b2:12.0 Performance counters: Intel Corporation Sky Lake-E M3KTI Registers (rev 04)
b2:12.1 Performance counters: Intel Corporation Sky Lake-E M3KTI Registers (rev 04)
b2:12.2 System peripheral: Intel Corporation Sky Lake-E M3KTI Registers (rev 04)
b2:15.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b2:16.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b2:16.4 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b2:17.0 System peripheral: Intel Corporation Sky Lake-E M2PCI Registers (rev 04)
b3:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
b3:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
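
Since the listing is long, here is a small sketch that pulls out the entries most relevant to the multi-GPU setup: the NVIDIA VGA controllers (19:00.0, 1a:00.0, 67:00.0, 68:00.0 above) and the PEX 8747 bridge entries. The function names are made up for illustration. The flat listing does not show which GPU hangs off which switch; `lspci -tv` prints the actual tree.

```python
import re

# Matches lines such as "19:00.0 VGA compatible controller: NVIDIA Corporation ..."
LINE = re.compile(r"^([0-9a-f]{2}):([0-9a-f]{2})\.([0-7]) ([^:]+): (.*)$")


def parse_lspci(text):
    """Turn plain `lspci` output into a list of device dicts."""
    devices = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            bus, dev, fn, cls, desc = match.groups()
            devices.append({"addr": f"{bus}:{dev}.{fn}", "class": cls, "desc": desc})
    return devices


def gpus_and_switches(text):
    """Return (GPU addresses, PEX 8747 bridge addresses) found in the listing."""
    devices = parse_lspci(text)
    gpus = [d["addr"] for d in devices
            if d["class"] == "VGA compatible controller" and "NVIDIA" in d["desc"]]
    switches = [d["addr"] for d in devices if "PEX 8747" in d["desc"]]
    return gpus, switches
```

Feeding it the text above (for example from `subprocess.run(["lspci"], ...)`) gives the two address lists side by side.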