NVLink seems to prevent PyTorch training loop from starting

I have a Lambda server with eight A6000 GPUs and NVLink.

When I use PyTorch’s DistributedDataParallel (DDP) to train my models on two GPUs, NVLink is used successfully according to the performance counters. However, as soon as I increase the number of GPUs to three or more (up to eight), the training loop hangs at the very beginning (possibly at the first backward pass).

If I set the environment variable NCCL_P2P_DISABLE=1, I can use as many GPUs as I like, but I obviously don’t get the benefits of NVLink.

If I set NCCL_DEBUG=INFO, I get the following output:

kaveh:746281:746281 [0] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746281:746281 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746281:746281 [0] NCCL INFO NET/IB : No device found.
kaveh:746281:746281 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746281:746281 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
kaveh:746286:746286 [5] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746290:746290 [7] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746282:746282 [1] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746288:746288 [6] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746285:746285 [4] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746283:746283 [2] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746290:746290 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746286:746286 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746288:746288 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746282:746282 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746285:746285 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746283:746283 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746290:746290 [7] NCCL INFO NET/IB : No device found.
kaveh:746283:746283 [2] NCCL INFO NET/IB : No device found.
kaveh:746288:746288 [6] NCCL INFO NET/IB : No device found.
kaveh:746282:746282 [1] NCCL INFO NET/IB : No device found.
kaveh:746286:746286 [5] NCCL INFO NET/IB : No device found.
kaveh:746285:746285 [4] NCCL INFO NET/IB : No device found.
kaveh:746290:746290 [7] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746290:746290 [7] NCCL INFO Using network Socket
kaveh:746286:746286 [5] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746286:746286 [5] NCCL INFO Using network Socket
kaveh:746285:746285 [4] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746288:746288 [6] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746282:746282 [1] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746283:746283 [2] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746285:746285 [4] NCCL INFO Using network Socket
kaveh:746288:746288 [6] NCCL INFO Using network Socket
kaveh:746282:746282 [1] NCCL INFO Using network Socket
kaveh:746283:746283 [2] NCCL INFO Using network Socket
kaveh:746284:746284 [3] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746284:746284 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746284:746284 [3] NCCL INFO NET/IB : No device found.
kaveh:746284:746284 [3] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746284:746284 [3] NCCL INFO Using network Socket
kaveh:746285:746447 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 7/-1/-1->4->5 [3] 7/-1/-1->4->5 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 7/-1/-1->4->5 [7] 7/-1/-1->4->5
kaveh:746285:746447 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
kaveh:746286:746446 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 4/-1/-1->5->2 [3] 4/-1/-1->5->2 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 4/-1/-1->5->2 [7] 4/-1/-1->5->2
kaveh:746286:746446 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
kaveh:746288:746448 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 1/-1/-1->6->7 [3] 1/-1/-1->6->7 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 1/-1/-1->6->7 [7] 1/-1/-1->6->7
kaveh:746288:746448 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
kaveh:746290:746445 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] 6/-1/-1->7->4 [3] 6/-1/-1->7->4 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] 6/-1/-1->7->4 [7] 6/-1/-1->7->4
kaveh:746290:746445 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
kaveh:746281:746444 [0] NCCL INFO Channel 00/08 :    0   1   2   3   4   5   6   7
kaveh:746281:746444 [0] NCCL INFO Channel 01/08 :    0   1   2   3   4   5   6   7
kaveh:746281:746444 [0] NCCL INFO Channel 02/08 :    0   3   2   5   4   7   6   1
kaveh:746281:746444 [0] NCCL INFO Channel 03/08 :    0   3   2   5   4   7   6   1
kaveh:746281:746444 [0] NCCL INFO Channel 04/08 :    0   1   2   3   4   5   6   7
kaveh:746281:746444 [0] NCCL INFO Channel 05/08 :    0   1   2   3   4   5   6   7
kaveh:746281:746444 [0] NCCL INFO Channel 06/08 :    0   3   2   5   4   7   6   1
kaveh:746281:746444 [0] NCCL INFO Channel 07/08 :    0   3   2   5   4   7   6   1
kaveh:746282:746449 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] -1/-1/-1->1->6 [3] -1/-1/-1->1->6 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] -1/-1/-1->1->6 [7] -1/-1/-1->1->6
kaveh:746281:746444 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 3/-1/-1->0->-1 [3] 3/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 3/-1/-1->0->-1 [7] 3/-1/-1->0->-1
kaveh:746283:746450 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 5/-1/-1->2->3 [3] 5/-1/-1->2->3 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 5/-1/-1->2->3 [7] 5/-1/-1->2->3
kaveh:746284:746451 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 2/-1/-1->3->0 [3] 2/-1/-1->3->0 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 2/-1/-1->3->0 [7] 2/-1/-1->3->0
kaveh:746282:746449 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
kaveh:746281:746444 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
kaveh:746283:746450 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
kaveh:746284:746451 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
kaveh:746285:746447 [4] NCCL INFO Channel 00 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 00 : 5[a1000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 00 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 00 : 7[e1000] -> 0[1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 01 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 00 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 01 : 5[a1000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 01 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 01 : 7[e1000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 00 : 3[61000] -> 4[81000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 00 : 0[1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 04 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 01 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 00 : 1[25000] -> 2[41000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 04 : 5[a1000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 04 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 04 : 7[e1000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 01 : 3[61000] -> 4[81000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 01 : 0[1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 05 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 04 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 01 : 1[25000] -> 2[41000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 05 : 5[a1000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 05 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 05 : 7[e1000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 04 : 3[61000] -> 4[81000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 04 : 0[1000] -> 1[25000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 05 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 04 : 1[25000] -> 2[41000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 05 : 0[1000] -> 1[25000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 05 : 3[61000] -> 4[81000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 05 : 1[25000] -> 2[41000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 02 : 0[1000] -> 3[61000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 03 : 0[1000] -> 3[61000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 02 : 2[41000] -> 5[a1000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 06 : 0[1000] -> 3[61000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 03 : 2[41000] -> 5[a1000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 07 : 0[1000] -> 3[61000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 06 : 2[41000] -> 5[a1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 07 : 2[41000] -> 5[a1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 02 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 03 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 02 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 02 : 6[c1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 02 : 4[81000] -> 7[e1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 06 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 03 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 03 : 6[c1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 03 : 4[81000] -> 7[e1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 07 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 06 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 06 : 6[c1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 06 : 4[81000] -> 7[e1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 07 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 07 : 6[c1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 07 : 4[81000] -> 7[e1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Connected all rings
kaveh:746284:746451 [3] NCCL INFO Connected all rings
kaveh:746283:746450 [2] NCCL INFO Channel 02 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 03 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 02 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 06 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 03 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 07 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 06 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 02 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 07 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 03 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 06 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 02 : 3[61000] -> 0[1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Connected all rings
kaveh:746286:746446 [5] NCCL INFO Connected all rings
kaveh:746282:746449 [1] NCCL INFO Channel 07 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 03 : 3[61000] -> 0[1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 02 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 06 : 3[61000] -> 0[1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 03 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 07 : 3[61000] -> 0[1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Connected all rings
kaveh:746290:746445 [7] NCCL INFO Connected all rings
kaveh:746281:746444 [0] NCCL INFO Connected all rings
kaveh:746282:746449 [1] NCCL INFO Connected all rings
kaveh:746285:746447 [4] NCCL INFO Channel 06 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 02 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 02 : 1[25000] -> 6[c1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 07 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 03 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 03 : 1[25000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 06 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 06 : 1[25000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 07 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 07 : 1[25000] -> 6[c1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 02 : 5[a1000] -> 2[41000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 03 : 5[a1000] -> 2[41000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 06 : 5[a1000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 02 : 7[e1000] -> 4[81000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 07 : 5[a1000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 03 : 7[e1000] -> 4[81000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 06 : 7[e1000] -> 4[81000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 07 : 7[e1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 00 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 00 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 01 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 01 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 00 : 2[41000] -> 1[25000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 00 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 04 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 04 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 01 : 2[41000] -> 1[25000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 01 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 05 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 05 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 04 : 2[41000] -> 1[25000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 04 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 00 : 4[81000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 00 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 05 : 2[41000] -> 1[25000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 00 : 6[c1000] -> 5[a1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 05 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 01 : 4[81000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 01 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 01 : 6[c1000] -> 5[a1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 04 : 4[81000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 04 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 04 : 6[c1000] -> 5[a1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 05 : 4[81000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 05 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 05 : 6[c1000] -> 5[a1000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Connected all trees
kaveh:746281:746444 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746281:746444 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746290:746445 [7] NCCL INFO Connected all trees
kaveh:746290:746445 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746290:746445 [7] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746283:746450 [2] NCCL INFO Connected all trees
kaveh:746283:746450 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746283:746450 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746285:746447 [4] NCCL INFO Connected all trees
kaveh:746285:746447 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746285:746447 [4] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746282:746449 [1] NCCL INFO Connected all trees
kaveh:746282:746449 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746282:746449 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746284:746451 [3] NCCL INFO Connected all trees
kaveh:746284:746451 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746284:746451 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746286:746446 [5] NCCL INFO Connected all trees
kaveh:746286:746446 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746286:746446 [5] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746288:746448 [6] NCCL INFO Connected all trees
kaveh:746288:746448 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746288:746448 [6] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746286:746446 [5] NCCL INFO comm 0x7fc508002fb0 rank 5 nranks 8 cudaDev 5 busId a1000 - Init COMPLETE
kaveh:746283:746450 [2] NCCL INFO comm 0x7f6138002fb0 rank 2 nranks 8 cudaDev 2 busId 41000 - Init COMPLETE
kaveh:746282:746449 [1] NCCL INFO comm 0x7f94a8002fb0 rank 1 nranks 8 cudaDev 1 busId 25000 - Init COMPLETE
kaveh:746288:746448 [6] NCCL INFO comm 0x7f5170002fb0 rank 6 nranks 8 cudaDev 6 busId c1000 - Init COMPLETE
kaveh:746285:746447 [4] NCCL INFO comm 0x7fa414002fb0 rank 4 nranks 8 cudaDev 4 busId 81000 - Init COMPLETE
kaveh:746290:746445 [7] NCCL INFO comm 0x7f336c002fb0 rank 7 nranks 8 cudaDev 7 busId e1000 - Init COMPLETE
kaveh:746284:746451 [3] NCCL INFO comm 0x7f40e4002fb0 rank 3 nranks 8 cudaDev 3 busId 61000 - Init COMPLETE
kaveh:746281:746444 [0] NCCL INFO comm 0x7f79bc002fb0 rank 0 nranks 8 cudaDev 0 busId 1000 - Init COMPLETE
kaveh:746281:746281 [0] NCCL INFO Launch mode Parallel

I was wondering if anyone has had a similar issue and knows how to resolve it.
Thanks!

If it is getting ‘stuck’ or ‘hanging’, do you have AMD CPUs? And is your kernel command line set with iommu=soft?

Another issue I have seen was across hosts, where processes were still running on the other GPUs. Sometimes it takes a little while for the previous job to clean up (run nvidia-smi to see what is running on the GPUs).

I do not use NCCL_P2P_DISABLE=1.
I normally run the DDP/NCCL test with:
NCCL_DEBUG=INFO NCCL_ALGO=Ring NCCL_NET_GDR_LEVEL=4 python …

To see your nvlink topology:
$ nvidia-smi topo -m
And to see the nvlink status:
$ nvidia-smi nvlink -s

You can use ‘CUDA_VISIBLE_DEVICES=0,1’ to run on the first two GPUs or ‘CUDA_VISIBLE_DEVICES=1,2’ to run on the second and third devices (if you are trying to compare).

You can see that with:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.0-37-generic root=UUID=… ro quiet splash iommu=soft

or something similar.
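If you want to check this programmatically (e.g. at the top of a training script), a small helper like the following works; the function name is my own, not something from the thread:

```python
def has_iommu_soft(cmdline: str) -> bool:
    """Return True if 'iommu=soft' appears as a kernel boot parameter.

    Pass in the contents of /proc/cmdline; parameters are
    whitespace-separated, so split before matching to avoid
    false positives on substrings.
    """
    return "iommu=soft" in cmdline.split()

# Example usage:
# with open("/proc/cmdline") as f:
#     print("iommu=soft set:", has_iommu_soft(f.read()))
```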

To fix it you can:
$ sudo sed -Ei '/^GRUB_CMDLINE.*DEFAULT/ s/"$/ iommu=soft"/' /etc/default/grub
$ sudo update-grub

Then reboot.
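If you prefer not to run sed directly on /etc/default/grub, the same edit can be sketched in Python (a minimal, idempotent version of the one-liner above; the function name is mine):

```python
import re

def add_iommu_soft(grub_text: str) -> str:
    """Append ' iommu=soft' inside the quotes of GRUB_CMDLINE_LINUX_DEFAULT.

    Mirrors the sed one-liner above, but leaves the line
    unchanged if iommu=soft is already present.
    """
    def repl(m: "re.Match[str]") -> str:
        line = m.group(0)
        if "iommu=soft" in line:
            return line  # already set, do nothing
        # Drop the closing quote, append the parameter, re-add the quote.
        return line[:-1] + ' iommu=soft"'

    return re.sub(r'^GRUB_CMDLINE_LINUX_DEFAULT=".*"$', repl,
                  grub_text, flags=re.M)

# Example usage (then run update-grub and reboot):
# text = open("/etc/default/grub").read()
# open("/etc/default/grub", "w").write(add_iommu_soft(text))
```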


Thank you for your detailed explanation. I should have mentioned that I use AMD CPUs and I don’t see this issue when using two GPUs.

The problem was indeed resolved by setting iommu=soft.
Given your explanation, I found a similar solution that relied on setting pci=noats, but it didn’t work for me.
It appears that there is currently no way to use the hardware IOMMU to get better performance.

Ironically, I don’t get a noticeable speedup when setting iommu=soft and NCCL_DEBUG=INFO NCCL_ALGO=Ring NCCL_NET_GDR_LEVEL=4. Here is the output of nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV4     NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU1    NV4      X      NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0
GPU2    NODE    NODE     X      NV4     SYS     SYS     SYS     SYS     0-63,128-191    0
GPU3    NODE    NODE    NV4      X      SYS     SYS     SYS     SYS     0-63,128-191    0
GPU4    SYS     SYS     SYS     SYS      X      NV4     NODE    NODE    64-127,192-254  1
GPU5    SYS     SYS     SYS     SYS     NV4      X      NODE    NODE    64-127,192-254  1
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NV4     64-127,192-254  1
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NV4      X      64-127,192-254  1