I have a Lambda server with eight A6000 GPUs and NVLink.
When I use PyTorch’s DistributedDataParallel (DDP) to train my models on two GPUs, NVLink is used successfully, according to the performance counters. However, as soon as I increase the number of GPUs to three or more (up to eight), the training loop hangs at the very beginning, possibly during the first backward pass.
If I set the environment variable NCCL_P2P_DISABLE=1, I can use as many GPUs as I like, but then I obviously lose the benefits of NVLink.
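For reference, here is one way to apply this workaround from inside the training script rather than from the shell. It is only a minimal sketch: the helper name disable_nccl_p2p is made up for illustration, and it assumes the variable is set before the NCCL process group is created, since as far as I understand NCCL reads its environment when the communicator is initialized.

```python
import os


def disable_nccl_p2p() -> None:
    """Turn off NCCL peer-to-peer transport for this process (workaround only).

    Hypothetical helper: the name is illustrative, not part of PyTorch or NCCL.
    """
    os.environ["NCCL_P2P_DISABLE"] = "1"


# Call this before the process group is created, e.g. before
# torch.distributed.init_process_group(backend="nccl") in the training script.
disable_nccl_p2p()
```

Setting the variable on the command line when launching the script should have the same effect.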
If I set NCCL_DEBUG=INFO, I get the following output:
kaveh:746281:746281 [0] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746281:746281 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746281:746281 [0] NCCL INFO NET/IB : No device found.
kaveh:746281:746281 [0] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746281:746281 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
kaveh:746286:746286 [5] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746290:746290 [7] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746282:746282 [1] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746288:746288 [6] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746285:746285 [4] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746283:746283 [2] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746290:746290 [7] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746286:746286 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746288:746288 [6] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746282:746282 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746285:746285 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746283:746283 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746290:746290 [7] NCCL INFO NET/IB : No device found.
kaveh:746283:746283 [2] NCCL INFO NET/IB : No device found.
kaveh:746288:746288 [6] NCCL INFO NET/IB : No device found.
kaveh:746282:746282 [1] NCCL INFO NET/IB : No device found.
kaveh:746286:746286 [5] NCCL INFO NET/IB : No device found.
kaveh:746285:746285 [4] NCCL INFO NET/IB : No device found.
kaveh:746290:746290 [7] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746290:746290 [7] NCCL INFO Using network Socket
kaveh:746286:746286 [5] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746286:746286 [5] NCCL INFO Using network Socket
kaveh:746285:746285 [4] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746288:746288 [6] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746282:746282 [1] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746283:746283 [2] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746285:746285 [4] NCCL INFO Using network Socket
kaveh:746288:746288 [6] NCCL INFO Using network Socket
kaveh:746282:746282 [1] NCCL INFO Using network Socket
kaveh:746283:746283 [2] NCCL INFO Using network Socket
kaveh:746284:746284 [3] NCCL INFO Bootstrap : Using eno1:10.136.200.67<0>
kaveh:746284:746284 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
kaveh:746284:746284 [3] NCCL INFO NET/IB : No device found.
kaveh:746284:746284 [3] NCCL INFO NET/Socket : Using [0]eno1:10.136.200.67<0>
kaveh:746284:746284 [3] NCCL INFO Using network Socket
kaveh:746285:746447 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 7/-1/-1->4->5 [3] 7/-1/-1->4->5 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 7/-1/-1->4->5 [7] 7/-1/-1->4->5
kaveh:746285:746447 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
kaveh:746286:746446 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 4/-1/-1->5->2 [3] 4/-1/-1->5->2 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 4/-1/-1->5->2 [7] 4/-1/-1->5->2
kaveh:746286:746446 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
kaveh:746288:746448 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 1/-1/-1->6->7 [3] 1/-1/-1->6->7 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 1/-1/-1->6->7 [7] 1/-1/-1->6->7
kaveh:746288:746448 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
kaveh:746290:746445 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] 6/-1/-1->7->4 [3] 6/-1/-1->7->4 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] 6/-1/-1->7->4 [7] 6/-1/-1->7->4
kaveh:746290:746445 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff,00000000,00000000
kaveh:746281:746444 [0] NCCL INFO Channel 00/08 : 0 1 2 3 4 5 6 7
kaveh:746281:746444 [0] NCCL INFO Channel 01/08 : 0 1 2 3 4 5 6 7
kaveh:746281:746444 [0] NCCL INFO Channel 02/08 : 0 3 2 5 4 7 6 1
kaveh:746281:746444 [0] NCCL INFO Channel 03/08 : 0 3 2 5 4 7 6 1
kaveh:746281:746444 [0] NCCL INFO Channel 04/08 : 0 1 2 3 4 5 6 7
kaveh:746281:746444 [0] NCCL INFO Channel 05/08 : 0 1 2 3 4 5 6 7
kaveh:746281:746444 [0] NCCL INFO Channel 06/08 : 0 3 2 5 4 7 6 1
kaveh:746281:746444 [0] NCCL INFO Channel 07/08 : 0 3 2 5 4 7 6 1
kaveh:746282:746449 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] -1/-1/-1->1->6 [3] -1/-1/-1->1->6 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] -1/-1/-1->1->6 [7] -1/-1/-1->1->6
kaveh:746281:746444 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 3/-1/-1->0->-1 [3] 3/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 3/-1/-1->0->-1 [7] 3/-1/-1->0->-1
kaveh:746283:746450 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 5/-1/-1->2->3 [3] 5/-1/-1->2->3 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 5/-1/-1->2->3 [7] 5/-1/-1->2->3
kaveh:746284:746451 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 2/-1/-1->3->0 [3] 2/-1/-1->3->0 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 2/-1/-1->3->0 [7] 2/-1/-1->3->0
kaveh:746282:746449 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
kaveh:746281:746444 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
kaveh:746283:746450 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
kaveh:746284:746451 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
kaveh:746285:746447 [4] NCCL INFO Channel 00 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 00 : 5[a1000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 00 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 00 : 7[e1000] -> 0[1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 01 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 00 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 01 : 5[a1000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 01 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 01 : 7[e1000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 00 : 3[61000] -> 4[81000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 00 : 0[1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 04 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 01 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 00 : 1[25000] -> 2[41000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 04 : 5[a1000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 04 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 04 : 7[e1000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 01 : 3[61000] -> 4[81000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 01 : 0[1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 05 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 04 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 01 : 1[25000] -> 2[41000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 05 : 5[a1000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 05 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 05 : 7[e1000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 04 : 3[61000] -> 4[81000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 04 : 0[1000] -> 1[25000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 05 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 04 : 1[25000] -> 2[41000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 05 : 0[1000] -> 1[25000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 05 : 3[61000] -> 4[81000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 05 : 1[25000] -> 2[41000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 02 : 0[1000] -> 3[61000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 03 : 0[1000] -> 3[61000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 02 : 2[41000] -> 5[a1000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 06 : 0[1000] -> 3[61000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 03 : 2[41000] -> 5[a1000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 07 : 0[1000] -> 3[61000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 06 : 2[41000] -> 5[a1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 07 : 2[41000] -> 5[a1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 02 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 03 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 02 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 02 : 6[c1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 02 : 4[81000] -> 7[e1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 06 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 03 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 03 : 6[c1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 03 : 4[81000] -> 7[e1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 07 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 06 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 06 : 6[c1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 06 : 4[81000] -> 7[e1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 07 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 07 : 6[c1000] -> 1[25000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 07 : 4[81000] -> 7[e1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Connected all rings
kaveh:746284:746451 [3] NCCL INFO Connected all rings
kaveh:746283:746450 [2] NCCL INFO Channel 02 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 03 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 02 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 06 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 03 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 07 : 2[41000] -> 3[61000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 06 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 02 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 07 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 03 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 06 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 02 : 3[61000] -> 0[1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Connected all rings
kaveh:746286:746446 [5] NCCL INFO Connected all rings
kaveh:746282:746449 [1] NCCL INFO Channel 07 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 03 : 3[61000] -> 0[1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 02 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 06 : 3[61000] -> 0[1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 03 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 07 : 3[61000] -> 0[1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Connected all rings
kaveh:746290:746445 [7] NCCL INFO Connected all rings
kaveh:746281:746444 [0] NCCL INFO Connected all rings
kaveh:746282:746449 [1] NCCL INFO Connected all rings
kaveh:746285:746447 [4] NCCL INFO Channel 06 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 02 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 02 : 1[25000] -> 6[c1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 07 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 03 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 03 : 1[25000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 06 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 06 : 1[25000] -> 6[c1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 07 : 6[c1000] -> 7[e1000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 07 : 1[25000] -> 6[c1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 02 : 5[a1000] -> 2[41000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 03 : 5[a1000] -> 2[41000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 06 : 5[a1000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 02 : 7[e1000] -> 4[81000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 07 : 5[a1000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 03 : 7[e1000] -> 4[81000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 06 : 7[e1000] -> 4[81000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 07 : 7[e1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 00 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 00 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 01 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 01 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 00 : 2[41000] -> 1[25000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 00 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 04 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 04 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 01 : 2[41000] -> 1[25000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 01 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746284:746451 [3] NCCL INFO Channel 05 : 3[61000] -> 2[41000] via P2P/IPC
kaveh:746290:746445 [7] NCCL INFO Channel 05 : 7[e1000] -> 6[c1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 04 : 2[41000] -> 1[25000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 04 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 00 : 4[81000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 00 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746283:746450 [2] NCCL INFO Channel 05 : 2[41000] -> 1[25000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 00 : 6[c1000] -> 5[a1000] via P2P/IPC
kaveh:746286:746446 [5] NCCL INFO Channel 05 : 5[a1000] -> 4[81000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 01 : 4[81000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 01 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 01 : 6[c1000] -> 5[a1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 04 : 4[81000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 04 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 04 : 6[c1000] -> 5[a1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 05 : 4[81000] -> 3[61000] via P2P/IPC
kaveh:746282:746449 [1] NCCL INFO Channel 05 : 1[25000] -> 0[1000] via P2P/IPC
kaveh:746288:746448 [6] NCCL INFO Channel 05 : 6[c1000] -> 5[a1000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Connected all trees
kaveh:746281:746444 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746281:746444 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746290:746445 [7] NCCL INFO Connected all trees
kaveh:746290:746445 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746290:746445 [7] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746283:746450 [2] NCCL INFO Connected all trees
kaveh:746283:746450 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746283:746450 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746285:746447 [4] NCCL INFO Connected all trees
kaveh:746285:746447 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746285:746447 [4] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746282:746449 [1] NCCL INFO Connected all trees
kaveh:746282:746449 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746282:746449 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746284:746451 [3] NCCL INFO Connected all trees
kaveh:746284:746451 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746284:746451 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746286:746446 [5] NCCL INFO Connected all trees
kaveh:746286:746446 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746286:746446 [5] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746288:746448 [6] NCCL INFO Connected all trees
kaveh:746288:746448 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
kaveh:746288:746448 [6] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
kaveh:746286:746446 [5] NCCL INFO comm 0x7fc508002fb0 rank 5 nranks 8 cudaDev 5 busId a1000 - Init COMPLETE
kaveh:746283:746450 [2] NCCL INFO comm 0x7f6138002fb0 rank 2 nranks 8 cudaDev 2 busId 41000 - Init COMPLETE
kaveh:746282:746449 [1] NCCL INFO comm 0x7f94a8002fb0 rank 1 nranks 8 cudaDev 1 busId 25000 - Init COMPLETE
kaveh:746288:746448 [6] NCCL INFO comm 0x7f5170002fb0 rank 6 nranks 8 cudaDev 6 busId c1000 - Init COMPLETE
kaveh:746285:746447 [4] NCCL INFO comm 0x7fa414002fb0 rank 4 nranks 8 cudaDev 4 busId 81000 - Init COMPLETE
kaveh:746290:746445 [7] NCCL INFO comm 0x7f336c002fb0 rank 7 nranks 8 cudaDev 7 busId e1000 - Init COMPLETE
kaveh:746284:746451 [3] NCCL INFO comm 0x7f40e4002fb0 rank 3 nranks 8 cudaDev 3 busId 61000 - Init COMPLETE
kaveh:746281:746444 [0] NCCL INFO comm 0x7f79bc002fb0 rank 0 nranks 8 cudaDev 0 busId 1000 - Init COMPLETE
kaveh:746281:746281 [0] NCCL INFO Launch mode Parallel
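Since the log is long, the transport chosen for each GPU pair can be tallied with a short script. This is a sketch under the assumption that the connection lines keep the format shown above ("Channel NN : src[busId] -> dst[busId] via TRANSPORT"); the function name summarize_transports and the regular expression are my own and not part of NCCL.

```python
import re
from collections import Counter

# Matches per-connection lines such as:
#   ... NCCL INFO Channel 00 : 4[81000] -> 5[a1000] via P2P/IPC
# Ring-order lines like "Channel 00/08 : 0 1 2 ..." do not match.
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel (\d+) : (\d+)\[[0-9a-f]+\] -> (\d+)\[[0-9a-f]+\] via (\S+)"
)


def summarize_transports(log_text: str) -> Counter:
    """Count connections per (source rank, destination rank, transport)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            _channel, src, dst, transport = match.groups()
            counts[(int(src), int(dst), transport)] += 1
    return counts


# Three lines copied from the log above, used as a small sample.
sample = """kaveh:746285:746447 [4] NCCL INFO Channel 00 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746285:746447 [4] NCCL INFO Channel 01 : 4[81000] -> 5[a1000] via P2P/IPC
kaveh:746281:746444 [0] NCCL INFO Channel 02 : 0[1000] -> 3[61000] via P2P/IPC"""

summary = summarize_transports(sample)
for (src, dst, transport), n in sorted(summary.items()):
    print(f"{src} -> {dst} via {transport}: {n} channel(s)")
```

In the full log above, every connection is reported as P2P/IPC and all eight ranks reach "Init COMPLETE", so initialization itself appears to succeed before the hang.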
I was wondering if anyone has had a similar issue and knows how to resolve it.
Thanks!