We have noticed a similar issue with the PyTorch version of a Transformer for NLP.
This is the system config:
It is a Lambda blade server with 8 RTX 3090 GPUs, NVIDIA Driver Version: 460.56, CUDA Version: 11.2.
Ubuntu 20.04 LTS, Dual AMD EPYC 7302 16-Core CPUs, 512GB of RAM.
We did quick profiling of a single training step of the Transformer on 1, 2, and 4 GPUs with a small troubleshooting data set. It confirms our suspicion that model parameter copying is what slows things down.
Below are the typical profiles. There is a lot of detail, but you can just look at the total times and the Memcpy, ReduceAddCoalesced, and aten::copy_ entries. The 1-GPU run processes a single batch of 12 examples, the 2-GPU run processes 2 batches of 12 (24 total), and the 4-GPU run 4 batches of 12 (48 total). As you can see, the runtime grows superlinearly: 2 GPUs take more than twice the time of 1 GPU, and 4 GPUs take more than four times the runtime of 1 GPU.
See the details below. Any idea whether this is due to red flag 3 mentioned here [https://medium.com/@c_61011/why-multi-gpu-training-is-not-faster-f439fe6dd6ec] or to some other (hardware) issue?
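For reference, the tables below were produced with torch.profiler around one forward/backward step of the model wrapped in nn.DataParallel (which is where the DataParallel.forward, Broadcast, and ReduceAddCoalesced entries come from). The following sketch reconstructs that setup with a stand-in TransformerEncoder and random data; the model dimensions, sequence length, and loss are placeholders, not our actual code:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity

# Stand-in encoder and random batch; our real Transformer, data, and loss differ.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.DataParallel(nn.TransformerEncoder(layer, num_layers=6)).cuda()

# 12 examples per visible GPU, sequence length 128.
batch = torch.randn(12 * torch.cuda.device_count(), 128, 512, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):
        out = model(batch)   # DataParallel broadcasts parameters and scatters the batch
    out.sum().backward()     # gradients are reduce-added back onto the default GPU

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```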
1 GPU:
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
autograd::engine::evaluate_function: AddmmBackward0 0.51% 578.000us 6.44% 7.329ms 192.868us 0.000us 0.00% 45.853ms 1.207ms 38
AddmmBackward0 0.35% 397.000us 4.38% 4.981ms 131.079us 0.000us 0.00% 44.273ms 1.165ms 38
aten::mm 2.28% 2.594ms 3.48% 3.960ms 52.105us 44.273ms 49.06% 44.273ms 582.539us 76
model_inference 22.45% 25.538ms 30.82% 35.056ms 35.056ms 0.000us 0.00% 30.270ms 30.270ms 1
aten::linear 0.19% 221.000us 2.39% 2.713ms 71.395us 0.000us 0.00% 22.575ms 594.079us 38
aten::addmm 1.21% 1.379ms 1.79% 2.031ms 53.447us 22.575ms 25.02% 22.575ms 594.079us 38
ampere_sgemm_128x128_nt 0.00% 0.000us 0.00% 0.000us 0.000us 15.953ms 17.68% 15.953ms 839.632us 19
ampere_sgemm_128x128_nn 0.00% 0.000us 0.00% 0.000us 0.000us 15.783ms 17.49% 15.783ms 830.684us 19
ampere_sgemm_128x128_tn 0.00% 0.000us 0.00% 0.000us 0.000us 15.222ms 16.87% 15.222ms 801.158us 19
ampere_sgemm_128x64_nt 0.00% 0.000us 0.00% 0.000us 0.000us 7.604ms 8.43% 7.604ms 245.290us 31
Self CPU time total: 113.737ms
Self CUDA time total: 90.240ms
2 GPU:
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
autograd::engine::evaluate_function: BroadcastBackwa… 0.09% 281.000us 1.51% 4.806ms 4.806ms 0.000us 0.00% 295.685ms 295.685ms 1
BroadcastBackward 0.03% 99.000us 1.43% 4.525ms 4.525ms 0.000us 0.00% 295.685ms 295.685ms 1
ReduceAddCoalesced 0.69% 2.187ms 1.39% 4.426ms 4.426ms 294.765ms 56.69% 295.685ms 295.685ms 1
ncclKernel_Reduce_RING_LL_Sum_float(ncclWorkElem) 0.00% 0.000us 0.00% 0.000us 0.000us 294.765ms 56.69% 294.765ms 7.018ms 42
aten::copy_ 0.44% 1.395ms 2.65% 8.421ms 97.919us 149.125ms 28.68% 149.125ms 1.734ms 86
model_inference 9.58% 30.398ms 19.33% 61.342ms 61.342ms 0.000us 0.00% 148.474ms 148.474ms 1
DataParallel.forward 6.62% 21.023ms 9.62% 30.535ms 30.535ms 0.000us 0.00% 148.464ms 148.464ms 1
Broadcast 0.29% 929.000us 2.01% 6.382ms 6.382ms 0.000us 0.00% 85.673ms 85.673ms 1
Memcpy DtoH (Device -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 73.949ms 14.22% 73.949ms 3.521ms 21
Memcpy HtoD (Device -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 73.422ms 14.12% 73.422ms 3.496ms 21
Self CPU time total: 317.421ms
Self CUDA time total: 519.947ms
4 GPU:
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
autograd::engine::evaluate_function: BroadcastBackwa… 0.06% 616.000us 0.75% 7.585ms 7.585ms 0.000us 0.00% 1.911s 1.911s 1
BroadcastBackward 0.02% 228.000us 0.69% 6.969ms 6.969ms 0.000us 0.00% 1.911s 1.911s 1
ReduceAddCoalesced 0.31% 3.176ms 0.67% 6.741ms 6.741ms 1.909s 67.18% 1.911s 1.911s 1
ncclKernel_Reduce_RING_LL_Sum_float(ncclWorkElem) 0.00% 0.000us 0.00% 0.000us 0.000us 1.909s 67.18% 1.909s 22.728ms 84
aten::copy_ 0.42% 4.199ms 25.48% 257.412ms 1.369ms 630.119ms 22.17% 630.119ms 3.352ms 188
model_inference 4.30% 43.429ms 32.94% 332.815ms 332.815ms 0.000us 0.00% 627.385ms 627.385ms 1
DataParallel.forward 2.77% 27.976ms 28.49% 287.843ms 287.843ms 0.000us 0.00% 627.362ms 627.362ms 1
Gather 0.02% 247.000us 18.12% 183.119ms 91.559ms 0.000us 0.00% 377.322ms 188.661ms 2
Memcpy DtoH (Device -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 313.537ms 11.03% 313.537ms 4.977ms 63
Memcpy HtoD (Device -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 312.099ms 10.98% 312.099ms 4.954ms 63
Self CPU time total: 1.010s
Self CUDA time total: 2.842s
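To help narrow down the hardware question, here is a quick check (independent of the training code) of whether the GPUs report peer-to-peer access to each other. If any pair used in training reports False, the Device -> Device traffic above has to be staged through host memory, which could contribute to the large Memcpy and ncclKernel times:

```python
import torch

# Ask the CUDA driver whether each GPU pair can access the other's memory directly.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            p2p = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access = {p2p}")
```

We can also cross-check the PCIe/NVLink topology between the cards with nvidia-smi topo -m.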