BERT Multi-GPU TensorFlow and Horovod for prediction - no improvement

Question: is there any way that your combination of BERT and Horovod can accommodate the fact that, for SQuAD, the processing of each question against the paragraphs of a document is completely independent of the processing of every other question?

I ask because I have noticed no improvement in prediction performance when using BERT multi-GPU TensorFlow and Horovod.

Here are the details:

I’ve set up the BERT multi-GPU implementation using TensorFlow and Horovod in the hope that it would both speed up run_squad prediction and make use of both GPUs on the host machine. Following the instructions, both GPUs do appear to be operating at or near capacity, and more than one CPU is being used (nice to see multi-CPU processing, too).
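For context, the Horovod wiring I followed looks roughly like the sketch below (variable names are illustrative rather than copied from run_squad.py). Each process pins one GPU, which matches the utilization I see:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# One process per GPU: pin each process to its local GPU.
config = tf.compat.v1.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
config.gpu_options.allow_growth = True

# For training, gradients are averaged across ranks via the wrapped optimizer.
# Note there is no equivalent step for prediction unless the eval data is sharded.
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=5e-5 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)
```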

What I also observe is that the artifacts created in the result/output directory and those created in worker 1’s subdirectory are identical: the eval.tf_record files, the nbest_prediction files, …

By the way, the elapsed processing time is actually slightly faster without the Horovod adaptations.

So it appears that this very cool Horovod-based approach does not help with prediction. Is that correct?
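If the answer is that every rank currently runs prediction over the full eval set, one direction that would exploit the independence of the questions (purely a sketch on my part, not something the existing script does) is to shard the eval records across ranks and merge the per-rank prediction files afterwards:

```python
# Hypothetical sketch: give each Horovod rank a disjoint 1/N slice of the
# eval records, then merge the per-rank prediction JSON files.
# Names here are illustrative, not taken from run_squad.py.
import json
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

def sharded_eval_dataset(tf_record_path, batch_size):
    ds = tf.data.TFRecordDataset(tf_record_path)
    ds = ds.shard(num_shards=hvd.size(), index=hvd.rank())  # disjoint slice per rank
    return ds.batch(batch_size)

def merge_rank_predictions(per_rank_paths, output_path):
    merged = {}
    for path in per_rank_paths:          # e.g. one predictions file per rank
        with open(path) as f:
            merged.update(json.load(f))  # SQuAD predictions are keyed by question id
    with open(output_path, "w") as f:
        json.dump(merged, f)
```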

You can see a related conversation in a Horovod issue.

We have noticed a similar issue with the PyTorch version of a Transformer for NLP.
This is the system config: a Lambda blade server with 8 RTX 3090 GPUs, NVIDIA driver version 460.56, CUDA version 11.2, Ubuntu 20.04 LTS, dual AMD EPYC 7302 16-core CPUs, and 512 GB of RAM.

We did a quick profile of a single training step of the Transformer on 1, 2, and 4 GPUs with a troubleshooting data set. It confirms our suspicion that copying model parameters is what slows things down.
Below are the typical profiles. There is a lot of detail, but the totals and the Memcpy, ReduceAdd, and copy_ entries are the parts to look at. The 1-GPU run processes a single batch of 12 examples, the 2-GPU run processes 2 batches of 12 (24 total), and the 4-GPU run processes 4 batches of 12 (48 total). Even so, the runtime grows superlinearly: 2 GPUs take more than twice the time of 1 GPU, and 4 GPUs take more than four times the time of 1 GPU.
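For reference, the tables were collected with the PyTorch profiler, roughly as in the sketch below (a toy model stands in for our actual Transformer; the real model and data code are omitted):

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Toy stand-in for the real Transformer, just to show how the tables are produced.
model = torch.nn.DataParallel(
    torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    )
).cuda()

# 12 examples per GPU, matching the batch sizes described above.
batch = torch.randn(12 * torch.cuda.device_count(), 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with record_function("model_inference"):
        out = model(batch)                # forward: Broadcast + per-GPU compute + Gather
    out.sum().backward()                  # backward: ReduceAddCoalesced over NCCL

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```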

See the details below. Any idea whether this is due to red flag 3 mentioned here [Why My Multi-GPU training is slow? | by Chuan Li | Medium] or to some other (hardware) issue?
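For what it's worth, the DataParallel.forward, Broadcast, ReduceAddCoalesced, and Gather entries in the profiles indicate torch.nn.DataParallel, which re-broadcasts parameters and gathers outputs on every step. As a comparison point (just a sketch with a toy model, not our production code), the equivalent DistributedDataParallel setup with one process per GPU would look like this:

```python
# Sketch for comparison: the same kind of step under DistributedDataParallel
# (one process per GPU) instead of nn.DataParallel. A toy Linear model stands
# in for our Transformer. Launch with:
#   torchrun --nproc_per_node=4 ddp_check.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # stand-in for the Transformer
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                                     # a few steps are enough to compare
        x = torch.randn(12, 1024, device=local_rank)        # 12 examples per GPU
        loss = model(x).pow(2).mean()
        loss.backward()                                     # gradients all-reduced by NCCL here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If the per-step time stays roughly flat from 1 to 4 processes under this setup, the overhead in our numbers is coming from the DataParallel replication rather than from the hardware itself.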

1 GPU:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|
| autograd::engine::evaluate_function: AddmmBackward0 | 0.51% | 578.000us | 6.44% | 7.329ms | 192.868us | 0.000us | 0.00% | 45.853ms | 1.207ms | 38 |
| AddmmBackward0 | 0.35% | 397.000us | 4.38% | 4.981ms | 131.079us | 0.000us | 0.00% | 44.273ms | 1.165ms | 38 |
| aten::mm | 2.28% | 2.594ms | 3.48% | 3.960ms | 52.105us | 44.273ms | 49.06% | 44.273ms | 582.539us | 76 |
| model_inference | 22.45% | 25.538ms | 30.82% | 35.056ms | 35.056ms | 0.000us | 0.00% | 30.270ms | 30.270ms | 1 |
| aten::linear | 0.19% | 221.000us | 2.39% | 2.713ms | 71.395us | 0.000us | 0.00% | 22.575ms | 594.079us | 38 |
| aten::addmm | 1.21% | 1.379ms | 1.79% | 2.031ms | 53.447us | 22.575ms | 25.02% | 22.575ms | 594.079us | 38 |
| ampere_sgemm_128x128_nt | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 15.953ms | 17.68% | 15.953ms | 839.632us | 19 |
| ampere_sgemm_128x128_nn | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 15.783ms | 17.49% | 15.783ms | 830.684us | 19 |
| ampere_sgemm_128x128_tn | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 15.222ms | 16.87% | 15.222ms | 801.158us | 19 |
| ampere_sgemm_128x64_nt | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 7.604ms | 8.43% | 7.604ms | 245.290us | 31 |

Self CPU time total: 113.737ms
Self CUDA time total: 90.240ms

2 GPU:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|
| autograd::engine::evaluate_function: BroadcastBackwa… | 0.09% | 281.000us | 1.51% | 4.806ms | 4.806ms | 0.000us | 0.00% | 295.685ms | 295.685ms | 1 |
| BroadcastBackward | 0.03% | 99.000us | 1.43% | 4.525ms | 4.525ms | 0.000us | 0.00% | 295.685ms | 295.685ms | 1 |
| ReduceAddCoalesced | 0.69% | 2.187ms | 1.39% | 4.426ms | 4.426ms | 294.765ms | 56.69% | 295.685ms | 295.685ms | 1 |
| ncclKernel_Reduce_RING_LL_Sum_float(ncclWorkElem) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 294.765ms | 56.69% | 294.765ms | 7.018ms | 42 |
| aten::copy_ | 0.44% | 1.395ms | 2.65% | 8.421ms | 97.919us | 149.125ms | 28.68% | 149.125ms | 1.734ms | 86 |
| model_inference | 9.58% | 30.398ms | 19.33% | 61.342ms | 61.342ms | 0.000us | 0.00% | 148.474ms | 148.474ms | 1 |
| DataParallel.forward | 6.62% | 21.023ms | 9.62% | 30.535ms | 30.535ms | 0.000us | 0.00% | 148.464ms | 148.464ms | 1 |
| Broadcast | 0.29% | 929.000us | 2.01% | 6.382ms | 6.382ms | 0.000us | 0.00% | 85.673ms | 85.673ms | 1 |
| Memcpy DtoH (Device → Device) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 73.949ms | 14.22% | 73.949ms | 3.521ms | 21 |
| Memcpy HtoD (Device → Device) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 73.422ms | 14.12% | 73.422ms | 3.496ms | 21 |

Self CPU time total: 317.421ms
Self CUDA time total: 519.947ms

4 GPU:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|
| autograd::engine::evaluate_function: BroadcastBackwa… | 0.06% | 616.000us | 0.75% | 7.585ms | 7.585ms | 0.000us | 0.00% | 1.911s | 1.911s | 1 |
| BroadcastBackward | 0.02% | 228.000us | 0.69% | 6.969ms | 6.969ms | 0.000us | 0.00% | 1.911s | 1.911s | 1 |
| ReduceAddCoalesced | 0.31% | 3.176ms | 0.67% | 6.741ms | 6.741ms | 1.909s | 67.18% | 1.911s | 1.911s | 1 |
| ncclKernel_Reduce_RING_LL_Sum_float(ncclWorkElem) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 1.909s | 67.18% | 1.909s | 22.728ms | 84 |
| aten::copy_ | 0.42% | 4.199ms | 25.48% | 257.412ms | 1.369ms | 630.119ms | 22.17% | 630.119ms | 3.352ms | 188 |
| model_inference | 4.30% | 43.429ms | 32.94% | 332.815ms | 332.815ms | 0.000us | 0.00% | 627.385ms | 627.385ms | 1 |
| DataParallel.forward | 2.77% | 27.976ms | 28.49% | 287.843ms | 287.843ms | 0.000us | 0.00% | 627.362ms | 627.362ms | 1 |
| Gather | 0.02% | 247.000us | 18.12% | 183.119ms | 91.559ms | 0.000us | 0.00% | 377.322ms | 188.661ms | 2 |
| Memcpy DtoH (Device → Device) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 313.537ms | 11.03% | 313.537ms | 4.977ms | 63 |
| Memcpy HtoD (Device → Device) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 312.099ms | 10.98% | 312.099ms | 4.954ms | 63 |

Self CPU time total: 1.010s
Self CUDA time total: 2.842s
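To help rule a hardware issue in or out, `nvidia-smi topo -m` shows the PCIe/NVLink topology between the GPUs, and the short snippet below (just a sanity check, not a fix) reports whether CUDA peer-to-peer access is available between each pair of devices:

```python
import torch

# Report CUDA peer-to-peer accessibility between every pair of GPUs.
# Without peer access, device-to-device copies are typically staged
# through host memory, which inflates Memcpy times.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'not available'}")
```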