Question: is there any way that your combination of BERT and Horovod can accommodate the fact that, for SQuAD, the processing of each question against the paragraphs of a document is completely independent of the processing of every other question?
I ask because I have noticed no improvement in prediction performance when using the BERT multi-GPU TensorFlow and Horovod setup.
Here are the details:
I’ve set up the BERT multi-GPU implementation using TensorFlow and Horovod in the hope that it would both speed up run_squad prediction and make use of both GPUs on the host machine. Following the instructions, both GPUs do indeed appear to be operating at or near capacity, and more than one CPU is being used (nice to see multi-CPU processing, too).
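For reference, the standard Horovod GPU-pinning pattern looks roughly like the sketch below (adapted from the Horovod docs, not quoted from the exact setup I used). With two processes, each pinned to one GPU, both GPUs stay busy; but unless the eval data is explicitly sharded, each process runs the full prediction pass on its own copy of the data:

```python
# Rough sketch of the usual Horovod GPU pinning for a TF estimator-style script.
# Each process is pinned to one GPU via its local rank; nothing here splits the
# eval data, so every process still predicts the entire dataset.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

config = tf.compat.v1.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())  # one GPU per process
run_config = tf.estimator.RunConfig(session_config=config)
```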
I also observe that the artifacts created in the result/output directory and those created in worker 1’s subdirectory are identical: the eval.tf_record files, the nbest_prediction files, …
BTW, the elapsed processing time is actually slightly faster without the Horovod adaptations.
So it appears that this very cool Horovod-based approach does not help with prediction. Is that correct?
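One idea I’ve been wondering about, given that each SQuAD question is independent: shard the eval examples across Horovod ranks so each GPU predicts a disjoint slice, then merge the per-rank output files afterwards. This is only a sketch; `eval_examples` and `predict_and_write` are illustrative names, not functions from run_squad:

```python
# Sketch: split prediction work across Horovod ranks instead of having every
# rank predict the full dataset. Helper names below are hypothetical.
import horovod.tensorflow as hvd

hvd.init()

def shard(examples):
    # Round-robin split: rank i keeps examples i, i + size, i + 2*size, ...
    return examples[hvd.rank()::hvd.size()]

my_examples = shard(eval_examples)  # eval_examples: the full list of SQuAD examples
predict_and_write(my_examples, output_file=f"predictions_rank{hvd.rank()}.json")

# After all ranks finish, rank 0 (or a separate post-processing step) would merge
# the per-rank prediction files into the final predictions / n-best JSON.
```

Would something along these lines fit with your Horovod integration?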
You can see a related conversation in a horovod issue.
We have noticed a similar issue with the PyTorch version of the Transformer for NLP.
This is the system config:
It is a Lambda blade server with 8 RTX 3090 GPUs, NVIDIA Driver Version: 460.56, CUDA Version: 11.2.
Ubuntu 20.04 LTS, Dual AMD EPYC 7302 16-Core CPUs, 512GB of RAM.
We did a quick profile of a single training step of the transformer on 1, 2, and 4 GPUs with a troubleshooting data set. It confirms our suspicion that model parameter copying is what slows things down.
Below are the typical profiles. There are a lot of details, but you can just look at the total time and the memcpy, ReduceAdd, and _copy entries. 1 GPU runs on a single batch of 12 examples, 2 GPUs run on 2 batches of 12 examples (24 total), and 4 GPUs on 4 batches of 12 (48 total). As you can see, the runtime scales superlinearly: 2 GPUs take more than twice the time of 1 GPU, and 4 GPUs take more than 4 times the time of 1 GPU.
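For context, a minimal sketch of how this kind of per-step profile can be collected with torch.profiler is shown below, assuming the multi-GPU runs use nn.DataParallel (which is what produces the parameter broadcast and gradient reduce-add copies). The toy encoder and random batch are placeholders for the real model and data:

```python
# Sketch: profile one training step of a toy transformer under nn.DataParallel.
# The model, sizes, and random batch are stand-ins for the real workload.
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.DataParallel(nn.TransformerEncoder(encoder_layer, num_layers=6).cuda(),
                        device_ids=[0, 1])  # the 2-GPU case
optimizer = torch.optim.Adam(model.parameters())

# 12 examples per GPU, mirroring the batch sizes mentioned above.
batch = torch.randn(24, 128, 512, device="cuda:0")  # (batch, seq_len, d_model)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    optimizer.zero_grad()
    out = model(batch)      # forward: DataParallel copies the parameters to every GPU
    loss = out.mean()
    loss.backward()         # backward: gradients are reduce-added back onto GPU 0,
    optimizer.step()        # visible as the memcpy / ReduceAdd / _copy entries

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```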