Getting an "unhandled cuda error" when trying to use TF distributed strategy for training on 4 GPUs

Brief

Attempted to run distributed training on all 4 GPUs for the first time, using TF's simple MirroredStrategy (which uses NCCL all-reduce), and immediately got an unhandled CUDA error in nccl_ops.cc (see the short error log below). Due to another error (described in the note below), which causes the notebook kernel to restart, I cannot currently reproduce this error to provide a longer log.
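
For context, the setup is roughly the following minimal sketch (the model and data are placeholders, not the actual notebook code); the cross-GPU all-reduce that fails happens inside fit():

import tensorflow as tf

# MirroredStrategy defaults to NCCL all-reduce across the visible GPUs on Linux.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)  # expect 4 on the QUAD

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# The gradient all-reduce across the 4 GPUs runs during fit(); this is where
# the "Error invoking NCCL: unhandled cuda error" surfaced.
x = tf.random.normal((256, 32))
y = tf.random.normal((256, 1))
model.fit(x, y, epochs=1, batch_size=64)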

Configuration

Lambda QUAD Titan V (2018)
Lambda Stack
Ubuntu 18.04
Anaconda (conda 4.8.3) environment with TensorFlow 2.2.0
Running this notebook in Jupyter Notebook

Error log (short)

2020-06-10 13:36:56.319859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-10 13:36:57.698341: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-10 13:37:04.630742: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at nccl_ops.cc:104 : Unknown: Error invoking NCCL: unhandled cuda error

Note

I can’t reproduce the error to provide a longer log because of a newly occurring error, described in “Started getting TensorFlow profiler error after CUDA update”, which causes the kernel to restart.

Possibly solved

After fixing the other error that was blocking replication of this one, this error has also gone away. It is possible that the software updates described in that other report were the solution.
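
If useful for anyone hitting the same thing, a quick sanity check along these lines (a sketch, not the notebook's actual code) can confirm that all 4 GPUs are visible and that MirroredStrategy now initializes cleanly:

import tensorflow as tf

# List the visible GPUs and confirm MirroredStrategy picks up all of them.
print(tf.config.list_physical_devices("GPU"))
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)  # expect 4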