I attempted to run distributed training on all 4 GPUs for the first time, using TF's simple MirroredStrategy (which uses NCCL all-reduce), and immediately got an unhandled CUDA error in nccl_ops.cc (see the brief error log below).
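For context, a minimal sketch of the kind of setup that triggers it is below. The toy model and dummy data are placeholders, not the actual notebook code, but the strategy setup is the same plain MirroredStrategy:

```python
import numpy as np
import tensorflow as tf

# On a multi-GPU Linux box, MirroredStrategy defaults to NCCL all-reduce
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Toy model (placeholder); the error seems to surface on the first
    # cross-GPU all-reduce of the gradients, regardless of the model
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data, just enough to run one training step
x = np.random.random((256, 32)).astype("float32")
y = np.random.random((256, 1)).astype("float32")
model.fit(x, y, epochs=1, batch_size=64)
```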
- Lambda QUAD Titan V (2018)
- Anaconda (conda 4.8.3) environment with TensorFlow 2.2.0
- Running this notebook in Jupyter Notebook
```
2020-06-10 13:36:56.319859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-10 13:36:57.698341: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-10 13:37:04.630742: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at nccl_ops.cc:104 : Unknown: Error invoking NCCL: unhandled cuda error
```
I can't currently reproduce the error to provide a longer log, because of the newly occurring error described in "Started getting TensorFlow profiler error after CUDA update", which causes the kernel to restart.