I attempted to run distributed training on all 4 GPUs for the first time, using TF's simple MirroredStrategy (which uses NCCL all-reduce), and immediately got an unhandled CUDA error in nccl_ops.cc (see the brief error log below).
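For context, a minimal sketch of the kind of setup that triggers it is below. The toy model and dummy data are placeholders, not the actual notebook code, but the strategy setup is the same plain MirroredStrategy:

```python
import numpy as np
import tensorflow as tf

# On a multi-GPU Linux box, MirroredStrategy defaults to NCCL all-reduce
strategy = tf.distribute.MirroredStrategy()
print("Number of devices:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Toy model (placeholder); the error seems to surface on the first
    # cross-GPU all-reduce of the gradients, regardless of the model
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data, just enough to run one training step
x = np.random.random((256, 32)).astype("float32")
y = np.random.random((256, 1)).astype("float32")
model.fit(x, y, epochs=1, batch_size=64)
```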
- Lambda QUAD Titan V (2018)
- Anaconda (conda 4.8.3) environment with TensorFlow 2.2.0
- Running this notebook in Jupyter Notebook
```
2020-06-10 13:36:56.319859: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-10 13:36:57.698341: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-10 13:37:04.630742: W tensorflow/core/framework/op_kernel.cc:1753] OP_REQUIRES failed at nccl_ops.cc:104 : Unknown: Error invoking NCCL: unhandled cuda error
```
I can't currently reproduce the error to provide a longer log, because of the newly occurring error described in "Started getting TensorFlow profiler error after CUDA update", which causes the kernel to restart.