Could not load library libcudnn_adv_train.so.8 error on lambda workstation

Hello.

Every time I try to fit an LSTM in TensorFlow on our Lambda workstation, the Python kernel dies with the following error message:

Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_adv_train.so.8 is in your library path!

The complete error stack is at the bottom of this message.

This problem appears to be similar to this thread from a year ago, which indicates that Lambda Stack does not (yet?) support libcudnn v8.

To fix this, I’ve tried invoking the update command below from the Lambda Stack webpage:

sudo apt-get update && sudo apt-get dist-upgrade

This does not fix the problem.
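In case it helps with debugging, a couple of quick checks can show whether the missing libcudnn_ops_train.so.8 is simply not on the loader path (the dist-packages path is taken from the stack trace below):

```shell
# Which cuDNN components the dynamic loader can see
ldconfig -p | grep -i cudnn || echo "no cuDNN libraries in the loader cache"

# Which cuDNN files the Lambda Stack TensorFlow package bundles
ls /usr/lib/python3/dist-packages/tensorflow/ 2>/dev/null | grep -i cudnn \
  || echo "no bundled cuDNN libraries found"
```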

Is this a known problem with Lambda Stack? I’m running a preinstalled Lambda Stack on a Lambda workstation and am not aware of having taken any other action that would update or desync TensorFlow and cuDNN.

Thank you in advance for any help you might be able to provide.

Invalid MIT-MAGIC-COOKIE-1 key2021-11-12 10:29:13.341802: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-12 10:29:14.026866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22357 MB memory: → device: 0, name: NVIDIA RTX A5000, pci bus id: 0000:67:00.0, compute capability: 8.6
2021-11-12 10:29:14.027819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 21993 MB memory: → device: 1, name: NVIDIA RTX A5000, pci bus id: 0000:68:00.0, compute capability: 8.6
2021-11-12 10:29:15.356581: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-12 10:29:16.144405: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8201
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_adv_train.so.8 is in your library path!
[lambda-dual:06296] *** Process received signal ***
[lambda-dual:06296] Signal: Aborted (6)
[lambda-dual:06296] Signal code: (-6)
[lambda-dual:06296] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f1665725210]
[lambda-dual:06296] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f166572518b]
[lambda-dual:06296] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1665704859]
[lambda-dual:06296] [ 3] /usr/lib/python3/dist-packages/tensorflow/python/…/libcudnn.so.8(cudnnRNNForwardTraining+0x230)[0x7f15a0d96480]
[lambda-dual:06296] [ 4] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(cudnnRNNForwardTraining+0x8c)[0x7f15a5fb48cc]
[lambda-dual:06296] [ 5] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudnnSupport16DoRnnForwardImplIfEEN10tensorflow6StatusEPNS_6StreamERKNS0_18CudnnRnnDescriptorERKNS0_32CudnnRnnSequenceTensorDescriptorERKNS_12DeviceMemoryIT_EERKNSD_IiEERKNS0_29CudnnRnnStateTensorDescriptorESH_SN_SH_SH_SC_PSF_SN_SO_SN_SO_bPNS_16ScratchAllocatorESQ_PNS_3dnn13ProfileResultE+0x1080)[0x7f15a5f88750]
[lambda-dual:06296] [ 6] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudnnSupport12DoRnnForwardEPNS_6StreamERKNS_3dnn13RnnDescriptorERKNS4_27RnnSequenceTensorDescriptorERKNS_12DeviceMemoryIfEERKNSB_IiEERKNS4_24RnnStateTensorDescriptorESE_SK_SE_SE_SA_PSC_SK_SL_SK_SL_bPNS_16ScratchAllocatorESN_PNS4_13ProfileResultE+0x65)[0x7f15a5f88ec5]
[lambda-dual:06296] [ 7] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN15stream_executor6Stream14ThenRnnForwardERKNS_3dnn13RnnDescriptorERKNS1_27RnnSequenceTensorDescriptorERKNS_12DeviceMemoryIfEERKNS8_IiEERKNS1_24RnnStateTensorDescriptorESB_SH_SB_SB_S7_PS9_SH_SI_SH_SI_bPNS_16ScratchAllocatorESK_PNS1_13ProfileResultE+0x93)[0x7f15c1b1d483]
[lambda-dual:06296] [ 8] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(+0xa83f526)[0x7f15babf4526]
[lambda-dual:06296] [ 9] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN10tensorflow17CudnnRNNForwardOpIN5Eigen9GpuDeviceEfE25ComputeAndReturnAlgorithmEPNS_15OpKernelContextEPN15stream_executor3dnn15AlgorithmConfigEbbi+0x5f1)[0x7f15babfd631]
[lambda-dual:06296] [10] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN10tensorflow17CudnnRNNForwardOpIN5Eigen9GpuDeviceEfE7ComputeEPNS_15OpKernelContextE+0x59)[0x7f15bac07959]
[lambda-dual:06296] [11] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x24e)[0x7f15a563b4de]
[lambda-dual:06296] [12] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(+0x976e48)[0x7f15a5731e48]
[lambda-dual:06296] [13] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x2a5)[0x7f15b4b7aed5]
[lambda-dual:06296] [14] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x47)[0x7f15b4b77eb7]
[lambda-dual:06296] [15] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(+0xe236ef)[0x7f15a5bde6ef]
[lambda-dual:06296] [16] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f16656c5609]
[lambda-dual:06296] [17] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f1665801293]
[lambda-dual:06296] *** End of error message ***

After trying several different possible fixes, I did a clean reinstall of Ubuntu 20.04 and Lambda Stack. The problem persisted with the same error message.

I then did a second clean reinstall of Ubuntu 20.04 and manually installed graphics drivers (495.29.05), CUDA (11.5), cuDNN (8.3.1.22), and TensorFlow (2.7.0). My LSTM test script now works correctly.
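For anyone repeating this, a quick sanity check after the manual install (nothing Lambda-specific, just the stock TensorFlow API) is to list the visible GPUs; on a working setup this should print both A5000s without aborting:

```shell
# Import TensorFlow and list the GPUs it can see; falls back to a message
# if the import or GPU initialization fails
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" \
  || echo "TensorFlow failed to import or initialize"
```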

I suspect that this is a problem with Lambda Stack and have forwarded my test script on to Lambda Labs technical support in case they’d like to try to reproduce the problem.

This is normally an issue with Anaconda or Python venv/virtualenv (without --system-site-packages).

The issue is that Anaconda removes all the normal system paths, and TensorFlow/PyTorch are built against cuDNN (CUDA, etc.), so the libraries cannot be found. The builds include their own copies on the Python path (which is likely part of the issue generally), under /usr/lib/python3/dist-packages:
/usr/lib/python3/dist-packages/tensorflow/libcudnn.so.8
/usr/lib/python3/dist-packages/torch/lib/libcudnn.so.8

And NVIDIA’s default install lands in a non-standard location, /usr/local (which is meant for locally built site software); the standard location for third-party software is /opt.

A manual NVIDIA install can work, or the cuDNN package can be installed to work around this when using Anaconda or venv/virtualenv. Or, with venv/virtualenv, you can use --system-site-packages.
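For the venv route, a minimal sketch (the ~/tf-venv location is just an example):

```shell
# Create a venv that can see the system-wide TensorFlow (and its bundled cuDNN)
python3 -m venv --system-site-packages ~/tf-venv
. ~/tf-venv/bin/activate
python -c "import sys; print(sys.prefix)"   # now points inside tf-venv
```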

Thank you, but I wasn’t using Anaconda or a python virtual environment when I experienced this problem. In order to minimize confounding factors, I did a clean install of Ubuntu, a clean install of Lambda Stack, and ran an LSTM test script directly from the command line.

Since the same test script runs fine from the command line after I did a second clean install of Ubuntu and manually installed TensorFlow, I suspect the problem lies elsewhere.

FWIW, I’ve subsequently installed miniconda on the workstation, and TensorFlow works just fine within a conda environment, returning all the expected messages about how it’s finding the GPUs.

If anyone else is running into this problem: I was able to solve it by copying the libraries to the folder where TensorFlow expects them to be.

like so:

sudo cp /usr/lib/python3/dist-packages/tensorflow/libcudnn* /usr/lib/x86_64-linux-gnu/

After that, cuDNN works for me without doing any further updates whatsoever.
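A variant that avoids copying files into a system directory, in case anyone prefers it: extend the loader search path instead (same tensorflow directory as in the cp command above):

```shell
# Point the dynamic loader at TensorFlow's bundled cuDNN for this shell session
export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:${LD_LIBRARY_PATH}
```

To make it permanent, the export line can go in ~/.bashrc.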

Yes, that can also work, but it is incomplete.

What that does is copy the cuDNN libraries into the system-wide library path.
That will help if you are not using the system Python, which already knows where to find the library for its specific build of PyTorch or TensorFlow.

If you are using Anaconda, it installs its own Python, PyTorch, TensorFlow, CUDA toolkit, and cuDNN. However, Anaconda does not point LD_LIBRARY_PATH at the new location, so you may need:
$ export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}
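To make that export stick for a particular env, conda supports per-env activation hook scripts; a sketch (the fallback path used when no env is active is an assumption):

```shell
# Write an activation hook so LD_LIBRARY_PATH is set every time the env activates
PREFIX="${CONDA_PREFIX:-$HOME/miniconda3/envs/tf}"   # fallback env path is an assumption
mkdir -p "$PREFIX/etc/conda/activate.d"
printf 'export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}\n' \
  > "$PREFIX/etc/conda/activate.d/cudnn_path.sh"
```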

And if you are using Python venv or virtualenv, I normally install the appropriate cuDNN manually in those virtual environments for the given build I am using.

And a Docker image would already have that included if it ships TensorFlow/cuDNN.