Hello.
Every time I try to fit an LSTM in TensorFlow on our Lambda workstation, the Python kernel dies with the following error message:
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_adv_train.so.8 is in your library path!
The complete error stack is at the bottom of this message.
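For context, the model itself is nothing exotic; a minimal script roughly along these lines (hypothetical shapes, not my exact code) is enough to trigger the crash as soon as fitting reaches the cuDNN RNN kernel:

```python
import numpy as np
import tensorflow as tf

# Toy data purely to drive one LSTM training step (shapes are arbitrary).
x = np.random.rand(64, 10, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# The kernel dies here, when TensorFlow dispatches to the cuDNN RNN ops.
model.fit(x, y, epochs=1, batch_size=16)
```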
This problem appears to be similar to this thread from a year ago, which indicates that Lambda Stack does not (yet?) support libcudnn v8.
To fix this, I have tried invoking the update command below from the Lambda Stack webpage:
sudo apt-get update && sudo apt-get dist-upgrade
This does not fix the problem.
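In case it helps with diagnosis, a small check along these lines (my own sketch, nothing Lambda-specific) can be run to see which cuDNN components dlopen successfully from within Python:

```python
import ctypes

# Attempt to load each cuDNN shared library directly and report the result.
for lib in ("libcudnn.so.8",
            "libcudnn_ops_train.so.8",
            "libcudnn_adv_train.so.8"):
    try:
        ctypes.CDLL(lib)
        print(f"{lib}: loaded OK")
    except OSError as err:
        print(f"{lib}: failed to load ({err})")
```

Judging from the traceback below, the base libcudnn.so.8 loads fine (cuDNN 8201 is reported), but the ops_train/adv_train components cannot be found.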
Is this a known problem with Lambda Stack? I'm running the preinstalled Lambda Stack on a Lambda workstation and am not aware of having taken any other action that would update or desynchronize TensorFlow and cuDNN.
Thank you in advance for any help you might be able to provide.
Invalid MIT-MAGIC-COOKIE-1 key
2021-11-12 10:29:13.341802: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX512F
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-12 10:29:14.026866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22357 MB memory: -> device: 0, name: NVIDIA RTX A5000, pci bus id: 0000:67:00.0, compute capability: 8.6
2021-11-12 10:29:14.027819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 21993 MB memory: -> device: 1, name: NVIDIA RTX A5000, pci bus id: 0000:68:00.0, compute capability: 8.6
2021-11-12 10:29:15.356581: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-12 10:29:16.144405: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8201
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_adv_train.so.8 is in your library path!
[lambda-dual:06296] *** Process received signal ***
[lambda-dual:06296] Signal: Aborted (6)
[lambda-dual:06296] Signal code: (-6)
[lambda-dual:06296] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f1665725210]
[lambda-dual:06296] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f166572518b]
[lambda-dual:06296] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1665704859]
[lambda-dual:06296] [ 3] /usr/lib/python3/dist-packages/tensorflow/python/…/libcudnn.so.8(cudnnRNNForwardTraining+0x230)[0x7f15a0d96480]
[lambda-dual:06296] [ 4] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(cudnnRNNForwardTraining+0x8c)[0x7f15a5fb48cc]
[lambda-dual:06296] [ 5] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudnnSupport16DoRnnForwardImplIfEEN10tensorflow6StatusEPNS_6StreamERKNS0_18CudnnRnnDescriptorERKNS0_32CudnnRnnSequenceTensorDescriptorERKNS_12DeviceMemoryIT_EERKNSD_IiEERKNS0_29CudnnRnnStateTensorDescriptorESH_SN_SH_SH_SC_PSF_SN_SO_SN_SO_bPNS_16ScratchAllocatorESQ_PNS_3dnn13ProfileResultE+0x1080)[0x7f15a5f88750]
[lambda-dual:06296] [ 6] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN15stream_executor3gpu12CudnnSupport12DoRnnForwardEPNS_6StreamERKNS_3dnn13RnnDescriptorERKNS4_27RnnSequenceTensorDescriptorERKNS_12DeviceMemoryIfEERKNSB_IiEERKNS4_24RnnStateTensorDescriptorESE_SK_SE_SE_SA_PSC_SK_SL_SK_SL_bPNS_16ScratchAllocatorESN_PNS4_13ProfileResultE+0x65)[0x7f15a5f88ec5]
[lambda-dual:06296] [ 7] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN15stream_executor6Stream14ThenRnnForwardERKNS_3dnn13RnnDescriptorERKNS1_27RnnSequenceTensorDescriptorERKNS_12DeviceMemoryIfEERKNS8_IiEERKNS1_24RnnStateTensorDescriptorESB_SH_SB_SB_S7_PS9_SH_SI_SH_SI_bPNS_16ScratchAllocatorESK_PNS1_13ProfileResultE+0x93)[0x7f15c1b1d483]
[lambda-dual:06296] [ 8] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(+0xa83f526)[0x7f15babf4526]
[lambda-dual:06296] [ 9] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN10tensorflow17CudnnRNNForwardOpIN5Eigen9GpuDeviceEfE25ComputeAndReturnAlgorithmEPNS_15OpKernelContextEPN15stream_executor3dnn15AlgorithmConfigEbbi+0x5f1)[0x7f15babfd631]
[lambda-dual:06296] [10] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN10tensorflow17CudnnRNNForwardOpIN5Eigen9GpuDeviceEfE7ComputeEPNS_15OpKernelContextE+0x59)[0x7f15bac07959]
[lambda-dual:06296] [11] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x24e)[0x7f15a563b4de]
[lambda-dual:06296] [12] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(+0x976e48)[0x7f15a5731e48]
[lambda-dual:06296] [13] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x2a5)[0x7f15b4b7aed5]
[lambda-dual:06296] [14] /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.cpython-38-x86_64-linux-gnu.so(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x47)[0x7f15b4b77eb7]
[lambda-dual:06296] [15] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(+0xe236ef)[0x7f15a5bde6ef]
[lambda-dual:06296] [16] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f16656c5609]
[lambda-dual:06296] [17] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f1665801293]
[lambda-dual:06296] *** End of error message ***