Hello, I ran into an issue that so far seems to be present only on the Lambda workstation.
Here is the link to the issue
You have to love the name… VRAM-hungry LSTM monster.
I took a quick look. I was able to reproduce the hang/dead kernel in Jupyter Notebook, and the failure on the command line.
From the command line I noticed it was not finding ‘libcudnn_ops_train.so.8’.
More information below the workaround.
And I was able to fix this by…
- Stopping Jupyter Notebook
- Setting LD_LIBRARY_PATH (which I thought was already set via ld.so.conf.d; I will look into why it is not, this is a quick workaround). There is a small check sketch below this list if you want to confirm the libraries resolve first.
$ export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:$LD_LIBRARY_PATH
- Starting Jupyter Notebook again from that same shell
$ jupyter notebook
- Running the code as-is in Jupyter Notebook (the web GUI tool, which you obviously know)
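In case it is useful, here is a small sanity check I put together (my own sketch, not part of Lambda Stack) to confirm the cuDNN sub-libraries actually resolve with the current LD_LIBRARY_PATH before launching Jupyter:

# check_cudnn.py: verify that the cuDNN libraries TensorFlow dlopen()s at
# runtime can be resolved with the current LD_LIBRARY_PATH.
import ctypes
import os

libs = ["libcudnn_ops_train.so.8", "libcudnn_adv_train.so.8"]
print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<not set>"))
for name in libs:
    try:
        ctypes.CDLL(name)
        print("OK  ", name)
    except OSError as err:
        print("FAIL", name, "-", err)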
From the command line I just commented out the two nvidia-smi calls, since I was watching the GPUs both in nvtop and with:
$ nvidia-smi --query-gpu=index,pci.bus_id,fan.speed,utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv -l
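(A rough Python equivalent of that watch loop, purely as a sketch and in case it is handier than a second terminal; it assumes nvidia-smi is on PATH:)

# Poll nvidia-smi once a second and print the CSV utilization/temperature/power rows.
import subprocess
import time

query = ("--query-gpu=index,utilization.gpu,utilization.memory,"
         "temperature.gpu,power.draw")
while True:
    out = subprocess.run(["nvidia-smi", query, "--format=csv,noheader"],
                         capture_output=True, text=True)
    print(out.stdout, end="")
    time.sleep(1)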
Here is the longer form of the message from the command line:
2022-06-09 23:35:37.010639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 8063 MB memory: → device: 1, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:21:00.0, compute capability: 8.6
2022-06-09 23:35:38.611401: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8303
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
[lambda-dual:705714] *** Process received signal ***
[lambda-dual:705714] Signal: Aborted (6)
[lambda-dual:705714] Signal code: (-6)
… then it hung … until eventually the kernel in the Jupyter notebook died.
NOTE: you can turn off much of TensorFlow's noisy default logging with:
$ TF_CPP_MIN_LOG_LEVEL=3 python LSTM-hell.py
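The same thing can be done from inside a notebook or script, as long as it happens before TensorFlow is imported (just a sketch):

# Set the log level before the import; values: 0 = all, 1 = hide INFO,
# 2 = also hide WARNING, 3 = also hide ERROR.
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import tensorflow as tf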
The workaround:
$ export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:$LD_LIBRARY_PATH
$ python LSTM-hell.py
… tensorflow noisy messages …
2022-06-09 23:37:27.934136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 4138 MB memory: → device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6
2022-06-09 23:37:27.934442: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-09 23:37:27.934648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 8063 MB memory: → device: 1, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:21:00.0, compute capability: 8.6
2022-06-09 23:37:29.485452: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8303
1/1 [==============================] - 2s 2s/step - loss: 5.0529
wsadmin@AIML1001:/usr/lib/python3/dist-packages$ find tensorflow*/ libcudnn_ops_train.so.8
find: ‘libcudnn_ops_train.so.8’: No such file or directory
Hello, thanks for the detailed response. What is your LD_LIBRARY_PATH? Mine was not set before running export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:$LD_LIBRARY_PATH
so it became /usr/lib/python3/dist-packages/tensorflow:
And here is me running the workaround:
wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ printenv | grep LD_
LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:
wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ sudo TF_CPP_MIN_LOG_LEVEL=3 python3 LSTM-hell.py
No protocol specified
Traceback (most recent call last):
File "LSTM-hell.py", line 52, in <module>
regressor.fit(x, y_labels, batch_size=1)
File "/usr/lib/python3/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/lib/python3/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1, 1, 1, 2, 1, 1]
[[{{node CudnnRNN}}]]
[[sequential/lstm/PartitionedCall]] [Op:__inference_train_function_2337]
Here is the LSTM-hell.py file content:
wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ cat LSTM-hell.py
#!/usr/bin/env python
# coding: utf-8
# In[1]:
import tensorflow as tf
import numpy as np
import pandas as pd
import tensorflow.keras as k
from tensorflow.keras.layers import LSTM, Dense, Reshape, RepeatVector
from tensorflow.keras.models import Sequential
# In[2]:
y_posterior = lambda x: 2.71 ** x + 5 * x + 1.2
x = np.random.normal(0, 1, (1, 2))
y_labels = y_posterior(x)[:, 0]
# In[3]:
y_labels
# In[4]:
#get_ipython().system('nvidia-smi')
# In[5]:
regressor = Sequential()
# regressor.add(Dense(1, input_shape=(1,)))
regressor.add(LSTM(1,
batch_input_shape=(1, 2, 1)))
# regressor.add(Dense(1))
regressor.compile(optimizer = 'sgd', loss = 'mean_squared_error')
# In[6]:
regressor.fit(x, y_labels, batch_size=1)
# In[7]:
x.shape
# In[8]:
y_labels.shape
# In[9]:
x
# In[10]:
#get_ipython().system('nvidia-smi')
# In[ ]:
import sys
sys.version
That is a good point.
I should have mentioned that libcudnn_ops_train.so.8 is located in /usr/lib/python3/dist-packages/tensorflow if you have Lambda Stack installed.
If you do not have Lambda Stack, you would need to install cuDNN from NVIDIA; you have to register to get to the cuDNN download page.
cuDNN: CUDA Deep Neural Network (cuDNN) | NVIDIA Developer
And you need to make sure that wherever you install it, it can be found via LD_LIBRARY_PATH, and that the version is compatible with the driver and CUDA you have installed.
If you are running inside of Anaconda: Anaconda does not set up the library paths for packages.
You should be able to find the path; the likely locations would be:
- /usr
- /usr/local - that is only supposed to be used for local site software
- /opt/<vendor>/<product>/<version> - standard for vendor software
- ~/Anaconda3, or I use: find ${CONDA_PREFIX} -name 'libcudnn.so*'
$ LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH} jupyter notebook
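If it helps, here is a rough Python sketch of that same search over the candidate locations above (adjust the roots for your own install):

# Walk the likely install locations and print every cuDNN shared library found.
import glob
import os

roots = ["/usr/lib", "/usr/local", "/opt",
         os.path.expanduser("~/anaconda3"),
         os.environ.get("CONDA_PREFIX", "")]
for root in filter(None, roots):
    for path in glob.glob(os.path.join(root, "**", "libcudnn*.so*"), recursive=True):
        print(path)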
Let me know if that helps.
Or, if you run directly from Python (with the nvidia-smi lines commented out):
$ python LSTM-hell.py | tee -a LSTM-hell.txt
and send the LSTM-hell.txt. (Note that tee only captures stdout; the TensorFlow log messages and any traceback go to stderr, so they will still appear on the terminal rather than in the file.)
Apologies if I'm messing up anywhere along the steps to the workaround, but now I notice that I actually have the files under that directory:
wsadmin@AIML1001:/usr/lib/python3/dist-packages/tensorflow$ ls
_api distribute libcudnn_adv_infer.so.8 libcudnn_cnn_train.so.8 libcudnn.so.8 __pycache__
compiler include libcudnn_adv_train.so.8 libcudnn_ops_infer.so.8 libtensorflow_framework.so.2 python
core __init__.py libcudnn_cnn_infer.so.8 libcudnn_ops_train.so.8 lite tools
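(Looking at it again, I think my earlier find failed because I passed the library name as a second search path instead of using -name, so that error was about the argument, not the library. A quick check from Python also shows they are on disk:)

import glob
print(glob.glob("/usr/lib/python3/dist-packages/tensorflow/libcudnn*.so*"))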
And I updated Lambda Stack a week ago (this issue persisted despite the upgrade) using the command mentioned on the official website:
sudo apt-get update && sudo apt-get dist-upgrade
I don't use Anaconda.
Now when I run
wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow
wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ sudo python3 LSTM-hell.py | sudo tee -a LSTM-hell.txt
No protocol specified
2022-06-10 12:27:49.603201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.603442: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.603651: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.630590: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.630879: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.631090: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.631295: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.631495: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.631694: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.816451: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.816679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.816871: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.817064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.817246: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.817426: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.817607: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.817787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:49.817967: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.611852: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.612113: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.612329: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.612530: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.612729: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.612922: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.613117: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.613318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22278 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:29:00.0, compute capability: 8.6
2022-06-10 12:27:50.613597: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.613775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22278 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:41:00.0, compute capability: 8.6
2022-06-10 12:27:50.613972: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.614148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22265 MB memory: -> device: 2, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:61:00.0, compute capability: 8.6
2022-06-10 12:27:52.155343: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8303
2022-06-10 12:27:52.156458: E tensorflow/stream_executor/dnn.cc:868] <unknown cudnn status: 14>
in tensorflow/stream_executor/cuda/cuda_dnn.cc(2019): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2022-06-10 12:27:52.156485: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at cudnn_rnn_ops.cc:1563 : INTERNAL: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1, 1, 1, 2, 1, 1]
Traceback (most recent call last):
File "LSTM-hell.py", line 52, in <module>
regressor.fit(x, y_labels, batch_size=1)
File "/usr/lib/python3/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/lib/python3/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1, 1, 1, 2, 1, 1]
[[{{node CudnnRNN}}]]
[[sequential/lstm/PartitionedCall]] [Op:__inference_train_function_2337]
wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ cat LSTM-hell.txt
So the LSTM-hell.txt appears to be empty
The output is exactly the same after running
export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:
instead of
export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow
How do I check that?
How do I check that? I assume it would be the correct version, since the Lambda Stack update should have taken care of that, and I didn't encounter any noticeable errors when updating it.
I assume that I don't need to do this, since I'm not using Anaconda.
I decided to first fix the issue using plain Python, so I never did these steps; I assume they are non-critical to the workaround.
I think it will be easier to work with you directly. Please send an email to support@lambdal.com and I can pick this up with you from there.
It is good you now have the libcudnn.so and the other parts of Lambda Stack.
And that is working correctly on my machine, as long as I set the LD_LIBRARY_PATH.
I was giving you options, since it was not clear what you had installed or how you were running: (1) Lambda Stack, (2) without Lambda Stack, what you would need to do, and (3) in Anaconda. Lambda Stack is the simplest way.
There should not be any sudo required; as a side note, sudo typically strips LD_LIBRARY_PATH from the environment, which would explain why the export did not take effect in your sudo run. The 'tee -a' was just to capture stdout, and since the TensorFlow messages and the traceback go to stderr the file came out empty, so we can skip that.
$ export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow
or, setting it inline for a single run:
$ LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow python LSTM-hell.py
… lots of tensorflow informational messages …
1/1 [==============================] - 1s 1s/step - loss: 5.9683
or
$ TF_CPP_MIN_LOG_LEVEL=1 LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow python LSTM-hell.py
1/1 [==============================] - 1s 1s/step - loss: 5.9683
I just added the ‘TF_CPP_MIN_LOG_LEVEL=1’ to reduce the noise.
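If you want a quick sanity check before running the full script (just a sketch, run with the same LD_LIBRARY_PATH set), this should list the GPUs TensorFlow can see:

import os
os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "1")  # keep the output readable
import tensorflow as tf

print(tf.config.list_physical_devices("GPU"))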
NOTE: If I do not set the LD_LIBRARY_PATH:
$ TF_CPP_MIN_LOG_LEVEL=1 python LSTM-hell.py
Invalid MIT-MAGIC-COOKIE-1 key
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
[lambda-dual:867480] *** Process received signal ***
[lambda-dual:867480] Signal: Aborted (6)
[lambda-dual:867480] Signal code: (-6)
[lambda-dual:867480] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f7963139090]
[lambda-dual:867480] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f796313900b]
[lambda-dual:867480] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7963118859]
[lambda-dual:867480] [ 3] /usr/lib/python3/dist-packages/tensorflow/python/…/libcudnn.so.8(cudnnRNNForwardTraining+0x216)[0x7f78dc112af6]
[lambda-dual:867480] [ 4] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(cudnnRNNForwardTraining+0x8c)[0x7f7914fb758c]
…
This worked! Thank you so much