VRAM-hungry LSTM monster

Hello, I ran into an issue that so far seems to show up only on the Lambda workstation.
Here is the link to the issue.

You have to love the name… VRAM-hungry LSTM monster.

I took a quick look. I was able to reproduce both the hang/dead kernel in a Jupyter notebook and the failure on the command line.

From the command line I noticed it was failing to find ‘libcudnn_ops_train.so.8’.
More information is included below, after the workaround.

I was able to work around it as follows:

  1. Stop Jupyter Notebook.
  2. Set LD_LIBRARY_PATH (which I thought was already handled via ld.so.conf.d):
    $ export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:$LD_LIBRARY_PATH
    • I will look into why; this is just a quick workaround (see the loader check sketch after this list).
  3. Start Jupyter Notebook:
    $ jupyter notebook
  4. Run the code as-is in the Jupyter Notebook web GUI - which you obviously know :slight_smile:
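To confirm whether the dynamic loader can actually find the cuDNN pieces (before or after setting LD_LIBRARY_PATH), a minimal Python check like the one below works - just a sketch, not part of the original repro; the library names are the ones from the error message further down:

import ctypes

# Ask the dynamic loader for the cuDNN libraries TensorFlow needs; a failure
# here reproduces the "cannot open shared object file" symptom without
# starting a training run.
for lib in ("libcudnn_ops_train.so.8", "libcudnn_adv_train.so.8"):
    try:
        ctypes.CDLL(lib)
        print(lib, "found by the loader")
    except OSError as err:
        print(lib, "NOT found:", err)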

From the command line I just commented out the two nvidia-smi calls, since I was already watching the GPUs both in nvtop and with:
$ nvidia-smi --query-gpu=index,pci.bus_id,fan.speed,utilization.gpu,utilization.memory,temperature.gpu,power.draw --format=csv -l

Here is the longer form of the message from the command line:
2022-06-09 23:35:37.010639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 8063 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:21:00.0, compute capability: 8.6
2022-06-09 23:35:38.611401: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8303
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
[lambda-dual:705714] *** Process received signal ***
[lambda-dual:705714] Signal: Aborted (6)
[lambda-dual:705714] Signal code: (-6)
… then it hung … until eventually the kernel in the Jupyter notebook died.

NOTE: you can turn off much of TensorFlow's noisy default logging with:
$ TF_CPP_MIN_LOG_LEVEL=3 python LSTM-hell.py
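If you would rather do this from inside the script, here is a minimal sketch (assuming the variable is set before the tensorflow import, otherwise it has no effect):

import os

# 0 = all messages, 1 = hide INFO, 2 = also hide WARNING, 3 = also hide ERROR
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import tensorflow as tf  # must be imported after the variable is set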

The workaround:
$ export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:$LD_LIBRARY_PATH
$ python LSTM-hell.py
… tensorflow noisy messages …
2022-06-09 23:37:27.934136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 4138 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6
2022-06-09 23:37:27.934442: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-09 23:37:27.934648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 8063 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:21:00.0, compute capability: 8.6
2022-06-09 23:37:29.485452: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8303
1/1 [==============================] - 2s 2s/step - loss: 5.0529
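Once the library loads, a quick sanity check from Python (a sketch, not part of the thread's script) is to list the GPUs TensorFlow can see:

import tensorflow as tf

# Should list both GPUs on this machine if CUDA/cuDNN initialized cleanly.
print(tf.config.list_physical_devices("GPU"))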

wsadmin@AIML1001:/usr/lib/python3/dist-packages$ find tensorflow*/ libcudnn_ops_train.so.8
find: ‘libcudnn_ops_train.so.8’: No such file or directory

Hello, thanks for the detailed response. What is your LD_LIBRARY_PATH? Mine was not set before running export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:$LD_LIBRARY_PATH, so it became /usr/lib/python3/dist-packages/tensorflow:
And here is me running the workaround:

wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ printenv | grep LD_
LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow:
wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ sudo TF_CPP_MIN_LOG_LEVEL=3 python3 LSTM-hell.py
No protocol specified
Traceback (most recent call last):
  File "LSTM-hell.py", line 52, in <module>
    regressor.fit(x, y_labels, batch_size=1)
  File "/usr/lib/python3/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/lib/python3/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1, 1, 1, 2, 1, 1]
         [[{{node CudnnRNN}}]]
         [[sequential/lstm/PartitionedCall]] [Op:__inference_train_function_2337]

Here is the LSTM-hell.py file content:

wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ cat LSTM-hell.py
#!/usr/bin/env python
# coding: utf-8

# In[1]:


import tensorflow as tf
import numpy as np
import pandas as pd
import tensorflow.keras as k
from tensorflow.keras.layers import LSTM, Dense, Reshape, RepeatVector
from tensorflow.keras.models import Sequential


# In[2]:


y_posterior = lambda x: 2.71 ** x + 5 * x + 1.2

x = np.random.normal(0, 1, (1, 2))
y_labels = y_posterior(x)[:, 0]


# In[3]:


y_labels


# In[4]:


#get_ipython().system('nvidia-smi')


# In[5]:


regressor = Sequential()

# regressor.add(Dense(1, input_shape=(1,)))
regressor.add(LSTM(1,
    batch_input_shape=(1, 2, 1)))
# regressor.add(Dense(1))

regressor.compile(optimizer = 'sgd', loss = 'mean_squared_error')


# In[6]:


regressor.fit(x, y_labels, batch_size=1)


# In[7]:


x.shape


# In[8]:


y_labels.shape


# In[9]:


x


# In[10]:


#get_ipython().system('nvidia-smi')


# In[ ]:


import sys
sys.version

That is a good point.
I should have mentioned that libcudnn_ops_train.so.8 is located in /usr/lib/python3/dist-packages/tensorflow if you have Lambda Stack installed.

If you do not have Lambda Stack, you would need to install cuDNN from NVIDIA; you have to register to get to the cuDNN download page.

cuDNN: CUDA Deep Neural Network (cuDNN) | NVIDIA Developer

And you need to make sure that wherever you install it, the directory is visible through LD_LIBRARY_PATH, and that the cuDNN version is compatible with the driver and CUDA version you have installed.
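One way to check what your TensorFlow build expects (a sketch; tf.sysconfig.get_build_info() is available in recent TensorFlow 2.x releases) is to print its build info and compare against what nvidia-smi reports:

import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("TensorFlow:", tf.__version__)
print("Built against CUDA:", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))
# Compare the CUDA version above with the driver/CUDA version shown by nvidia-smi.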

If you are running inside Anaconda: Anaconda does not set up the library paths for packages.

You should be able to find the path; the likely locations would be (see the search sketch after this list):

  • /usr
  • /usr/local - which is only supposed to be used for a site's local software
  • /opt/<vendor>/<product>/<version> - the standard location for vendor software
  • ~/Anaconda3, or I use: find ${CONDA_PREFIX} -name 'libcudnn.so*'
    LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH} jupyter notebook
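If you would rather search from Python, here is a rough sketch over those likely prefixes (the prefix list is only a guess - adjust it for your system; the recursive search can take a while on large trees):

import glob
import os

prefixes = [
    "/usr/lib",
    "/usr/local",
    "/opt",
    os.path.expanduser("~/anaconda3"),
    os.environ.get("CONDA_PREFIX", ""),
]

for prefix in filter(None, prefixes):
    # Recursively look for any libcudnn shared object under each prefix.
    for hit in glob.glob(os.path.join(prefix, "**", "libcudnn*.so*"), recursive=True):
        print(hit)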

Let me know if that helps.

Or, if you run it directly from Python (with the nvidia-smi lines commented out):
python LSTM-hell.py | tee -a LSTM-hell.txt

And send me the LSTM-hell.txt file.

Apologies if I'm messing up somewhere along the workaround steps, but I now notice that I actually do have the files under that directory:

wsadmin@AIML1001:/usr/lib/python3/dist-packages/tensorflow$ ls
_api      distribute   libcudnn_adv_infer.so.8  libcudnn_cnn_train.so.8  libcudnn.so.8                 __pycache__
compiler  include      libcudnn_adv_train.so.8  libcudnn_ops_infer.so.8  libtensorflow_framework.so.2  python
core      __init__.py  libcudnn_cnn_infer.so.8  libcudnn_ops_train.so.8  lite                          tools

And I updated Lambda Stack a week ago (the issue persisted despite the upgrade) using the command mentioned on the official website:

sudo apt-get update && sudo apt-get dist-upgrade

I don't use Anaconda.

Now when I run:

wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow
wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ sudo python3 LSTM-hell.py | sudo tee -a LSTM-hell.txt
No protocol specified
2022-06-10 12:27:49.603201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
… the same NUMA node message repeated many more times …
2022-06-10 12:27:50.613318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22278 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:29:00.0, compute capability: 8.6
2022-06-10 12:27:50.613597: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.613775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22278 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:41:00.0, compute capability: 8.6
2022-06-10 12:27:50.613972: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-06-10 12:27:50.614148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22265 MB memory:  -> device: 2, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:61:00.0, compute capability: 8.6
2022-06-10 12:27:52.155343: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8303
2022-06-10 12:27:52.156458: E tensorflow/stream_executor/dnn.cc:868] <unknown cudnn status: 14>
in tensorflow/stream_executor/cuda/cuda_dnn.cc(2019): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2022-06-10 12:27:52.156485: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at cudnn_rnn_ops.cc:1563 : INTERNAL: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1, 1, 1, 2, 1, 1]
Traceback (most recent call last):
  File "LSTM-hell.py", line 52, in <module>
    regressor.fit(x, y_labels, batch_size=1)
  File "/usr/lib/python3/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/lib/python3/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 1, 1, 1, 2, 1, 1]
         [[{{node CudnnRNN}}]]
         [[sequential/lstm/PartitionedCall]] [Op:__inference_train_function_2337]
wsadmin@AIML1001:/home/mher/projects/Untitled Folder$ cat LSTM-hell.txt

So the LSTM-hell.txt appears to be empty (presumably because all of that output went to stderr, which the pipe into tee does not capture).
The output is exactly the same after running
export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow: instead of
export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow


How do I check that the directory can be seen through LD_LIBRARY_PATH?

How do I check that? I assume it would be the correct version since the stack update should’ve taken care of that, and I didn’t encounter any noticeable errors when updating it.

I assume that I don't need to do this since I'm not using Anaconda.


I decided to first fix the issue using plain Python, so I never did these steps; I assume they are not critical to the workaround.

I think it will be easier working with you directly. Please send an email to support@lambdal.com and I can take it from there.

It is good that you now have libcudnn.so and the other parts of Lambda Stack.
The test case works correctly on my machine as long as I set LD_LIBRARY_PATH.

I was giving you options, since it was not clear what you had installed or how you were running things: (1) with Lambda Stack, (2) without Lambda Stack and what you would need to do in that case, or (3) inside Anaconda. Lambda Stack is the simplest way.

There should not be any sudo required (sudo typically strips LD_LIBRARY_PATH from the environment anyway). The ‘tee -a’ was just to capture stdout, so we can skip that.
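A quick way to confirm what the Python process actually sees - a sketch you could paste at the top of LSTM-hell.py, not something from the original script:

import os

# Print the library path as the Python process sees it; under sudo this will
# usually come back empty because sudo strips LD_LIBRARY_PATH.
print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<not set>"))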

$ export LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow
$ python LSTM-hell.py

or, without exporting, set it inline for a single run:

$ LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow python LSTM-hell.py
… lots of tensorflow informational messages …
1/1 [==============================] - 1s 1s/step - loss: 5.9683

or
$ TF_CPP_MIN_LOG_LEVEL=1 LD_LIBRARY_PATH=/usr/lib/python3/dist-packages/tensorflow python LSTM-hell.py
1/1 [==============================] - 1s 1s/step - loss: 5.9683

I just added the ‘TF_CPP_MIN_LOG_LEVEL=1’ to reduce the noise.

NOTE: If I do not set the LD_LIBRARY_PATH:
$ TF_CPP_MIN_LOG_LEVEL=1 python LSTM-hell.py
Invalid MIT-MAGIC-COOKIE-1 key
Could not load library libcudnn_adv_train.so.8. Error: libcudnn_ops_train.so.8: cannot open shared object file: No such file or directory
[lambda-dual:867480] *** Process received signal ***
[lambda-dual:867480] Signal: Aborted (6)
[lambda-dual:867480] Signal code: (-6)
[lambda-dual:867480] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f7963139090]
[lambda-dual:867480] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f796313900b]
[lambda-dual:867480] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7963118859]
[lambda-dual:867480] [ 3] /usr/lib/python3/dist-packages/tensorflow/python/…/libcudnn.so.8(cudnnRNNForwardTraining+0x216)[0x7f78dc112af6]
[lambda-dual:867480] [ 4] /usr/lib/python3/dist-packages/tensorflow/python/…/libtensorflow_framework.so.2(cudnnRNNForwardTraining+0x8c)[0x7f7914fb758c]


This worked! Thank you so much