Lambda Tensorbook - Unable to recognize GPU with PyTorch

I’m unable to get PyTorch to recognize the GPU after installing it. Initially, rebooting made the GPU visible again, but that no longer works.

I have followed the instructions at this post to create a Conda environment and install PyTorch with GPU support:

$ conda create -n pytorch-gpu python=3.8
$ conda activate pytorch-gpu
$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

$ python
Python 3.8.12 (default, Oct 12 2021, 13:49:34)
[GCC 7.5.0] :: Anaconda, Inc. on linux

>>> import torch
>>> torch.cuda.is_available()
False
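
A slightly more detailed check than the one above can help tell a CPU-only PyTorch build apart from a driver problem. This is only a minimal sketch: the helper name `describe_torch` is made up for illustration, and it assumes nothing beyond the standard `torch.__version__`, `torch.version.cuda` and `torch.cuda.is_available()` attributes.

```python
def describe_torch():
    """Return a small dict describing the installed PyTorch build, if any.

    `describe_torch` is a hypothetical helper name, not part of PyTorch.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        # CUDA version the build was compiled against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # False if the build is CPU-only or no working driver/GPU is visible.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` shows a CUDA version but `cuda_available` is still False, the PyTorch build itself is probably fine and the driver side is the more likely culprit.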

I have also tried rebooting and creating a new environment, as suggested in the following thread, and that does not work:
Pytorch sometimes fails to recognize GPU

Any direction would be greatly appreciated.

Thanks,
Jay

Note: I have also upgraded the system:
$ sudo apt-get update && sudo apt-get upgrade -y

And still no GPU.

Hi,
Are you able to run nvidia-smi?

No. I get the following. I have not tried changing the installation.

Command ‘nvidia-smi’ not found, but can be installed with:

sudo apt install nvidia-utils-435 # version 435.21-0ubuntu7, or
sudo apt install nvidia-utils-440 # version 440.82+really.440.64-0ubuntu6
sudo apt install nvidia-340 # version 340.108-0ubuntu5.20.04.2
sudo apt install nvidia-utils-390 # version 390.144-0ubuntu0.20.04.1
sudo apt install nvidia-utils-450-server # version 450.172.01-0ubuntu0.20.04.1
sudo apt install nvidia-utils-470 # version 470.103.01-0ubuntu0.20.04.1
sudo apt install nvidia-utils-470-server # version 470.103.01-0ubuntu0.20.04.1
sudo apt install nvidia-utils-510 # version 510.47.03-0ubuntu0.20.04.1
sudo apt install nvidia-utils-418-server # version 418.226.00-0ubuntu0.20.04.2
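
The same check can be scripted, which may be handy when comparing before and after a reinstall. This is a rough sketch that only assumes `nvidia-smi` would be on the PATH when the driver utilities are installed; the function name `run_nvidia_smi` is hypothetical.

```python
import shutil
import subprocess


def run_nvidia_smi(tool="nvidia-smi"):
    """Return the tool's output, or None if it is missing or fails.

    `run_nvidia_smi` is an illustrative helper, not an NVIDIA or Lambda API.
    """
    path = shutil.which(tool)
    if path is None:
        # Same situation as the "Command not found" message above.
        return None
    try:
        result = subprocess.run(
            [path], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # A non-zero exit code usually means the tool could not talk to the driver.
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    output = run_nvidia_smi()
    print(output if output is not None else "nvidia-smi is missing or failed")
```

A `None` result only says that the utility is absent or could not reach the driver; it does not say which of the two it is.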

Thanks

OK, this is something that happened to me after an Ubuntu server/desktop upgrade. For some reason the NVIDIA drivers installed by Lambda Stack get overridden. A solution that is not optimal, but that works for me, is to reinstall Lambda Stack for servers and reboot.

LAMBDA_REPO=$(mktemp) && \
wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} && \
sudo apt-get update && \
sudo apt-get --yes upgrade && \
sudo apt-get install --yes --no-install-recommends lambda-server && \
sudo apt-get install --yes --no-install-recommends nvidia-headless-470-server && \
sudo apt-get install --yes --no-install-recommends nvidia-fabricmanager-470 && \
sudo apt-get install --yes --no-install-recommends lambda-stack-cuda

Then reboot:

sudo reboot now

Tell me if this works.

Reinstalling the Lambda stack did not help. But thanks, I really appreciate your help.

Should I reinstall the NVIDIA drivers?

Seems like something is out of sync.

Thanks.

Just to understand: is this Ubuntu desktop or server?

Ubuntu desktop (Tensorbook)

Are you able to do a clean reinstall? I’ve found that the best way to install Lambda Stack is to install Ubuntu without letting the installer search for drivers (untick that option during installation) and then install Lambda Stack.
This is surely a driver problem, since you are not able to run nvidia-smi. If you can’t do a clean reinstall, try purging all NVIDIA drivers and reinstalling Lambda Stack (the desktop variant, not the server one).

But it looks like you are missing basic packages such as nvidia-utils, so that needs to be
resolved first. Also make sure the driver itself is installed.

Also, from the output above, it looks like you do not have Lambda Stack installed: the suggested packages all point to the stock Ubuntu repositories.

  1. Is this a tensorbook, desktop or a server?

  2. Check to see if the kernel driver is loaded:
    $ lsmod | grep nvidia

  3. Check to make sure the nvidia packages are installed (and from which repository)
    (normally around 30-40 packages, or you may have old stale packages)
    $ dpkg --list | grep nvidia

  4. You can clean up what is there and reinstall Lambda Stack:
    Lambda Stack: an AI software stack that's always up-to-date
    * The packages to install differ depending on whether this is a desktop or a server.
    (You do not need headless or fabric manager on a desktop.)
    a. To remove Lambda stack:
    $ sudo rm -f /etc/apt/sources.list.d/{graphics,nvidia,cuda}*;
    COLUMNS=200 dpkg -l |awk '/cuda|lib(accinj64|cu(blas|dart|dnn|fft|inj|pti|rand|solver|sparse)|magma|nccl|npp|nv[^p])|nv(idia|ml)|tensor(flow|board)|torch/ { print $2 }' |
    sudo xargs -or apt -y remove --purge

    b. To reinstall Lambda Stack on a desktop:
    LAMBDA_REPO=$(mktemp) &&
    wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb &&
    sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} &&
    sudo apt-get -y update && sudo apt-get -y install lambda-stack-cuda
    c. To reinstall Lambda Stack on a server:
    LAMBDA_REPO=$(mktemp) &&
    wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb &&
    sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} &&
    sudo apt-get update &&
    sudo apt-get --yes upgrade &&
    sudo apt-get install --yes --no-install-recommends lambda-server &&
    sudo apt-get install --yes --no-install-recommends nvidia-headless-470-server &&
    sudo apt-get install --yes --no-install-recommends nvidia-fabricmanager-470 &&
    sudo apt-get install --yes --no-install-recommends lambda-stack-cuda
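
The checks in steps 2 and 3 can also be scripted if you want to compare the state before and after a reinstall. What follows is a minimal sketch, not an official Lambda tool: the function names are made up, and it assumes the usual column layout of `lsmod` (module name first) and `dpkg --list` (status, name, version).

```python
import shutil
import subprocess


def loaded_nvidia_modules(lsmod_text):
    """Return loaded kernel modules whose name starts with 'nvidia'."""
    modules = []
    for line in lsmod_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("nvidia"):
            modules.append(fields[0])
    return modules


def installed_nvidia_packages(dpkg_text):
    """Return (name, version) for installed packages with 'nvidia' in the name.

    Only rows with status 'ii' (fully installed) are kept, so stale 'rc'
    entries are skipped. Unlike `grep nvidia`, descriptions are not matched.
    """
    packages = []
    for line in dpkg_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "ii" and "nvidia" in fields[1]:
            packages.append((fields[1], fields[2]))
    return packages


def command_output(args):
    """Run a command and return its stdout, or '' if it is unavailable."""
    if shutil.which(args[0]) is None:
        return ""
    try:
        done = subprocess.run(args, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return done.stdout


if __name__ == "__main__":
    print("modules:", loaded_nvidia_modules(command_output(["lsmod"])))
    for name, version in installed_nvidia_packages(command_output(["dpkg", "--list"])):
        print(name, version)
```

An empty module list together with a short or empty package list would match the symptoms in this thread (driver not loaded, utilities missing). To see which repository a package comes from, `apt-cache policy <package>` is the usual place to look.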