Pytorch sometimes fails to recognize GPU

I’m using PyTorch on a Tensorbook, and sometimes it fails to see the GPU. That is, it becomes true that torch.cuda.is_available() == False. So far, I’ve found that rebooting reliably fixes the problem, but that’s of course not a very satisfactory long-term approach. Some googling indicates that the usual way of fixing the problem is to change the versions of CUDA libraries or other components (by reinstalling them). I’d rather not do this as I’m using the lambda stack and I don’t want to then have a non-standard installation. Is anyone else seeing this? I can post versions of various components if that’s helpful.

Have you done any other installations alongside your Lambda Stack install? You can always email support@lambdalabs.com to report a Lambda Stack bug. Not sure if it’s a bug on our end. Is it a clean Ubuntu 20.04 LTS + Lambda Stack install?

Thanks. Yes, it’s a brand-new Tensorbook. I installed my project using conda so it has local (to the project) versions of python (3.7.7) and pytorch (1.6.0). My conda environment doesn’t do anything with CUDA (I rely on the pytorch official Docker images when I deploy to the cloud).

Do you have this problem when you’re outside of conda and use the Lambda Stack versions of pytorch?

I do see the same with /usr/bin/python, is that the correct python? I’m not sure where Lamba Stack installs stuff.

Can you try updating to the latest version of Lambda Stack with this command:

sudo apt-get update && sudo apt-get upgrade -y

Thanks. I’ve done that, it didn’t immediately fix the problem (CUDA was still unavailable) so I rebooted. I’ll let you know if the problem reoccurs post-reboot.

Did this fix the problem for you? Would be great to know!

Unfortunately, I did the updates and it didn’t fix the problem :(, I still have to reboot sometimes to “reactivate” CUDA wrt PyTorch.