Lambda Stack has a PyTorch/CUDA version incompatibility?

I am experiencing a seemingly similar problem. I’ve had my Lambda Tensorbook for about a year, and a few times in that period it has suddenly stopped recognizing the GPU. This usually happens after the laptop goes to sleep (power cord removed and left to idle) and is then rebooted, but it has also happened after the machine was simply turned off for a couple of days and then booted normally. When this occurs, nvidia-smi usually fails to run at all (missing components), and sudo nvidia-settings reports assertion failures.

To resolve it, after some trial and error, I resort to updating the NVIDIA driver, usually by updating Lambda Stack with sudo apt-get update && sudo apt-get dist-upgrade. A reboot after that restores everything to working order.

I’m certainly in favor of keeping drivers and other packages updated, but I don’t understand why a working system would suddenly stop working when nothing has changed. Something in the GPU/CUDA configuration seems very fragile, especially when power is disrupted. How can I make my installation more robust?
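For reference, the symptom check and the workaround I described can be sketched as a small shell script. This is my own hypothetical helper, not anything shipped with Lambda Stack; it only detects the broken-driver state and prints the upgrade command rather than running it:

```shell
#!/bin/sh
# Hypothetical health check for the failure mode described above:
# nvidia-smi either missing or failing to run at all.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    status="driver OK"
else
    status="driver broken"
    # The fix that has worked for me: upgrade Lambda Stack, then reboot.
    echo "Try: sudo apt-get update && sudo apt-get dist-upgrade, then reboot"
fi
echo "$status"
```

Running this after a suspend/reboot cycle would tell me immediately whether the driver has broken again, without waiting for a training job to fail.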
Thank you.