Lambda Stack has a PyTorch/CUDA version incompatibility?

Lambda Stack appears to be installing incompatible versions of CUDA and torchvision, or else torchvision was built without CUDA enabled. I have been following the instructions here to train a yolov7 model:

(It doesn’t matter whether I skip the whole venv setup or not; the result is the same.)

I have tested this both inside a Lambda Docker container and outside of one; it makes no difference either way. I get as far as running the horses detection example, see some output, and then hit an error:

Fusing layers... 
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
Model Summary: 306 layers, 36905341 parameters, 6652669 gradients
 Convert model to Traced-model... 
 traced_script_module saved! 
 model is traced!

<snipped long traceback>

NotImplementedError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. This could be because
the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible 
resolutions. 'torchvision::nms' is only available for these backends: [CPU, QuantizedCPU, BackendSelect, Python, 
FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, 
AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, 
AutogradLazy, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, 
FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
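
To narrow this down, I’d take yolov7 out of the picture and call the failing operator directly. This is a minimal sketch, assuming the same Python environment that produced the error above; on a CUDA-enabled torchvision build it should print the index of the higher-scoring box, and on a CPU-only build it should reproduce the same NotImplementedError:

import torch
import torchvision

print(torch.__version__, torch.version.cuda)   # PyTorch build and the CUDA it was compiled with
print(torchvision.__version__)                 # torchvision build

# Two overlapping boxes on the GPU; nms() should keep only the higher-scoring one.
boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.]], device="cuda")
scores = torch.tensor([0.9, 0.8], device="cuda")

# On a torchvision build without CUDA kernels this raises the same
# NotImplementedError for torchvision::nms as detect.py does.
print(torchvision.ops.nms(boxes, scores, iou_threshold=0.5))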

If, on the other hand, I spin up a fresh Ubuntu Docker container with no Lambda Stack, install Python and pip with apt, and pip install everything listed in the yolov7 requirements.txt, then I can run detect.py with no errors. (I still haven’t investigated whether it runs on the CPU or the GPU in that case.) This leads me to conclude that there is an incompatibility somewhere in the Lambda Stack itself.
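
For the fresh-container case, one quick way to tell whether detect.py actually used the GPU or quietly fell back to the CPU would be to check what the pip-installed wheels report (a small sketch, run inside that same container):

import torch
import torchvision

# The default CUDA-enabled pip wheels for torch bundle their own CUDA runtime
# libraries, so this can succeed even without a system CUDA toolkit in the container.
print("CUDA available:", torch.cuda.is_available())
print("torch built with CUDA:", torch.version.cuda)   # None for a CPU-only wheel
print("torchvision:", torchvision.__version__)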

Edit: Here is what torch.utils.collect_env says:

$ python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.19.0-38-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080
Nvidia driver version: 525.89.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] torch==1.13.1
[pip3] torchvision==0.14.1
[conda] Could not collect
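
For what it’s worth, torch 1.13.1 and torchvision 0.14.1 are the officially paired releases, so the version numbers themselves look fine; the real question is whether this particular torchvision package was compiled with CUDA kernels. A rough check, assuming the version-attribute layout of recent torchvision releases (hence the getattr guard):

import torch
from torchvision import version as tv_version

# torchvision records the CUDA version it was compiled against; a CPU-only
# build typically reports None here even though torch itself was built with 11.8.
print("torch CUDA:", torch.version.cuda)
print("torchvision CUDA:", getattr(tv_version, "cuda", "unknown"))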

I have the same problem. nvidia-smi shows CUDA 11.7, while nvcc --version shows CUDA 11.6.

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:06:00.0 Off |                    0 |
|  0%   32C    P8    20W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
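
As an aside, a mismatch like that between nvidia-smi and nvcc is normal on its own: nvidia-smi reports the highest CUDA version the installed driver supports, while nvcc reports the system toolkit. Neither of those is what the torchvision::nms error depends on; that comes down to the CUDA support compiled into the torch and torchvision builds themselves, which you can print directly (a small sketch):

import torch
import torchvision

# These are the versions that matter for the NotImplementedError above,
# independent of what nvidia-smi or nvcc report.
print("torch:", torch.__version__, "built with CUDA", torch.version.cuda)
print("torchvision:", torchvision.__version__)
print("CUDA available at runtime:", torch.cuda.is_available())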

Did anyone figure this out? The whole point of using Lambda, for me, is to avoid exactly this kind of issue.

I am experiencing a seemingly similar problem. I’ve had my Lambda Tensorbook for about a year now, and a few times in that period it has suddenly stopped recognizing the GPU. This normally happens after the computer goes to sleep (power cord removed and left to idle) and is then rebooted, but it has also happened after the machine was simply turned off for a couple of days and then booted normally. nvidia-smi usually doesn’t run at all (missing components), and sudo nvidia-settings reports some assertion failures.

To resolve this, after playing around a bit, I resort to updating the NVIDIA driver (usually by updating Lambda Stack with sudo apt-get update && sudo apt-get dist-upgrade). A reboot after that sets everything working again. I’m certainly in favor of keeping drivers and other packages updated, but I don’t understand why a working system would suddenly stop working when nothing has changed. It seems that something in the GPU/CUDA configuration is very fragile, especially when power is disrupted. How can I make my installation more robust?

Thank you.