Lambda Stack appears to be installing incompatible versions of CUDA and torchvision, or else torchvision was built without CUDA support. I have been following the instructions here to train a YOLOv7 model:
(It doesn't matter whether I skip the whole venv setup or not; the result is the same.)
I have tested this both inside a Lambda docker container and outside it; it makes no difference either way. I get as far as running the horses detection example, get some output, and then an error:
Fusing layers...
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
RepConv.fuse_repvgg_block
Model Summary: 306 layers, 36905341 parameters, 6652669 gradients
Convert model to Traced-model...
traced_script_module saved!
model is traced!
<snipped long traceback>
NotImplementedError: Could not run 'torchvision::nms' with arguments from the 'CUDA' backend. This could be because
the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible
resolutions. 'torchvision::nms' is only available for these backends: [CPU, QuantizedCPU, BackendSelect, Python,
FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView,
AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU,
AutogradLazy, Tracer, AutocastCPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode,
FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].
If, on the other hand, I spin up a fresh Ubuntu docker container with no Lambda Stack, install Python and pip with apt, and pip install everything listed in the yolov7 requirements.txt, then I can run detect.py with no errors. (I still haven't investigated whether it runs on the CPU or the GPU in this case.) This leads me to conclude that there is an incompatibility somewhere in Lambda Stack itself.
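To settle whether the fresh-container run actually used the GPU, a one-liner before (or inside) detect.py is enough; if CUDA is unavailable, the pip-installed torchvision never hits the CUDA nms path, which would explain why it "works":

```python
# If this prints False, detect.py silently fell back to the CPU,
# so the absence of the nms error proves nothing about CUDA support.
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```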