This is on a Lambda desktop. I just reinstalled the OS and put Lambda Stack on to get PyTorch running again. No joy. Before I go and start messing with drivers again, I was hoping someone might be able to tell me why a fresh install is having issues.
The installed CUDA version seems to be ahead of the one the latest PyTorch build supports.
This is a fresh OS and Lambda Stack install. I thought this would work out of the box.
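For reference, this is roughly the sanity check I was running (nothing Lambda-specific, just stock PyTorch calls) to see which CUDA build torch ships with and whether it can use the driver at all:

```python
import torch

# PyTorch version and the CUDA toolkit it was built against
print("PyTorch:", torch.__version__)
print("Built for CUDA:", torch.version.cuda)

# Whether this build can actually talk to the installed driver/GPUs
print("CUDA available:", torch.cuda.is_available())
```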
Solution:
OK, I figured it out. I was looking over the Additional Drivers section and saw that it thought it had installed drivers for a P5000 instead of an RTX 6000. WTF! I had also been wondering why my standard setup worked on another system with near-identical specs.
Turns out this box had been used by the deployment folks to test different graphics cards. So I opened it up and, sure enough, tucked beneath the two RTX 6000s was a P5000 in slot 3.
I pulled that card and it worked. I think it comes down to the RTX and Pxxxx series being different internal architectures, so the driver setup picked for the P5000 didn't work for the RTX 6000s.
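For anyone else who hits this, a quick way to spot a mixed box without opening the case (assuming the driver loads for the cards at all) is to enumerate what PyTorch sees and print each card's compute capability; a Pascal/Turing mix shows up right away, e.g. an RTX 6000 reports 7.5 while a P5000 reports 6.1:

```python
import torch

# List every GPU PyTorch can see, with its name and compute capability
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name} (compute capability {major}.{minor})")
```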
Final verdict: don't mix GPUs. Thanks for the help.