CUDA drivers not working on fresh 1xH100 instance

I get this error on a fresh gpu_1x_h100_pcie instance.

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The only actions I took on the machine were creating symlinks to the cuDNN libraries (following "Why can't my program find the NVIDIA cuDNN library?" in the Lambda Docs) and running sudo apt upgrade.

I also tried several other commands which all failed to find a GPU on the host.

I reproduced this on two instances.

Same issue here, on an H100 in Utah.

Hello @cudahell,

I was not able to reproduce this.
If you upgrade the kernel, you need to reboot the node, because the loaded kernel modules and the upgraded libraries then have different versions. That, however, should produce a different error message:

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
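
To tell the two error messages in this thread apart, a minimal sketch like the one below may help. The function name `classify_nvidia_smi_output` and the labels it prints are hypothetical, chosen only for illustration; the matched strings come from the messages quoted here, and `/proc/driver/nvidia/version` is where the loaded kernel module normally reports its version.

```shell
#!/bin/sh
# Hypothetical helper: map the text printed by nvidia-smi to a rough cause.
classify_nvidia_smi_output() {
    case "$1" in
        *"Driver/library version mismatch"*)
            # Libraries were upgraded while the old kernel module is still loaded.
            echo "version-mismatch: reboot the instance" ;;
        *"communicate with the NVIDIA driver"*)
            # The kernel module is not loaded or cannot reach the GPU.
            echo "no-driver: check the kernel module, then contact support" ;;
        *)
            echo "unrecognized" ;;
    esac
}

# Run the live checks only when nvidia-smi exists on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    if out="$(nvidia-smi 2>&1)"; then
        echo "ok: nvidia-smi can talk to the driver"
    else
        classify_nvidia_smi_output "$out"
    fi
    # Version of the currently loaded kernel module, if any.
    cat /proc/driver/nvidia/version 2>/dev/null || echo "NVIDIA kernel module not loaded"
fi
```

If the result is the version mismatch, a reboot is the likely fix. If the driver cannot be reached at all on a fresh instance, as in the original post, a hardware problem is one possible cause and is worth reporting.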

You can send us your account information (email address) so we can see which bare-metal node the VMs were on and match it against known (or not yet known) hardware errors.

If you prefer, we can do this over a support ticket here: https://support.lambdalabs.com

Best,
Yanos

@yanos I made another instance 5 hours later, and it worked fine. Just submitted a support ticket. Let me know if there’s any other useful info I can provide.

@cudahell, @md23
Thank you very much for letting us know about this.
This was a hardware issue on the GPUs.
You definitely saved a lot of other users from the same frustration.

We are also working on adding a check to auto-blocklist GPUs with similar hardware failures.
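
For readers who want a similar check on their own instances, here is a hypothetical sketch. It is not Lambda's implementation, and the function name `gpu_is_healthy` is an assumption made for illustration. It treats a GPU listing as healthy only if the command exits successfully and reports at least one device, relying on `nvidia-smi -L` printing one `GPU <index>: ...` line per device.

```shell
#!/bin/sh
# Hypothetical health check, not Lambda's actual blocklist logic.
# Succeeds only if the listing command exits 0 and reports at least one GPU.
gpu_is_healthy() {
    # "$@" is the command used to list GPUs; default to `nvidia-smi -L`.
    if [ "$#" -eq 0 ]; then
        set -- nvidia-smi -L
    fi
    # A failing or missing command counts as unhealthy.
    listing="$("$@" 2>/dev/null)" || return 1
    # Require at least one "GPU <index>: ..." line in the listing.
    printf '%s\n' "$listing" | grep -q '^GPU [0-9]'
}
```

Calling `gpu_is_healthy` with no arguments right after boot gives a quick yes/no answer from its exit status. Passing a different command in place of `nvidia-smi -L` makes the check easy to exercise on a machine without a GPU.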

Hi

I got the same issue and have submitted a ticket.

This was the same issue.
Since availability of the H100s is low, this kind of hardware failure shows up more often, because the failed instances are the only ones that eventually become available.

We will hopefully have a fix for this soon so that they don’t become available after the first failure.

I am seeing the same issue as well on a new node I got a few days ago.

Same issue here. First-time user of Lambda :frowning: