CUDA drivers not working on fresh 1xH100 instance

I get this error on a fresh gpu_1x_h100_pcie instance.

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The only actions I took on the machine were creating symlinks to the cuDNN libraries (Why can't my program find the NVIDIA cuDNN library? | Lambda Docs) and running `sudo apt upgrade`.

I also tried several other commands, all of which failed to find a GPU on the host.

I reproduced this on two separate instances.

Same issue here. H100 in Utah.

Hello @cudahell,

I was not able to reproduce this.
If you upgrade the kernel, you need to reboot the node, because the kernel modules and the userspace libraries end up at different versions. That, however, should produce a different error message:

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
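To tell the two failure modes apart, you can compare the version of the loaded kernel module against the installed userspace driver packages. A minimal sketch (paths and package names assume a standard Ubuntu/Debian install; adjust for other distros):

```shell
# Driver version reported by the loaded NVIDIA kernel module
# (the file is absent when the module is not loaded at all):
cat /proc/driver/nvidia/version 2>/dev/null \
  || echo "nvidia kernel module not loaded"

# Versions of the installed NVIDIA userspace packages:
dpkg -l 'libnvidia-compute-*' 'nvidia-driver-*' 2>/dev/null | grep '^ii' \
  || echo "no nvidia userspace packages installed"

# If the module is loaded but its version differs from the packages
# (typical after `apt upgrade`), a reboot loads the matching module:
#   sudo reboot
```

If the kernel module is missing entirely, as in the original report, rebooting alone may not help, which is what points toward a hardware problem rather than a version mismatch.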

You can send us your account information (email address) so we can see which bare-metal nodes hosted the VMs and match them against known (or as-yet-unknown) hardware errors.

If you prefer, we can handle this over a support ticket here:


@yanos I made another instance 5 hours later, and it worked fine. Just submitted a support ticket. Let me know if there’s any other useful info I can provide.

@cudahell, @md23
Thank you very much for letting us know about this.
This was a hardware issue on the GPUs.
You definitely helped a lot of other users from having the same frustration.

We are also working on adding a check to auto-blocklist GPUs with similar hardware failures.


I got the same issue; I have submitted a ticket.

This was the same issue.
Since availability on the H100s is low, this kind of hardware failure shows up more often: the failed instances are the only ones that eventually become available again.

We will hopefully have a fix for this soon so that failed GPUs don't become available again after the first failure.

I am seeing the same issue on a new node I got a few days ago.

Same issue here. First-time user of Lambda :frowning: