CUDA drivers not working on fresh 1xH100 instance

I get this error on a fresh gpu_1x_h100_pcie instance.

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The only actions I took on the machine were creating symlinks to the cuDNN libraries (Why can't my program find the NVIDIA cuDNN library? | Lambda Docs) and running `sudo apt upgrade`.

I also tried several other commands, all of which failed to find a GPU on the host.

I reproduced this on two separate instances.

Same issue here. H100 in Utah.

Hello @cudahell,

I was not able to reproduce this.
If you upgrade the kernel, you need to reboot the node, because the kernel modules and the userspace libraries end up at different versions. That, however, should produce a different error message:

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
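To tell the two failure modes apart, you can compare the version of the loaded kernel module against the installed userspace driver packages. A minimal sketch (paths and package names assume a standard Ubuntu/Debian install; adjust for other distros):

```shell
# Driver version reported by the loaded NVIDIA kernel module
# (the file is absent when the module is not loaded at all):
cat /proc/driver/nvidia/version 2>/dev/null \
  || echo "nvidia kernel module not loaded"

# Versions of the installed NVIDIA userspace packages:
dpkg -l 'libnvidia-compute-*' 'nvidia-driver-*' 2>/dev/null | grep '^ii' \
  || echo "no nvidia userspace packages installed"

# If the module is loaded but its version differs from the packages
# (typical after `apt upgrade`), a reboot loads the matching module:
#   sudo reboot
```

If the kernel module is missing entirely, as in the original report, rebooting alone may not help, which is what points toward a hardware problem rather than a version mismatch.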

You can send us your account information (email address) so we can see which bare-metal nodes hosted the VMs and match them against known (or as-yet-unknown) hardware errors.

If you prefer, we can handle this over a support ticket here:


@yanos I made another instance 5 hours later, and it worked fine. Just submitted a support ticket. Let me know if there’s any other useful info I can provide.

@cudahell, @md23
Thank you very much for letting us know about this.
This was a hardware issue on the GPUs.
You definitely helped a lot of other users from having the same frustration.

We are also working on adding a check to auto-blocklist GPUs with similar hardware failures.


I got the same issue; I have submitted a ticket.

This was the same issue.
Since availability on the H100s is low, this kind of hardware failure shows up more often: the failed instances are the only ones that eventually become available again.

We will hopefully have a fix for this soon so that failed GPUs don't become available again after the first failure.

I am seeing the same issue on a new node I got a few days ago.

Same issue here. First-time user of Lambda :frowning: