Summary
- Running an H100 (80GB PCIe) instance
- The command
nvidia-smi
doesn’t work: installing the NVIDIA driver failed, and hardware detection doesn’t work either.
- The H100 hardware is not detected when I run
lspci -k | grep -EA3 'VGA|3D|Display'
nor when I run
ubuntu-drivers devices
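One thing worth checking on the driver point: the only package in the listing under the reproduction steps is nvidia-driver-local-repo-ubuntu2204-535.161.08, which as far as I understand only registers a local apt repository and is not the driver itself. Below is a minimal sketch for telling the two apart, assuming Ubuntu's nvidia-driver-<branch> package naming. The function name has_nvidia_driver_package is hypothetical, not part of any tool. A missing driver package would explain the nvidia-smi failure, but not the missing PCI device.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads "<status> <package>" lines on stdin (the format
# produced by the dpkg-query call below) and succeeds only if a fully
# installed driver package such as nvidia-driver-535 is among them.
# nvidia-driver-local-repo-ubuntu2204-535.161.08 deliberately does not match.
has_nvidia_driver_package() {
    grep -qE '^install ok installed nvidia-driver-[0-9]+(-server)?(-open)?$'
}

# dpkg-query is used instead of `apt list` because apt warns that its CLI
# output is not stable for scripts.
if dpkg-query -W -f='${Status} ${Package}\n' 2>/dev/null | has_nvidia_driver_package; then
    echo "An nvidia-driver-<branch> package is installed"
else
    echo "No nvidia-driver-<branch> package is installed"
fi
```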
Reproduction of the error
- After starting the instance, I ran
nvidia-smi
and it returned this error:
ubuntu@209-20-157-137:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
- Tried installing the NVIDIA driver by running:
apt update
apt -y upgrade
apt -y install ubuntu-drivers-common
ubuntu-drivers devices
- When I check the installed driver:
root@209-20-157-137:~# apt --installed list | grep nvidia-driver
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
nvidia-driver-local-repo-ubuntu2204-535.161.08/now 1.0-1 amd64 [installed,local]
- I’m trying to test
threestudio
on the H100, so I built my Dockerfile and ran the command below, but it returns this error:
root@209-20-157-137:/home/ubuntu/A100/threestudio/docker# sudo docker compose up -d
[+] Running 0/1
⠇ Container docker-threestudio-1 Starting
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
root@209-20-157-137:/home/ubuntu/A100/threestudio/docker#
- H100 detection seems to have failed:
root@209-20-157-137:~# lspci -k | grep -EA3 'VGA|3D|Display'
01:00.0 VGA compatible controller: Red Hat, Inc. Virtio GPU (rev 01)
Subsystem: Red Hat, Inc. Virtio GPU
Kernel driver in use: virtio-pci
02:00.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
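The lspci output above lists only a Virtio GPU, so the same check can be made by PCI vendor ID rather than by device class. This is a minimal sketch, assuming lspci from pciutils is installed. It relies on 10de being NVIDIA's PCI vendor ID, and the function name has_nvidia_pci_device is hypothetical. If no 10de device is visible inside the VM, no driver installation can make nvidia-smi work, because there is nothing for the driver to bind to.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads `lspci -nn` output on stdin and succeeds only if
# a device with NVIDIA's PCI vendor ID (10de) appears, e.g. "[10de:xxxx]".
has_nvidia_pci_device() {
    grep -qiE '\[10de:[0-9a-f]{4}\]'
}

# `lspci -nn` prints numeric [vendor:device] IDs next to the device names.
if lspci -nn 2>/dev/null | has_nvidia_pci_device; then
    echo "An NVIDIA PCI device is visible to this machine"
else
    echo "No NVIDIA PCI device is visible to this machine"
fi
```

On the instance described here, I would expect the second message, given that the only display device listed is the Virtio GPU.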
Am I using the H100 correctly?
It seems the hardware cannot be detected at all…
We’ve paid for literally a whole day and couldn’t even use the H100 GPU.
Really disappointed :(
Here is the instance ID: 9f05500760184aba8e7535cc453b684e