H100 instance detection failed

Summary

  • Running an H100 (80GB PCIe) instance
  • The nvidia-smi command doesn’t work
  • Installing the NVIDIA driver failed, and device detection doesn’t work
  • The H100 hardware is not detected by either lspci -k | grep -EA3 'VGA|3D|Display' or ubuntu-drivers devices

Reproduction of the error

  1. After starting the instance, nvidia-smi returns this error:
    ubuntu@209-20-157-137:~$ nvidia-smi
    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
    
  2. I tried installing the NVIDIA driver by running:
    apt update
    apt -y upgrade
    apt -y install ubuntu-drivers-common
    ubuntu-drivers devices
    
  3. When I check the installed driver packages:
    root@209-20-157-137:~# apt --installed list | grep nvidia-driver
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    nvidia-driver-local-repo-ubuntu2204-535.161.08/now 1.0-1 amd64 [installed,local]
    
  4. I’m trying to test threestudio on the H100, so I built my Dockerfile and ran docker compose, but it returns this error:
    root@209-20-157-137:/home/ubuntu/A100/threestudio/docker# sudo docker compose up -d 
    [+] Running 0/1
     ⠇ Container docker-threestudio-1  Starting
    Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
    nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
    root@209-20-157-137:/home/ubuntu/A100/threestudio/docker# 
    
  5. H100 detection seems to fail; lspci shows only a Virtio GPU:
    root@209-20-157-137:~# lspci -k | grep -EA3 'VGA|3D|Display'
    01:00.0 VGA compatible controller: Red Hat, Inc. Virtio GPU (rev 01)
        Subsystem: Red Hat, Inc. Virtio GPU
        Kernel driver in use: virtio-pci
    02:00.0 SCSI storage controller: Red Hat, Inc. Virtio block device (rev 01)
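
A narrower check that may help here: NVIDIA's PCI vendor ID is 10de, so filtering lspci -nn output on that ID shows whether any NVIDIA device is exposed to the VM at all, regardless of how its device class is labelled. This is only a minimal sketch. The function name count_nvidia_devices is hypothetical, and it assumes pciutils is installed on the instance.

```shell
#!/usr/bin/env bash
# Count PCI devices whose vendor ID is 10de (NVIDIA) in `lspci -nn` output.
# The listing is read from stdin, so saved output can be checked as well.
count_nvidia_devices() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so `|| true` keeps the function's status at success.
    grep -ci '\[10de:[0-9a-f]\{4\}\]' || true
}

# Usage on the instance:
#   lspci -nn | count_nvidia_devices
```

A result of 0 would suggest that no NVIDIA device is visible on the PCI bus inside the VM, in which case installing or reinstalling the driver in the guest is unlikely to help.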
    

Am I using the H100 correctly?
It seems the hardware cannot be detected…
We’ve paid for a whole day and couldn’t use the H100 GPU at all.
Really disappointed :(

Here is the instance ID: 9f05500760184aba8e7535cc453b684e