Nvidia-smi freezes/hangs indefinitely

anie · August 17, 2018, 8:14pm

We purchased a Lambdal Quad earlier this summer. It had been working just fine for months, but all of a sudden, yesterday it broke: when we type nvidia-smi, it hangs/freezes indefinitely! We also can’t import tensorflow or torch (both of them also hang indefinitely).

------ Update

After waiting for a long while, we get this error message when we type nvidia-smi:

xxx@arthur2:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:05:00.0: Unknown Error
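
For anyone who wants to check for this hang without tying up a terminal, here is a minimal Python sketch that runs a command with a timeout. The function name probe and the 10-second default are illustrative choices, not anything from NVIDIA's tooling. It deliberately does not wait for the child after killing it, on the assumption that a process stuck inside the driver may not exit promptly.

```python
import subprocess


def probe(cmd=("nvidia-smi",), timeout_s=10.0):
    """Run cmd and return (status, output).

    status is one of "ok", "error", "timeout", or "missing".
    The name and defaults are hypothetical, chosen for this sketch.
    """
    try:
        proc = subprocess.Popen(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so error text is captured too
            text=True,
        )
    except FileNotFoundError:
        # The binary is not installed or not on PATH.
        return "missing", ""

    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in the kernel driver may ignore
        # the kill, so we do not wait for it to exit here.
        proc.kill()
        return "timeout", ""

    return ("ok" if proc.returncode == 0 else "error"), out
```

Calling probe() with no arguments runs nvidia-smi. A "timeout" status would correspond to the hang described above, and an "error" status returns whatever message the tool printed.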

We additionally ran two diagnostic commands. The first, nvidia-bug-report.sh, generated a long log file (there is too much information in it for us to understand).

We also ran dmesg | tail -n 10:

[   15.487615] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input27
[   15.487780] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input14
[   15.487872] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input15
[   15.487962] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input28
[   15.488220] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input16
[   18.121912] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   18.121939] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
[   22.457798] kauditd_printk_skb: 32 callbacks suppressed
[   22.457800] audit: type=1400 audit(1534714452.748:43): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/usr/share/cups/locale/" pid=1092 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   22.457807] audit: type=1400 audit(1534714452.748:44): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/usr/share/locale/" pid=1092 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
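
The last ten lines above only cover audio, network, and AppArmor messages, so they may simply miss anything the GPU driver logged. NVIDIA's kernel module prefixes its messages with NVRM, and GPU faults are reported as Xid events, so filtering the whole log for those markers is likely more informative than tail. A rough sketch (the helper names gpu_lines and read_dmesg are made up for illustration, and dmesg may need root on systems that restrict kernel log access):

```python
import re
import subprocess

# Markers the NVIDIA kernel module uses in kernel log messages.
GPU_PATTERN = re.compile(r"NVRM|Xid")


def gpu_lines(log_text):
    """Return only the log lines that mention the NVIDIA driver or an Xid event."""
    return [line for line in log_text.splitlines() if GPU_PATTERN.search(line)]


def read_dmesg():
    """Return the kernel log as text (hypothetical helper; may need root)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout
```

Printing "\n".join(gpu_lines(read_dmesg())) right after a hang should show whether the driver reported anything at the time. An empty result would suggest the relevant messages were rotated out of the kernel ring buffer or never logged.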

------ Update

After another restart on a different day, nvidia-smi seems to be working again. Now we can run tensorflow and pytorch normally, just as before! I guess this is not urgent anymore, but we are curious to know what went wrong and what we can do to prevent it!

I’ve been asking other people in the lab to start running their programs again to see if the machine will break again. I guess it would be nice if I could provide some diagnostic details here for you guys, and it would be nice if we could get some explanation. I’ll keep updating this issue if it breaks again.
