Nvidia-smi freezes/hangs indefinitely

We purchased a Lambda Quad earlier this summer. It had been working just fine for months, but yesterday it suddenly broke: when we type nvidia-smi, it hangs/freezes indefinitely! We also can't import tensorflow or torch (both also hang indefinitely).

------ Update

After waiting for a long while, we get this error message when typing nvidia-smi:

xxx@arthur2:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:05:00.0: Unknown Error
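The error message at least names the failing GPU by its PCI bus ID (0000:05:00.0). A small sketch of how one could pull that ID out of the message so it can be cross-referenced with lspci to find the physical card; the extraction is shown on the error string itself, and the lspci step is left as a comment since it only makes sense on the affected machine:

```shell
# Extract the PCI bus ID named in the nvidia-smi error so we can map it to a
# physical card/slot. The regex matches the domain:bus:device.function form.
err='Unable to determine the device handle for GPU 0000:05:00.0: Unknown Error'
bus_id=$(echo "$err" | grep -oE '[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9]')
echo "$bus_id"

# On the affected machine, this would show whether the device is still visible
# on the bus and what its link status is:
#   sudo lspci -s "$bus_id" -vv
```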

We additionally ran two diagnostic commands. The first, nvidia-bug-report.sh, generated a long log file (there is too much information in there for us to make sense of).

We also ran dmesg | tail -n 10:

[   15.487615] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input27
[   15.487780] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input14
[   15.487872] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input15
[   15.487962] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input28
[   15.488220] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input16
[   18.121912] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[   18.121939] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
[   22.457798] kauditd_printk_skb: 32 callbacks suppressed
[   22.457800] audit: type=1400 audit(1534714452.748:43): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/usr/share/cups/locale/" pid=1092 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   22.457807] audit: type=1400 audit(1534714452.748:44): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/usr/share/locale/" pid=1092 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
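The last ten dmesg lines above are all audio/network/apparmor noise. If it happens again, filtering the full kernel ring buffer for the NVIDIA driver's NVRM/Xid messages is usually more telling, since the Xid code identifies the failure class (for example, Xid 79 is "GPU has fallen off the bus", which commonly points at power delivery or a PCIe problem). A sketch of the filter, demonstrated against a sample line since the original log was lost at reboot:

```shell
# On the machine itself, before rebooting, the filter would be:
#   dmesg | grep -iE 'NVRM|Xid'
# Sample of the kind of line it surfaces (Xid 79 = GPU has fallen off the bus):
sample='[ 3042.112233] NVRM: Xid (PCI:0000:05:00): 79, GPU has fallen off the bus.'
echo "$sample" | grep -iE 'NVRM|Xid'
```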

------ Update

After another restart on a different day, nvidia-smi seems to be working again. Now we can run tensorflow and pytorch normally, just as before! I guess this is not urgent anymore, but we are curious to know what went wrong and what we can do to prevent it from happening again.

I've asked other people in the lab to start running their programs again to see whether the machine breaks once more. It would be nice if I could provide some useful diagnostic details here for you guys, and even nicer to get an explanation. I'll keep updating this issue if it breaks again.