We purchased a Lambda Quad earlier this summer. It had been working just fine for months, but yesterday it suddenly broke: when we type
nvidia-smi, it hangs/freezes indefinitely! We also can’t import
torch or tensorflow (both hang indefinitely as well).
After waiting a long while, we eventually get this error message:
xxx@arthur2:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:05:00.0: Unknown Error
We additionally ran two diagnostic commands. The first,
nvidia-bug-report.sh, generated a long log file (there is too much information in it for us to interpret).
We also ran
dmesg | tail -n 10
[ 15.487615] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input27
[ 15.487780] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input14
[ 15.487872] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input15
[ 15.487962] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input28
[ 15.488220] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input16
[ 18.121912] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 18.121939] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
[ 22.457798] kauditd_printk_skb: 32 callbacks suppressed
[ 22.457800] audit: type=1400 audit(1534714452.748:43): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/usr/share/cups/locale/" pid=1092 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 22.457807] audit: type=1400 audit(1534714452.748:44): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/usr/share/locale/" pid=1092 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
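In case it helps with triage: the last 10 lines above only show audio, network, and AppArmor noise. NVIDIA driver errors in the kernel log usually carry an "NVRM" prefix or an "Xid" code (e.g. "GPU has fallen off the bus"), so a targeted filter may be more useful than tail. A minimal sketch of what we could run (dmesg may require sudo on some kernels):

```shell
# Sketch: scan the kernel ring buffer for NVIDIA driver errors.
# "NVRM"/"Xid" lines typically accompany a GPU dropping off the bus.
dmesg 2>/dev/null | grep -iE 'NVRM|Xid' || echo "no NVIDIA driver errors found"
```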
After another restart on a different day,
nvidia-smi seems to be back to normal. We can now run tensorflow and pytorch just as before! I guess this is no longer urgent, but we are curious to know what went wrong and what we can do to prevent it.
I’ve been asking other people in the lab to start running their programs again to see whether the machine breaks again. It would be nice if we could provide some diagnostic details here and get an explanation. I’ll keep updating this issue if it breaks again.
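One idea for catching this early next time: since the failure mode was a hang rather than a clean error, any periodic health check needs a timeout around nvidia-smi. This is only a sketch of what we might set up as a cron job; the 10-second limit is an arbitrary choice on our part:

```shell
#!/bin/sh
# Sketch of a cron-able GPU health check. A wedged driver makes
# nvidia-smi hang instead of exiting with an error, so bound it
# with timeout(1) from coreutils.
if timeout 10 nvidia-smi > /dev/null 2>&1; then
    echo "GPU OK"
else
    # Nonzero exit: nvidia-smi failed, or timeout killed it after 10 s.
    echo "GPU check failed or hung -- save dmesg and run nvidia-bug-report.sh"
fi
```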