We purchased a Lambda Quad earlier this summer. It had been working just fine for months, but yesterday it suddenly broke: when we type
nvidia-smi, it hangs/freezes indefinitely! We also can’t import
torch or tensorflow (both hang indefinitely as well).
After waiting a long while, we eventually get this error message:
xxx@arthur2:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:05:00.0: Unknown Error
We additionally ran two diagnostic commands. The first,
nvidia-bug-report.sh, generated a long log file (there is too much information in it for us to interpret).
We also ran
dmesg | tail -n 10
[ 15.487615] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input27
[ 15.487780] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input14
[ 15.487872] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input15
[ 15.487962] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.1/sound/card4/input28
[ 15.488220] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.0/0000:07:00.0/0000:08:10.0/0000:09:00.1/sound/card2/input16
[ 18.121912] e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 18.121939] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
[ 22.457798] kauditd_printk_skb: 32 callbacks suppressed
[ 22.457800] audit: type=1400 audit(1534714452.748:43): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/usr/share/cups/locale/" pid=1092 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 22.457807] audit: type=1400 audit(1534714452.748:44): apparmor="DENIED" operation="open" profile="/usr/sbin/cups-browsed" name="/usr/share/locale/" pid=1092 comm="cups-browsed" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
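In case it helps with triage: the last 10 lines above only show audio, network, and AppArmor noise. NVIDIA driver errors in the kernel log usually carry an "NVRM" prefix or an "Xid" code (e.g. "GPU has fallen off the bus"), so a targeted filter may be more useful than tail. A minimal sketch of what we could run (dmesg may require sudo on some kernels):

```shell
# Sketch: scan the kernel ring buffer for NVIDIA driver errors.
# "NVRM"/"Xid" lines typically accompany a GPU dropping off the bus.
dmesg 2>/dev/null | grep -iE 'NVRM|Xid' || echo "no NVIDIA driver errors found"
```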
After another restart on a different day,
nvidia-smi seems to be back to normal. We can now run tensorflow and pytorch just as before! I guess this is no longer urgent, but we are curious to know what went wrong and what we can do to prevent it.
I’ve been asking other people in the lab to start running their programs again to see whether the machine breaks again. It would be nice if we could provide some diagnostic details here and get an explanation. I’ll keep updating this issue if it breaks again.
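One idea for catching this early next time: since the failure mode was a hang rather than a clean error, any periodic health check needs a timeout around nvidia-smi. This is only a sketch of what we might set up as a cron job; the 10-second limit is an arbitrary choice on our part:

```shell
#!/bin/sh
# Sketch of a cron-able GPU health check. A wedged driver makes
# nvidia-smi hang instead of exiting with an error, so bound it
# with timeout(1) from coreutils.
if timeout 10 nvidia-smi > /dev/null 2>&1; then
    echo "GPU OK"
else
    # Nonzero exit: nvidia-smi failed, or timeout killed it after 10 s.
    echo "GPU check failed or hung -- save dmesg and run nvidia-bug-report.sh"
fi
```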