Hi, I am using one of your Tensorbooks with an RTX 3080 laptop GPU and 16 cpus, running Ubuntu 20.04 for the past 1+ year. I am using it heavily, often leaving RL training jobs running overnight. It can get a bit hot to the touch, but nvidia-smi reports gpu temps (50-70 C) well below its warning threshold. My concern is that lately my software is frequently getting lots of segfaults (i.e. SIGSEGV) that I never used to see. It is not only in my RL models, which I am constantly modifying, but also in Tensorboard, installers and miniconda. This makes me wonder if it might be a hardware fault due to all the hard use.
Is there a toolset that I can use to do some hardware diagnostics, such as checking correct memory function? I don’t see anything on the boot menus.
A SIGSEGV (segfault) is raised when code attempts to access invalid or protected memory.
The reasons for this can be:
Software:
The application has a bug and is trying to allocate or use an incorrect memory address.
This can be due to the data the code uses, a corrupt library, or a bug in the code itself.
Hardware:
Commonly a memory error or CPU/cache error
The easiest way to find a hardware error is to look in dmesg or the kernel logs:
$ sudo dmesg | egrep "Hardware Error"
$ sudo zegrep "Hardware Error" /var/log/kern.log*
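If those two searches come back empty, a slightly wider net may be worth a try. The following is only a sketch: `filter_hw_errors` is a hypothetical helper name, and the pattern list (machine check, MCE and EDAC messages) is an assumption about what a failing CPU or memory module might log, not an exhaustive list. Note that dmesg only covers the current boot, while the rotated kern.log files cover earlier boots for as long as log rotation keeps them.

```shell
# Hypothetical helper (not part of any package): reads kernel log text on
# stdin and prints only the lines that look like hardware fault reports.
# The pattern list is an assumption; extend it as needed.
filter_hw_errors() {
    # -i: ignore case, -E: extended regex. "|| true" keeps the exit status
    # at 0 when nothing matches, so "no output" simply means "no hits".
    grep -iE 'hardware error|machine check|mce:|edac' || true
}

# Possible usage (left as comments because these need root and real logs):
#   sudo dmesg | filter_hw_errors
#   sudo zcat -f /var/log/kern.log* | filter_hw_errors
# zcat -f passes uncompressed files through unchanged, so one command
# covers both kern.log and the rotated kern.log.*.gz files.
```

No output from either command means none of these patterns were logged. It does not prove the hardware is healthy, so the memory test is still worth running.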
To test for memory errors, you can boot from Memtest86: burn the Memtest86 image to a USB drive, boot into it, and run the test.
Thanks @markd. The first two commands show nothing, but I’m wondering how long such error information sticks around. I will set up a Memtest86 drive and try that at my next boot opportunity.