Frequent segmentation faults

Hi, I have been using one of your Tensorbooks with an RTX 3080 laptop GPU and 16 CPUs, running Ubuntu 20.04, for over a year. I use it heavily, often leaving RL training jobs running overnight. It can get a bit hot to the touch, but nvidia-smi reports GPU temps (50-70 C) well below its warning threshold. My concern is that lately my software is frequently hitting segfaults (i.e. SIGSEGV) that I never used to see. They occur not only in my RL models, which I am constantly modifying, but also in Tensorboard, installers and miniconda. This makes me wonder if it might be a hardware fault due to all the hard use.

Is there a toolset that I can use to do some hardware diagnostics, such as checking correct memory function? I don’t see anything on the boot menus.

Thank you.

A SEGV, or segfault, is raised when code attempts to access invalid or protected memory.
The reasons for this can be:

Software:
  • The application has a bug and is trying to allocate/use an incorrect memory address
    • This can be due to the data the code uses, a corrupt library, or a bug in the code itself

Hardware:
  • Commonly a memory error or a CPU/cache error

The easiest way to find a hardware error is to look in dmesg or the kernel logs:
$ sudo dmesg | egrep "Hardware Error"
$ sudo zegrep "Hardware Error" /var/log/kern.log*
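
If you want to repeat that check across several log files at once, the two commands above can be wrapped in a small helper. This is only a rough sketch: the function name `scan_hw_errors` is made up for illustration, and the extra patterns beyond "Hardware Error" ("Machine check", "mce:") are my assumption about what machine-check reports typically contain, not an exhaustive list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count kernel-log lines that look like hardware
# error reports. The pattern list is an assumption, not a complete set.
scan_hw_errors() {
    local pattern='Hardware Error|Machine check|mce:'
    local f
    for f in "$@"; do
        # Skip files that do not exist or that we cannot read.
        [ -r "$f" ] || continue
        # zgrep handles both plain and gzip-compressed (rotated) logs;
        # -c prints the number of matching lines, -i ignores case.
        printf '%s: %s\n' "$f" "$(zgrep -E -i -c "$pattern" "$f")"
    done
}
```

Calling `scan_hw_errors /var/log/kern.log*` prints one `file: count` line per readable log (run it as root if the logs are not readable by your user). Keep in mind that dmesg only covers the current boot and that rotated kern.log files are eventually deleted, so an empty result does not rule out older errors.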

And to test for memory errors, you can boot from Memtest86.
To do that, you’ll want to burn the Memtest86 image to a USB drive, boot into it, and run the test.

Thanks @markd . The first two commands show nothing, but I’m wondering how long such error information sticks around. I will set up a memtest86 drive & try that at my next boot opportunity.
