Frequent segmentation faults

starkj · August 4, 2023, 2:48pm

Hi, I have been using one of your Tensorbooks with an RTX 3080 laptop GPU and 16 CPUs, running Ubuntu 20.04, for over a year. I use it heavily, often leaving RL training jobs running overnight. It can get a bit hot to the touch, but nvidia-smi reports GPU temps (50-70 C) well below its warning threshold. My concern is that lately my software is frequently hitting segfaults (i.e. SIGSEGV) that I never used to see. They occur not only in my RL models, which I am constantly modifying, but also in Tensorboard, installers and miniconda. This makes me wonder if it might be a hardware fault due to all the hard use.

Is there a toolset that I can use to do some hardware diagnostics, such as checking correct memory function? I don’t see anything on the boot menus.

Thank you.

markd · August 8, 2023, 10:00pm

A SEGV, or segfault, occurs when code attempts to access invalid or protected memory.
The reasons for this can be:

Software:
- The application has a bug and is trying to allocate or use an incorrect memory address.
- This can be due to the data the code uses, a corrupt library, or a bug in the code itself.

Hardware:
- Commonly a memory error or a CPU/cache error.

The easiest way to find a hardware error is to look in dmesg or the kernel logs:
$ sudo dmesg | egrep "Hardware Error"
$ sudo zegrep "Hardware Error" /var/log/kern.log*
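
As a rough sketch of how the search above could be widened, the helpers below filter log text for a few more strings that commonly accompany hardware faults, and separately for the per-process segfault records the kernel writes. The function names `scan_hw_errors` and `scan_segfaults` are hypothetical, and the pattern list is an assumption rather than a complete set. Keep in mind that the dmesg ring buffer is cleared on reboot, so the `kern.log*` files are the place to look for older events.

```shell
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical, and the patterns are an
# assumed, non-exhaustive list of strings seen in kernel hardware reports.

# Read log lines on stdin; print those that look hardware-related.
scan_hw_errors() {
    grep -E -i 'hardware error|machine check|mce:|edac'
}

# Read log lines on stdin; print the kernel's user-space segfault records,
# which name the crashing process and the binary or library involved.
scan_segfaults() {
    grep -E 'segfault at '
}

# Example usage (commented out because it needs root and real logs):
#   sudo dmesg | scan_hw_errors
#   sudo zcat -f /var/log/kern.log* | scan_hw_errors   # -f also passes plain files through
#   sudo dmesg | scan_segfaults
```

If the segfault records point at many unrelated binaries and libraries, that may lean toward the hardware causes listed above, while repeated crashes in one library would more likely suggest a software cause. This is only a heuristic, not a diagnosis.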

To test for memory errors, you can boot from Memtest86: burn the Memtest86 image to a USB drive, boot into it, and run the test.
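
One possible way to write the image from Ubuntu is with `dd`, sketched below. The helper name `write_usb_image` is hypothetical, and the image file name and `/dev/sdX` in the usage comment are placeholders, not real values: check the actual device with `lsblk` first, since `dd` overwrites whatever device it is given.

```shell
#!/usr/bin/env bash
# Sketch only: write_usb_image is a hypothetical helper. The caller supplies
# the image path and the USB device node; nothing here guesses them.

write_usb_image() {
    local image="$1" device="$2"

    # Refuse to continue unless the image exists as a regular file.
    if [ ! -f "$image" ]; then
        echo "image not found: $image" >&2
        return 1
    fi

    # Refuse to continue unless the target is a block device.
    if [ ! -b "$device" ]; then
        echo "not a block device: $device" >&2
        return 1
    fi

    # Write the image and flush it to the drive before dd returns.
    sudo dd if="$image" of="$device" bs=4M conv=fsync status=progress
}

# Example usage (placeholders, commented out because it is destructive):
#   lsblk                                   # identify the USB drive first
#   write_usb_image memtest86-usb.img /dev/sdX
```

The two checks only guard against obvious mistakes such as a mistyped path. They cannot tell a USB stick from an internal disk, so confirming the device name by hand is still necessary.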

starkj · August 14, 2023, 2:23am

Thanks @markd . The first two commands show nothing, but I’m wondering how long such error information sticks around. I will set up a memtest86 drive & try that at my next boot opportunity.
