We have several Tensorbooks running Ubuntu 22.04 or 23.04, all of which randomly become very slow for no reason that we can see.
By slow I mean:
GUI applications become visibly slow to update.
Typing in a shell becomes laggy and almost unusable.
The time it takes to compile our C++ project goes from less than 1 minute to double or triple that.
I’ve made a number of attempts to diagnose the issue but I’ve come up short on any solution.
I have noticed (based on info from the BTop utility) that the CPUs appear to be throttled way down (to around 400 MHz) even with the governor set to high performance.
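For anyone who wants to cross-check what BTop reports, here is a minimal sketch that reads each core's governor and current clock straight from the kernel's cpufreq files. It assumes the standard /sys/devices/system/cpu layout; the function name show_cpu_freq and its optional directory argument are made up for illustration and exist only so the sketch can be pointed at a test directory.

```shell
# Print the governor and current frequency (in MHz) for each CPU core.
# Assumption: standard Linux cpufreq sysfs layout. The optional first
# argument is a hypothetical override of the sysfs root.
show_cpu_freq() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip cores without readable cpufreq data (or an unmatched glob)
        [ -r "$dir/scaling_cur_freq" ] || continue
        cpu=$(basename "$(dirname "$dir")")
        khz=$(cat "$dir/scaling_cur_freq" 2>/dev/null) || continue
        gov=$(cat "$dir/scaling_governor" 2>/dev/null || echo unknown)
        # scaling_cur_freq is reported in kHz
        echo "$cpu $gov $((khz / 1000)) MHz"
    done
}

show_cpu_freq
```

If every core sits near 400 MHz while the governor column still reads performance, that would match what BTop is showing.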
We would greatly appreciate any hints or help on resolving the issue.
No, I have talked with them about replacement hardware.
I will say that they seem to be very reluctant to support our unit that has Ubuntu 23.04 installed; they only support their own configuration for Ubuntu 22.04, which is understandable.
I’m seeing the same issue. Slowdown happens reliably, usually within a couple of hours. Sometimes it can recover, and other times I just need to reboot between training runs. Training runs over 1 hour will hit this problem 50% of the time. It almost seems to happen more easily when matplotlib is being used to render something, but I can’t say this is the main culprit. It can happen completely out of the blue. It happens a little bit less when the laptop is being actively used; it usually becomes unusable after it sits idle for a bit. Both CPU and GPU show very low utilization when this happens. Temperatures are all in check. Fans are not running.
I wonder if someone from Lambda can tell us which kernel version we should be running to avoid this problem? I don’t know if it’s an NVIDIA / CUDA problem or a kernel issue, but it would be nice to figure out how to get around it. As it stands right now the laptop has to be restarted every ~30 minutes, which makes it unusable for running any small-scale experiments that take longer than that or for doing any large-ish dataset pre-processing.
@ryans thanks for the tip! I’ll give it a shot once it gets into a weird state again. I have suspected it has something to do with thermal throttling. nvitop shows GPU temp at 55-60 C and CPU temps were even lower than the GPU, but the top left area above the keyboard, right under the screen, did feel pretty hot to the touch. I even went as far as trying to install some kind of third-party fan control to keep the fans running more aggressively, but with no luck.
My laptop sits on a cooling pad in a fairly cold-ish space, so I’ll confirm the issue with btop first, then try disabling thermald. Thanks again for the suggestion.
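As a rough cross-check on the temperature readings, a sketch like the one below lists whatever sensors the kernel exposes under /sys/class/thermal. It assumes the standard thermal_zone*/type and temp files, with temp in millidegrees Celsius; show_thermal_zones and its optional directory argument are hypothetical names, not part of any tool mentioned here.

```shell
# List each kernel thermal zone with its type and temperature in Celsius.
# Assumption: standard Linux thermal sysfs layout. The optional first
# argument is a hypothetical override of the sysfs root.
show_thermal_zones() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # Some sensors refuse reads while their device is idle; skip those
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        # temp is reported in millidegrees Celsius
        echo "$(basename "$zone") $ztype $((milli / 1000)) C"
    done
}

show_thermal_zones
```

Which zones show up depends on the machine, so a hot spot on the chassis may or may not have a matching entry.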
Just following up here, I can confirm that stopping thermald via sudo systemctl stop thermald prevents this issue from happening. The laptop has been up for over 24 hours, and has done several ~1.5 hr long training runs without encountering the slowdown.
Based on the thread @ryans linked to, this seems to be tied to some recent kernel updates and thermald specifically. There are also alternative solutions listed in there, such as the Razer control tool on GitHub (Razer-Linux/razer-laptop-control-no-dkms), which can change the CPU performance setting to a value that seems to eliminate the problem even if thermald is running. I haven’t tried that one; running without thermald for now seems to be fine and has no negative effect on fans or cooling.
Same, day 2 without restart and not encountering any slowdown. Thanks again for pointing out the solution! Incredible how much this speeds up the workflow after dealing with this problem for probably months.
Oh man. Thanks so much for figuring this out guys! I’ve been dealing with this issue for quite a while and it’s been incredibly annoying. I went ahead and disabled the thermald service as well with sudo systemctl disable thermald so it doesn’t come back up on reboot.
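To collect the workaround from this thread in one place, here is a small sketch that wraps the systemctl calls. The function names and the SUDO and SYSTEMCTL variables are hypothetical conveniences (they only allow a dry run); the systemctl subcommands themselves are the standard ones.

```shell
# Hypothetical overrides for dry runs; by default the real commands are used.
SUDO="${SUDO-sudo}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Stop thermald now and keep it from starting at boot.
thermald_off() {
    # --now stops the running service in addition to disabling it
    $SUDO $SYSTEMCTL disable --now thermald
}

# Undo the workaround: enable thermald at boot and start it now.
thermald_on() {
    $SUDO $SYSTEMCTL enable --now thermald
}

# Report the current state of the service.
thermald_state() {
    # is-enabled and is-active exit non-zero for "disabled" and "inactive",
    # so the exit status is ignored and only the printed state is kept
    enabled=$($SYSTEMCTL is-enabled thermald 2>/dev/null || true)
    active=$($SYSTEMCTL is-active thermald 2>/dev/null || true)
    echo "thermald: enabled=$enabled active=$active"
}
```

After sourcing this, thermald_off applies the workaround, thermald_state shows whether it took effect, and thermald_on puts things back if a later kernel or thermald update fixes the underlying issue.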