I have a few 4-GPU Vectors and am having sporadic problems with the SSDs overheating and crashing the system. Has anyone else experienced this, and if so, do you have any fixes? For example, a preferred aftermarket heat shield or an extra fan?
How did you trace the crashing to overheating SSDs?
Through a combination of looking at logs and monitoring via munin.
I have seen reviews showing that the Sabrent heatsink gives a decent improvement in cooling and performance, but it is atypical to see crashes caused by the NVMe drive, especially on multiple Vectors. It would be quite helpful, both to us at Lambda and to other users, if you could provide some of the logs or data that point to the drives overheating on multiple machines.
Also, are these machines plugged directly into the wall or are they perhaps plugged into a power strip/surge protector/UPS device? We commonly see crashes due to these peripheral devices.
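For collecting the temperature data mentioned above, here is a minimal Python sketch. It assumes a Linux kernel with NVMe hwmon support (5.5 or later), where each drive appears under /sys/class/hwmon with a name of "nvme" and a temp1_input value in millidegrees Celsius. The function name read_nvme_temps is just an illustrative choice, and I have not tested this on a Vector specifically.

```python
from pathlib import Path


def read_nvme_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon_entry_name: temperature_in_celsius} for NVMe sensors.

    Assumption: the kernel's NVMe hwmon support is enabled, so each drive
    exposes a 'name' file containing 'nvme' and a 'temp1_input' file
    holding the composite temperature in millidegrees Celsius.
    """
    temps = {}
    root = Path(hwmon_root)
    if not root.is_dir():
        return temps
    for hwmon in sorted(root.iterdir()):
        try:
            # Skip sensors that are not NVMe drives (CPU, GPU, etc.).
            if (hwmon / "name").read_text().strip() != "nvme":
                continue
            millidegrees = int((hwmon / "temp1_input").read_text().strip())
        except (OSError, ValueError):
            # Missing or unreadable sensor files: ignore this entry.
            continue
        temps[hwmon.name] = millidegrees / 1000.0
    return temps


if __name__ == "__main__":
    import time

    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for sensor, celsius in read_nvme_temps().items():
        print(f"{stamp} {sensor} {celsius:.1f} C")
```

Running this from cron every minute or so and appending the output to a file on a different disk would leave a temperature trail that survives a crash, which can then be lined up against the crash times.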
The machines are plugged directly into the wall. It’s a little difficult to provide the data points, as the logs are now gone. journalctl showed an SSD overheat warning just before the crashes occurred. Do you know which Sabrent heatsink in particular is good for cooling in a Vector?
Unfortunately, I don’t have the review on hand, so I don’t remember which heatsink model it was. I also haven’t tried it, or had a customer try it, in a Vector, so I can’t speak to how easy it is to install there. If you try it, please update us on whether it works, as that will be helpful for future customers who run into a similar error.
I think I found the review. Do you have a sense of what height I can sustain for the heatsink under the GPUs?
Oh, if the NVMe drive in question is under the GPUs, then the Sabrent will not fit, unfortunately. For NVMe drives under the GPUs, you’ll need a flat heatsink.
Are there other slots on the motherboard that aren’t underneath the GPU? I could try moving it.
What motherboard is it? WRX80 or TRX40?
I believe it’s a TRX40.