Getting error "Unable to determine the device handle for GPU0000:3E:00.0: Unknown Error"

Hi, I’m getting this GPU error while training. Rebooting the system makes the GPUs work again, but the error comes back after a seemingly random stretch of training (about a day or so). This is the log produced by running nvidia-bug-report.sh.

Error message: “Unable to determine the device handle for GPU0000:3E:00.0: Unknown Error”
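In case it helps pin down when the failure actually happens, here is a minimal sketch of a watchdog that could run alongside training. It assumes the message above is what `nvidia-smi` prints once the GPU becomes unreachable; the names `find_handle_errors` and `poll`, the 60-second interval, and the tolerance for an optional space after "GPU" are my own choices for illustration, not anything taken from the driver.

```python
import re
import subprocess
import time
from datetime import datetime

# Pattern built from the error text quoted in this post. The optional
# whitespace after "GPU" is an assumption to tolerate formatting differences.
ERROR_RE = re.compile(
    r"Unable to determine the device handle for GPU\s*"
    r"([0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]):\s*(.+)"
)


def find_handle_errors(text):
    """Return (pci_bus_id, reason) pairs for every matching error line."""
    return [(m.group(1), m.group(2).strip()) for m in ERROR_RE.finditer(text)]


def poll(interval_s=60):
    """Run nvidia-smi periodically and print a timestamp when the error appears."""
    while True:
        proc = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        # Check both streams, since it is not assumed which one carries the message.
        for bus_id, reason in find_handle_errors(proc.stdout + proc.stderr):
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} GPU {bus_id}: {reason}", flush=True)
        time.sleep(interval_s)
```

Calling `poll()` in a separate terminal and redirecting its output to a file would give a timestamp for the first failure, which can then be compared against the kernel log and the nvidia-bug-report.sh output.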