Random runtime CUDA error on Lambda machine

I am training EfficientDet (automl/efficientdet at master · google/automl · GitHub) with small modifications. I get a CUDA runtime error at random on a Lambda machine, sometimes after a few hours of training and sometimes after a few days. The training runs on a dedicated GPU, with no interference from other users or scripts. Typical error messages are attached at the end.
I am using:

  1. Python 3.6.12
  2. TensorFlow 2.3.0
  3. Anaconda 4.13.0
  4. cuDNN 7.6.5 (obtained by 'conda list cudnn')
  5. CUDA 10.1.243 (obtained by 'conda list cudatoolkit'), 11.6 (obtained by 'nvidia-smi'), 11.1.105 (obtained by 'nvcc --version'). I don't know which of these is the version actually in use.
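
The three CUDA numbers above come from three different layers: 'nvidia-smi' reports the highest CUDA version the installed driver supports, 'nvcc --version' reports the system-wide toolkit, and 'conda list cudatoolkit' reports the runtime installed in the conda environment. A minimal sketch for gathering them in one place, together with the versions TensorFlow says it was built against, might look like the following. The helper names `run_or_none` and `collect_cuda_versions` are made up for illustration, and the build-info keys are read with `.get()` because I am assuming, not certain, that they are present in every TensorFlow build.

```python
import shutil
import subprocess


def run_or_none(cmd):
    """Return the command's output as text, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,  # text mode; works on Python 3.6
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip()


def collect_cuda_versions():
    """Gather version info from the driver, the system toolkit, and TensorFlow."""
    versions = {
        # Driver version; the "CUDA Version" in the nvidia-smi banner is the
        # maximum the driver supports, not the toolkit TensorFlow loads.
        "driver": run_or_none(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
        # System-wide toolkit found on PATH.
        "nvcc": run_or_none(["nvcc", "--version"]),
        # Filled in below if TensorFlow imports and exposes build info.
        "tensorflow_build": None,
    }
    try:
        import tensorflow as tf

        info = tf.sysconfig.get_build_info()
        versions["tensorflow_build"] = {
            "cuda": info.get("cuda_version"),    # assumed key name
            "cudnn": info.get("cudnn_version"),  # assumed key name
        }
    except Exception:
        # TensorFlow missing, or too old to provide get_build_info().
        pass
    return versions


if __name__ == "__main__":
    for name, value in collect_cuda_versions().items():
        print(name, "->", value)
```

Run inside the same conda environment as the training job, this prints whatever each layer reports, or None for any tool that is not available.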

Let me know if you need any other information. Thanks very much for any feedback.

Error message:

2023-01-04 21:25:37.791416: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:647] Non-OK-status: GpuLaunchKernel(BlockReduceKernel<IN_T, OUT_T, num_threads, Op>, num_blocks, num_threads, 0, cu_stream, in, out, in_size, op, init) status: Internal: an illegal memory access was encountered
2023-01-04 21:25:37.791458: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-01-04 21:25:37.791519: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:220] Unexpected Event status: 1
Fatal Python error: Aborted

Thread 0x00007f69a26f3700 (most recent call first):
File "/home/zj/miniconda3/envs/distraction_env_tf23/lib/python3.6/threading.py", line 295 in wait
File "/home/zj/miniconda3/envs/distraction_env_tf23/lib/python3.6/site-packages/tensorflow/python/summary/writer/event_file_writer.py", line 266 in get