CUDA_ERROR_OUT_OF_MEMORY on Tensorbook

Hi,

I just got a Tensorbook and I’m trying it out with Mask_RCNN (matterport/Mask_RCNN on GitHub: Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow). The COCO datasets are installed on the Tensorbook as well. I’m getting the following error when running a Jupyter notebook (e.g. samples/coco/inspect_model.ipynb). FYI, the full output is much longer, but it’s the same messages repeating:

2018-10-12 13:13:44.563742: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 262144 totalling 256.0KiB
2018-10-12 13:13:44.563745: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 1327104 totalling 1.27MiB
2018-10-12 13:13:44.563749: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 2097152 totalling 2.00MiB
2018-10-12 13:13:44.563753: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 69088512 totalling 65.89MiB
2018-10-12 13:13:44.563756: I tensorflow/core/common_runtime/bfc_allocator.cc:658] Sum Total of in-use chunks: 69.81MiB
2018-10-12 13:13:44.563761: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Stats:
Limit: 81330176
InUse: 73197312
MaxInUse: 73197312
NumAllocs: 181
MaxAllocSize: 69088512

2018-10-12 13:13:44.563773: W tensorflow/core/common_runtime/bfc_allocator.cc:275] ****************************************************************************xxxxxxxxxxxxxxxxxxxxxxxx
2018-10-12 13:13:44.563786: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at assign_op.h:117 : Resource exhausted: OOM when allocating tensor with shape[256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2018-10-12 13:13:44.564202: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 7.76M (8132864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2018-10-12 13:13:44.564591: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 7.76M (8132864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Any ideas?

Thanks,
-Skip

Sounds like you’re allocating too much memory. Try reducing your batch size.
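Before changing the model, it can help to read the allocator stats in the log above. The short Python sketch below only redoes the arithmetic on numbers copied from that log; the variable names are made up for illustration. It shows that the allocator's limit was about 78 MiB and that the failed request is exactly the gap between Limit and InUse.

```python
# Values copied from the "Stats:" block and the cuda_driver error in the log.
limit_bytes = 81330176        # "Limit"
in_use_bytes = 73197312       # "InUse"
failed_alloc_bytes = 8132864  # "failed to allocate 7.76M (8132864 bytes)"

MIB = 1024 ** 2


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


# Memory the allocator could still hand out before hitting its limit.
headroom_bytes = limit_bytes - in_use_bytes

print(f"Limit:    {to_mib(limit_bytes):.2f} MiB")
print(f"InUse:    {to_mib(in_use_bytes):.2f} MiB")
print(f"Headroom: {to_mib(headroom_bytes):.2f} MiB")
print("Failed request equals headroom:", failed_alloc_bytes == headroom_bytes)
```

A limit this small is unusual for a laptop GPU. One possible reading, which the thread does not confirm, is that another process (for example a second notebook kernel) was already holding most of the GPU memory when this session started. Running nvidia-smi while the notebook is open would show whether that is the case.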

Thanks.

Reducing the batch size (from 2 to 1) didn’t work, but switching the backbone network from resnet101 to resnet150 did.

After the fact, I found the authors’ wiki, where they recommend using a smaller backbone network.

-Skip

I mean resnet50, of course
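For anyone landing here later, the two changes described in this thread (batch size 1 and a resnet50 backbone) could be expressed roughly as below. This is a minimal sketch, not the poster's actual code: the class name LowMemoryConfig and the NAME value are made up, and the stand-in Config class exists only so the snippet runs without Mask_RCNN installed. In the real notebook you would more likely subclass the config it already defines, so that COCO-specific settings such as the number of classes are kept.

```python
try:
    # Base configuration class from matterport/Mask_RCNN.
    from mrcnn.config import Config
except ImportError:
    class Config:
        """Stand-in used only when Mask_RCNN is not installed.

        It mirrors a few of the library's defaults for illustration.
        """
        NAME = None
        GPU_COUNT = 1
        IMAGES_PER_GPU = 2
        BACKBONE = "resnet101"


class LowMemoryConfig(Config):
    """Hypothetical config with the two changes discussed in this thread."""
    NAME = "coco_low_memory"  # made-up name for this sketch
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1        # batch size = GPU_COUNT * IMAGES_PER_GPU = 1
    BACKBONE = "resnet50"     # smaller backbone than the default resnet101


batch_size = LowMemoryConfig.GPU_COUNT * LowMemoryConfig.IMAGES_PER_GPU
print("Backbone:", LowMemoryConfig.BACKBONE)
print("Batch size:", batch_size)
```

Whether the pre-trained weights you load match the smaller backbone is a separate question that this sketch does not address.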

Yes, sounds like it was an OOM error with ResNet-101. Glad you fixed it.