Please share the following so we may better assist you:
-
Screen shot of your error
image|690x387 -
File “train.py”, line 616, in
train(hyp, opt, device, tb_writer)
File “train.py”, line 372, in train
scaler.scale(loss).backward()
File “/usr/local/lib/python3.7/dist-packages/torch/_tensor.py”, line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File “/usr/local/lib/python3.7/dist-packages/torch/autograd/init.py”, line 175, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA out of memory. Tried to allocate 4.28 GiB (GPU 0; 14.76 GiB total capacity; 4.28 GiB already allocated; 4.28 GiB free; 9.15 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF -
Still finding a way to debug. Pls, any expert can help me to solve this issue. Strange is when I edit the batch number from 16 to 12 then it can train as well.