Problems training EfficientDet

I'm trying to train EfficientDet on the CBIS-DDSM mammogram dataset. The Roboflow dataset I created has 3.4k training images, 75 validation images, and 300 test images, resized to 512x512 from the native dimensions. When I adapted the Roboflow-EfficientDet-v2 Colab notebook to this task with the default parameters, the ONNX save failed after each epoch and I ran out of RAM after 23 epochs. I tried reducing the batch size to 4 and then 2, but in each case I now get DataLoader worker errors during the first epoch:

Epoch: 1/100. Iteration: 447/861. Cls loss: 0.54293. Reg loss: 0.67819. Batch loss: 1.22112 Total loss: 1.51785
52% 447/861 [01:47<01:34, 4.37it/s]

error Traceback (most recent call last)
<ipython-input-...> in <module>()

/content/drive/MyDrive/Colab_Notebooks/Medicine/ObjectDetection/EfficientDet/Monk_Object_Detection/4_efficientdet/lib/train_detector.py in Train(self, num_epochs, model_output_dir)
258 epoch_loss = []
259 progress_bar = tqdm(self.system_dict["local"]["training_generator"])
--> 260 for iter, data in enumerate(progress_bar):
261 try:
262 self.system_dict["local"]["optimizer"].zero_grad()

5 frames
/usr/local/lib/python3.7/dist-packages/torch/_utils.py in reraise(self)
459 # instantiate since we don't know how to
460 raise RuntimeError(msg) from None
--> 461 raise exception
462
463

error: Caught error in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "Monk_Object_Detection/4_efficientdet/lib/src/dataset.py", line 47, in __getitem__
img = self.load_image(idx)
File "Monk_Object_Detection/4_efficientdet/lib/src/dataset.py", line 58, in load_image
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
cv2.error: OpenCV(4.6.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'
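
For anyone hitting the same assertion: it fires because cv2.imread returned None (a missing or unreadable file), so cvtColor is handed an empty image. A quick sketch to flag such files before training; the folder path and extension filter are just assumptions about the Roboflow export layout:

```python
import os
import cv2

# Assumed location of the exported training images -- adjust to your layout.
image_dir = "/content/dataset/train"

bad_files = []
for name in sorted(os.listdir(image_dir)):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    img = cv2.imread(os.path.join(image_dir, name))
    if img is None:  # imread returns None instead of raising on unreadable files
        bad_files.append(name)

print(f"{len(bad_files)} unreadable image(s) found")
for name in bad_files:
    print(" ", name)
```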

Separately, I tried the Roboflow-TensorFlow2-Object-Detection.ipynb Colab notebook on the same dataset with EfficientDet-D0 and ran aground at this step:

!python /content/models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path={pipeline_file} \
    --model_dir={model_dir} \
    --alsologtostderr \
    --num_train_steps={num_steps} \
    --sample_1_of_n_eval_examples=1 \
    --num_eval_steps={num_eval_steps}

which terminates with:

Node: 'EfficientDet-D0/model/stem_conv2d/Conv2D' DNN library is not found. [[{{node EfficientDet-D0/model/stem_conv2d/Conv2D}}]] [Op:__inference__dummy_computation_fn_32318]
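
As far as I can tell, "DNN library is not found" on Colab usually means the cuDNN/CUDA versions TensorFlow was compiled against don't match what the runtime actually has loaded (a common side effect of the OD API install pulling in a different TF build). A small diagnostic cell to confirm what is in play; nothing here is specific to this notebook:

```python
import tensorflow as tf

# Versions this TF build expects vs. the GPUs the runtime exposes.
build = tf.sysconfig.get_build_info()
print("TF version:     ", tf.__version__)
print("Built for CUDA: ", build.get("cuda_version"))
print("Built for cuDNN:", build.get("cudnn_version"))
print("Visible GPUs:   ", tf.config.list_physical_devices("GPU"))
```

Comparing that against !nvidia-smi should show whether the runtime and the TF build disagree.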

Suggestions appreciated!

Update: I was able to solve the DataLoader worker problem, but the ONNX saves still fail and memory still runs out:

Epoch: 1/100. Iteration: 1722/1722. Cls loss: 0.38577. Reg loss: 0.36669. Batch loss: 0.75246 Total loss: 1.37674
100% 1722/1722 [05:30<00:00, 6.43it/s]

Monk_Object_Detection/4_efficientdet/lib/src/model.py:251: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
if len(inputs) == 2:
faild onnx export
Monk_Object_Detection/4_efficientdet/lib/src/utils.py:84: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
image_shape = np.array(image_shape)
Monk_Object_Detection/4_efficientdet/lib/src/utils.py:96: TracerWarning: torch.from_numpy results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
anchors = torch.from_numpy(all_anchors.astype(np.float32))
Epoch: 2/100. Iteration: 330/1722. Cls loss: 0.82499. Reg loss: 0.78566. Batch loss: 1.61065 Total loss: 1.19205
19% 330/1722 [01:02<04:24, 5.27it/s]
Epoch: 2/100. Iteration: 1722/1722. Cls loss: 0.56882. Reg loss: 0.57639. Batch loss: 1.14521 Total loss: 1.11802
100% 1722/1722 [05:24<00:00, 6.36it/s]


faild onnx export
Epoch: 32/100. Iteration: 1722/1722. Cls loss: 0.10137. Reg loss: 0.42984. Batch loss: 0.53121 Total loss: 0.30382
100% 1722/1722 [05:28<00:00, 6.20it/s]

faild onnx export
100% 1722/1722 [02:26<00:00, 11.77it/s]
CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 14.30 GiB already allocated; 16.75 MiB free; 14.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 14.30 GiB already allocated; 16.75 MiB free; 14.48 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
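
Following the hint in the OOM message itself, the next thing I plan to try is setting the allocator config before any CUDA work is done and clearing the cache between epochs; the 128 MB split size is just a starting guess, not something from the notebook:

```python
import os

# Must be set before the first CUDA allocation, i.e. at the top of the notebook.
# PYTORCH_CUDA_ALLOC_CONF is the documented knob the error message refers to;
# 128 is an arbitrary value to tune, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch
# Releasing cached-but-unused blocks can also buy back some headroom between epochs.
torch.cuda.empty_cache()
```

If that doesn't help, dropping the batch size or image size further, or skipping the per-epoch ONNX export, are the other knobs I can see. Suggestions still welcome.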