Hey @Ford,
Ok, great! All of these notes pertain to the develop branch of RF-DETR, not necessarily the official release.
Issue 1:
The functionality to perform a test run once the model finishes training has been added to the develop branch. I’m not sure if this was intentional, but implementing this feature introduced a requirement for a test set even when run_test is set to False. The error below illustrates the issue:
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/test/_annotations.coco.json'
This seems to occur because dataset_test is built and data_loader_test is created regardless of the run_test flag’s value. According to the traceback, the failure happens at line 196 of main.py (see below):
dataset_test = build_dataset(image_set='test', args=args, resolution=args.resolution)
Also, if run_test is going to default to True, I believe there should be more graceful handling to inform the user that they need to either provide a test set or set run_test to False.
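For example, something along these lines in main.py would both respect the flag and fail more helpfully. This is just a rough sketch: dataset_dir is a stand-in for however the dataset root is actually exposed on args.

import os

if args.run_test:
    ann_file = os.path.join(args.dataset_dir, 'test', '_annotations.coco.json')
    if not os.path.exists(ann_file):
        # Fail early with an actionable message instead of a bare FileNotFoundError.
        raise FileNotFoundError(
            f"run_test is enabled but no test annotations were found at {ann_file}. "
            "Either provide a test split or set run_test=False."
        )
    # Only build the test dataset/loader when a test run was actually requested.
    dataset_test = build_dataset(image_set='test', args=args, resolution=args.resolution)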
Issue 2:
There are two parts to this issue. I am working with a dataset that has some class imbalance, so I have been experimenting with different techniques to account for it. One adjustment I’ve tried is using the varifocal loss function rather than the ia_bce loss function:
model.train(use_varifocal_loss=True, ia_bce_loss=False, ...)
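(For context, a typical sigmoid varifocal loss keeps the IoU value as the target for positives and focally down-weights negatives. The sketch below shows the general technique with made-up defaults, not RF-DETR’s exact implementation:)

import torch
import torch.nn.functional as F

def varifocal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets: 0 for negatives, the IoU-aware quality score q for positives.
    prob = logits.sigmoid()
    # Positives are weighted by q itself; negatives by a focal term on |p - q|.
    focal_weight = targets * (targets > 0).float() + \
        (1 - alpha) * (prob - targets).abs().pow(gamma) * (targets <= 0).float()
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    return (bce * focal_weight).sum()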
The first error that I ran into when trying to use the varifocal loss function was…
RuntimeError: Index put requires the source and destination dtypes match, got BFloat16 for the destination and Float for the source.
This was caused by a data-type mismatch between pos_ious (Float) and cls_iou_targets (BFloat16) at line 351 of lwdetr.py. The error does not occur on CPU, so I assume it stems from the mixed-precision handling, though I haven’t looked into it too deeply. Interestingly, cls_iou_targets is initialized as Float on CPU but as BFloat16 on GPU, while pos_ious is always Float. I was able to fix the issue by adding…
pos_ious = pos_ious.to(cls_iou_targets.dtype)
…right before line 351, but there may be a better way to fix it, such as initializing pos_ious with src_logits.dtype, similar to how cls_iou_targets is handled.
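For illustration, the underlying behavior is easy to reproduce in standalone PyTorch; this is just the index-put dtype rule, not RF-DETR code:

import torch

cls_iou_targets = torch.zeros(4, dtype=torch.bfloat16)  # as it ends up on GPU under autocast
pos_ious = torch.rand(2, dtype=torch.float32)           # always full precision

# cls_iou_targets[torch.tensor([0, 2])] = pos_ious      # RuntimeError: Index put requires the source and destination dtypes match
cls_iou_targets[torch.tensor([0, 2])] = pos_ious.to(cls_iou_targets.dtype)  # the cast resolves it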
The second error that I ran into when trying to use the varifocal loss function was…
File "line 543, in sigmoid_varifocal_loss
(1 - alpha) * (prob - targets).abs().pow(gamma) * \
~~~~~^~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (4) at non-singleton dimension 2
I did some digging and found that this error is caused by how num_classes is handled in the SetCriterion class in lwdetr.py. I have 2 classes, but because I’m using the Roboflow flavor of the COCO JSON format, which adds a super-category buffer class, the detection head is reinitialized with 3 classes. That num_classes value is then used to initialize SetCriterion at line 693 of lwdetr.py, where, for some reason, 1 is added to it, making it 4. Later, at line 346, cls_iou_targets is initialized with this incorrect value in its shape; it is then passed as the targets argument to sigmoid_varifocal_loss at line 512, which causes the size mismatch between prob and targets. I fixed the issue by not adding 1 to num_classes at line 693…
criterion = SetCriterion(args.num_classes,...)
rather than
criterion = SetCriterion(args.num_classes + 1,...)
The code ran fine after that, and the loss function seems to be working correctly. However, there may be a better fix, or I could be missing something, since I didn’t fully understand the original intent behind adding 1 to num_classes (my guess is it’s a holdover from DETR variants that reserve an extra no-object/background class).
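In case it helps, the size mismatch itself is also easy to reproduce in isolation (toy tensors, not RF-DETR code):

import torch

num_classes = 3                                   # 2 real classes + the Roboflow buffer class
prob = torch.randn(2, 5, num_classes).sigmoid()   # head predicts over 3 classes
targets = torch.zeros(2, 5, num_classes + 1)      # built with the +1'd value, so last dim is 4

# (prob - targets)                                # RuntimeError: tensor a (3) must match tensor b (4) at dim 2
targets = torch.zeros(2, 5, num_classes)          # with the fix, the shapes line up
focal_term = (prob - targets).abs().pow(2.0)      # the varifocal term now computes without error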
This was very convoluted, so if you would like, I’d be more than happy to set up a call to discuss these issues. Hope this helps.
Best,
Rhys