RF-DETR ValueError: matrix contains invalid numeric entries

Hi there!

Running the segmentation preview of RF-DETR seems to be causing a ValueError: matrix contains invalid numeric entries.

This is the error output:


ValueError Traceback (most recent call last)
Cell In[5], line 1
----> 1 model.train(
2 dataset_dir=str(Path(full_seg_dataset_name)),
3 epochs=30,
4 batch_size=1,
5 grad_accum_steps=4,
6 lr=0.01,
7 imgsz=960, # multiple of 24 for segmentation, multiple of 32* for detection
8 output_dir=str(Path("runs/full_rfdetr"))
9 )

File ~\AppData\Roaming\Python\Python313\site-packages\rfdetr\detr.py:83, in RFDETR.train(self, **kwargs)
79 """
80 Train an RF-DETR model.
81 """
82 config = self.get_train_config(**kwargs)
---> 83 self.train_from_config(config, **kwargs)

File ~\AppData\Roaming\Python\Python313\site-packages\rfdetr\detr.py:191, in RFDETR.train_from_config(self, config, **kwargs)
182 early_stopping_callback = EarlyStoppingCallback(
183 model=self.model,
184 patience=config.early_stopping_patience,
(…) 187 segmentation_head=config.segmentation_head
188 )
189 self.callbacks["on_fit_epoch_end"].append(early_stopping_callback.update)
--> 191 self.model.train(
192 **all_kwargs,
193 callbacks=self.callbacks,
194 )

File ~\AppData\Roaming\Python\Python313\site-packages\rfdetr\main.py:341, in Model.train(self, callbacks, **kwargs)
339 model.train()
340 criterion.train()
--> 341 train_stats = train_one_epoch(
342 model, criterion, lr_scheduler, data_loader_train, optimizer, device, epoch,
343 effective_batch_size, args.clip_max_norm, ema_m=self.ema_m, schedules=schedules,
344 num_training_steps_per_epoch=num_training_steps_per_epoch,
345 vit_encoder_num_layers=args.vit_encoder_num_layers, args=args, callbacks=callbacks)
346 train_epoch_time = time.time() - epoch_start_time
347 train_epoch_time_str = str(datetime.timedelta(seconds=int(train_epoch_time)))

File ~\AppData\Roaming\Python\Python313\site-packages\rfdetr\engine.py:130, in train_one_epoch(model, criterion, lr_scheduler, data_loader, optimizer, device, epoch, batch_size, max_norm, ema_m, schedules, num_training_steps_per_epoch, vit_encoder_num_layers, args, callbacks)
128 with autocast(**get_autocast_args(args)):
129 outputs = model(new_samples, new_targets)
--> 130 loss_dict = criterion(outputs, new_targets)
131 weight_dict = criterion.weight_dict
132 losses = sum(
133 (1 / args.grad_accum_steps) * loss_dict[k] * weight_dict[k]
134 for k in loss_dict.keys()
135 if k in weight_dict
136 )

File ~\AppData\Roaming\Python\Python313\site-packages\torch\nn\modules\module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs)
1749 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1750 else:
-> 1751 return self._call_impl(*args, **kwargs)

File ~\AppData\Roaming\Python\Python313\site-packages\torch\nn\modules\module.py:1762, in Module._call_impl(self, *args, **kwargs)
1757 # If we don't have any hooks, we want to skip the rest of the logic in
1758 # this function, and just call forward.
1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1760 or _global_backward_pre_hooks or _global_backward_hooks
1761 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1762 return forward_call(*args, **kwargs)
1764 result = None
1765 called_always_called_hooks = set()

File ~\AppData\Roaming\Python\Python313\site-packages\rfdetr\models\lwdetr.py:537, in SetCriterion.forward(self, outputs, targets)
534 outputs_without_aux = {k: v for k, v in outputs.items() if k != 'aux_outputs'}
536 # Retrieve the matching between the outputs of the last layer and the targets
--> 537 indices = self.matcher(outputs_without_aux, targets, group_detr=group_detr)
539 # Compute the average number of target boxes accross all nodes, for normalization purposes
540 num_boxes = sum(len(t["labels"]) for t in targets)

File ~\AppData\Roaming\Python\Python313\site-packages\torch\nn\modules\module.py:1751, in Module._wrapped_call_impl(self, *args, **kwargs)
1749 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1750 else:
-> 1751 return self._call_impl(*args, **kwargs)

File ~\AppData\Roaming\Python\Python313\site-packages\torch\nn\modules\module.py:1762, in Module._call_impl(self, *args, **kwargs)
1757 # If we don't have any hooks, we want to skip the rest of the logic in
1758 # this function, and just call forward.
1759 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1760 or _global_backward_pre_hooks or _global_backward_hooks
1761 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1762 return forward_call(*args, **kwargs)
1764 result = None
1765 called_always_called_hooks = set()

File ~\AppData\Roaming\Python\Python313\site-packages\torch\utils\_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)

File ~\AppData\Roaming\Python\Python313\site-packages\rfdetr\models\matcher.py:152, in HungarianMatcher.forward(self, outputs, targets, group_detr)
150 for g_i in range(group_detr):
151 C_g = C_list[g_i]
--> 152 indices_g = [linear_sum_assignment(c[i]) for i, c in enumerate(C_g.split(sizes, -1))]
153 if g_i == 0:
154 indices = indices_g

ValueError: matrix contains invalid numeric entries
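
For context, SciPy's `linear_sum_assignment` raises this error when the cost matrix it receives contains NaN entries (and, as far as I understand, negative infinity as well). Below is a minimal sketch of a check that could be run on a cost matrix before matching to see which entries are the problem. The helper name `find_non_finite` is hypothetical and is not part of rfdetr or SciPy; it works on plain nested lists, so a tensor would first need converting (for example with `.tolist()`).

```python
import math


def find_non_finite(cost_matrix):
    """Return (row, col, value) for every NaN or infinite entry.

    Hypothetical helper for illustration; expects a nested list of floats.
    """
    bad = []
    for r, row in enumerate(cost_matrix):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((r, c, value))
    return bad


# Toy cost matrix (predictions x targets) with two invalid entries.
cost = [
    [0.2, 1.5, 0.7],
    [float("nan"), 0.1, 0.9],
    [0.4, float("inf"), 0.3],
]

bad_entries = find_non_finite(cost)
for r, c, value in bad_entries:
    print(f"non-finite cost at row {r}, col {c}: {value}")
```

If a check like this reports NaN costs only after a few epochs, that would suggest the model outputs themselves became NaN during training rather than the annotations being malformed, but that is an assumption to verify, not a confirmed diagnosis.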

The training run crashes anywhere between epochs 2 and 20. I am also adding a screenshot of the terminal output to show where that happens.

To fix this, I have tried the following:

  • Lowered the batch size from 16 to 8, and now to 4.
  • Wrote a script that flags images in the dataset whose annotations might be malformed. Of the original 1500 images, I am now down to 1250, because the bounding boxes of the rest seemed to be out of bounds. (This is odd, because the dataset was downloaded straight from the Roboflow application in the COCO segmentation annotation format.)
  • To narrow down where the problem lies, I also trained an RFDETRNano() instead of the RFDETRSegPreview() on the same dataset, but in COCO annotation format (not COCO segmentation annotation format), also downloaded directly from the Roboflow application. That worked perfectly fine.
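
A dataset check like the one described in the second bullet could look roughly like the sketch below. This is an illustration, not the script that was actually used. It assumes a standard COCO JSON file in which `bbox` is `[x, y, width, height]` in pixels and polygon segmentations are flat `[x1, y1, x2, y2, ...]` lists. The function names and the tolerance `tol` are hypothetical.

```python
import json


def load_coco(path):
    """Load a COCO annotation file (path is supplied by the caller)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def find_bad_annotations(coco, tol=1e-6):
    """Return (annotation_id, reason) pairs for suspicious annotations."""
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    problems = []
    for ann in coco["annotations"]:
        size = sizes.get(ann["image_id"])
        if size is None:
            problems.append((ann["id"], "unknown image_id"))
            continue
        img_w, img_h = size

        # COCO boxes are [x, y, width, height].
        x, y, w, h = ann["bbox"]
        if w <= 0 or h <= 0:
            problems.append((ann["id"], "non-positive box size"))
        elif x < -tol or y < -tol or x + w > img_w + tol or y + h > img_h + tol:
            problems.append((ann["id"], "box outside image"))

        # Only polygon segmentations are checked; RLE dicts are skipped.
        seg = ann.get("segmentation", [])
        if isinstance(seg, list):
            for poly in seg:
                if len(poly) < 6:
                    problems.append((ann["id"], "polygon has fewer than 3 points"))
                    continue
                xs, ys = poly[0::2], poly[1::2]
                if (min(xs) < -tol or max(xs) > img_w + tol
                        or min(ys) < -tol or max(ys) > img_h + tol):
                    problems.append((ann["id"], "polygon outside image"))
    return problems


# Tiny in-memory example; a real run would use load_coco(<path to the JSON>).
example = {
    "images": [{"id": 1, "width": 100, "height": 80}],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [10, 10, 20, 20],
         "segmentation": [[10, 10, 30, 10, 30, 30, 10, 30]]},
        {"id": 11, "image_id": 1, "bbox": [90, 10, 20, 20],
         "segmentation": [[90, 10, 110, 10, 110, 30]]},
        {"id": 12, "image_id": 1, "bbox": [5, 5, 0, 10],
         "segmentation": [[5, 5, 5, 15]]},
    ],
}

problems = find_bad_annotations(example)
for ann_id, reason in problems:
    print(f"annotation {ann_id}: {reason}")
```

Checking the polygons as well as the boxes matters here because the detection-only run succeeded on the same images, so any annotation problem specific to the segmentation export would only show up in the `segmentation` field.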

Hope y’all can help me out, thanks for reading! 🙂

  • Project Type: Segmentation
  • Operating System & Browser: Windows
  • Project Universe Link or Workspace/Project ID: aiprojectretrofiteurope/rooftops-all-ugayi
