This post is about training RF-DETR locally.

Project Type: Object detection

Operating System & Browser: Windows 11 and Ubuntu 22.04

Project Universe Link or Workspace/Project ID: NA

The Challenges of Training Non-Base RF-DETR Variants Locally

Not all RF-DETR variants are created equal when it comes to local training. I am developing models that perform detection in a healthcare application where patient identification is possible. I cannot share data in the cloud. I want to use the larger tier RF-DETR models to enhance accuracy, but I am experiencing difficulties with local training.

The Base RF-DETR is the only variant that consistently works offline. It retains its DINOv2 backbone, supports the windowed multi-scale encoder, allows flexible input resolutions, and enables the detection head to be re-initialized for a custom number of classes. This makes it the only practical option today for adapting RF-DETR to sensitive datasets that must remain on-premises.
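For contrast, here is roughly what the working Base workflow looks like (a minimal sketch: the dataset path and epoch count are illustrative, and the keyword names follow the full example later in this post):

```python
from rfdetr import RFDETRBase

# Minimal local fine-tuning sketch for the Base variant.
# The pretrained DINOv2 backbone loads, and the detection head
# is re-initialized for a custom class count.
model = RFDETRBase()
model.train(
    dataset_dir="path/to/coco_dataset",  # data stays on-premises
    num_classes=3,
    num_epochs=10,
    device="cuda",
    output_dir="output",
)
```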

By contrast, the Nano, Small, and Medium variants struggle in local environments. They often lose their DINOv2 backbones, lack multi-scale support, fail on resolution changes, and don’t cleanly allow new detection heads. While these other models perform well and are easy to train on Roboflow’s hosted service, their current limitations block their use in domains where data cannot leave secure systems. Expanding full local training support across all tiers would unlock the accuracy of larger models while preserving the flexibility of smaller ones.

Here is a code example:

from rfdetr import RFDETRMedium, RFDETRBase

def load_model():
    # Swapping in RFDETRBase() here trains without issues
    model = RFDETRMedium()
    return model

def train_model(model):
    model.train(
        dataset_dir="Datasets_COCO/TTC_Annotated",
        num_classes=3,
        device="cuda",
        num_epochs=2,

        patch_size=14,
        multi_scale=True,

        # GPU dependent - 8 GiB VRAM
        batch_size=2,
        grad_accum_steps=8,
        resolution=784,

        lr=1e-4,

        output_dir="output",
    )

if __name__ == "__main__":
    model = load_model()
    train_model(model)

With the base RF-DETR model, the pretrained DINOv2 weights are loaded and the detection head is re-initialized to 3 classes without issues. When I use the Medium tier, the trainer will not load the DINOv2 backbone because its positional-encoding count differs from DINOv2's, and it switches the patch size to 16, which triggers a runtime error because the resolution is not divisible by 32.

Is this by design? Are there code modifications that would allow me to train the medium variant locally and retain the backbone and resolution support?

Thank you.

mdunham7915
