RF-DETR - Latency Results Do Not Match Paper

Hello,

I am benchmarking the model from the paper and am seeing a noticeable mismatch between the latency numbers it reports and the latency I measure in practice.

Setup details:

  • Batch size: 1

  • GPU: NVIDIA RTX PRO 4500 (Blackwell)

  • Inference focused (no training, no data loading overhead)

Despite matching the batch size and using a modern high-end GPU, the measured latency is consistently higher than what is reported in the paper. I want to confirm:

  1. Whether the paper’s latency numbers were measured with any specific assumptions (e.g., mixed precision, TensorRT, specific CUDA/cuDNN versions, or warmup strategy).

  2. Whether preprocessing/postprocessing was excluded from the reported latency.

  3. If Flash Attention, fused kernels, or other backend-specific optimizations were explicitly enabled.

  4. Whether the reported numbers reflect end-to-end latency or pure model forward time.

Any clarification on the exact benchmarking methodology used in the paper would be very helpful, as I am trying to reproduce the results as closely as possible.
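For reference, my measurement loop follows the usual warmup-then-time pattern. This is a minimal sketch of my own harness, not code from the paper or its repo; the function names and iteration counts are my choices:

```python
import time
import statistics

def benchmark_latency(infer_fn, warmup=50, iters=200):
    """Measure per-call latency of `infer_fn` (batch size 1 assumed).

    Warmup iterations are discarded so one-time costs (kernel
    compilation, cache population, autotuning) do not skew the numbers.
    """
    for _ in range(warmup):
        infer_fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer_fn()
        # NOTE: on a GPU backend you must synchronize here
        # (e.g. torch.cuda.synchronize()) before reading the clock,
        # otherwise you time only the asynchronous kernel launch.
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": sorted(samples)[int(0.95 * len(samples))],
    }

# Usage with a stand-in workload (replace with the real model call):
stats = benchmark_latency(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

I report the median rather than the mean, since stray slow iterations otherwise dominate.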

Thank you!

The paper states the specific conditions under which latency is measured; see Section 4 and the appendix on CUDA graphs. As mentioned in the paper, we use TensorRT 10.4 and CUDA 12.4 on a T4 GPU with FP16 and CUDA graphs enabled. We measure model forward time, which for a DETR is the end-to-end time.

As is also mentioned in the paper, we take a close look at power throttling as a source of inconsistency in latency measurements across papers and propose a solution. To facilitate research, we link to an open-source repo containing code to exactly reproduce our claimed latencies and accuracies.
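One cheap way to spot throttling in your own trace, sketched below (the function name and the half-split heuristic are illustrative, not the paper's actual procedure; the linked repo has the exact methodology), is to compare the median latency of the early and late portions of a long run. A sustained upward drift suggests the GPU clocked down mid-benchmark:

```python
import statistics

def throttling_drift(latencies_ms):
    """Split a latency trace into first and second halves and report
    the relative change in median latency. A large positive value
    suggests thermal or power throttling during the run."""
    half = len(latencies_ms) // 2
    early = statistics.median(latencies_ms[:half])
    late = statistics.median(latencies_ms[half:])
    return (late - early) / early

# Synthetic trace: steady at ~10 ms, then drifting up to ~12 ms.
trace = [10.0] * 100 + [12.0] * 100
print(f"{throttling_drift(trace):.0%}")  # → 20%
```

If the drift is near zero, throttling is unlikely to explain a gap between your numbers and the paper's.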
