Hello,
I am benchmarking the model from the paper and seeing a noticeable mismatch between the reported latency numbers and the latency I measure in practice.
Setup details:
- Batch size: 1
- GPU: NVIDIA RTX PRO 4500 (Blackwell)
- Inference only (no training, no data-loading overhead)
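For context, this is roughly how I am timing the model: a warmup phase followed by repeated timed calls, reporting the median. This is a minimal, framework-agnostic sketch; `run_inference` is a hypothetical stand-in for the actual forward call, and on GPU the callable would need to synchronize (e.g. `torch.cuda.synchronize()`) so queued kernels are included in the measurement.

```python
import time
import statistics

def benchmark(fn, warmup=10, iters=100):
    """Return the median latency of fn() in milliseconds.

    Warmup iterations are excluded so one-time costs (kernel
    compilation, memory-pool growth, caching) do not inflate the
    measurement. On GPU, fn must block until the work is done
    (e.g. by calling torch.cuda.synchronize() internally),
    otherwise only kernel-launch time is measured.
    """
    for _ in range(warmup):
        fn()
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times_ms)

# Hypothetical stand-in for the real model forward pass.
def run_inference():
    return sum(i * i for i in range(10_000))

latency_ms = benchmark(run_inference)
print(f"median latency: {latency_ms:.3f} ms")
```

I report the median rather than the mean so that occasional scheduler or clock-frequency outliers do not skew the result.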
Despite matching the batch size and using a modern high-end GPU, the measured latency is consistently higher than what is reported in the paper. I want to confirm:
- Whether the paper's latency numbers were measured under any specific assumptions (e.g., mixed precision, TensorRT, specific CUDA/cuDNN versions, or warmup strategy).
- Whether preprocessing/postprocessing was excluded from the reported latency.
- Whether Flash Attention, fused kernels, or other backend-specific optimizations were explicitly enabled.
- Whether the reported numbers reflect end-to-end latency or pure model forward time.
Any clarification on the exact benchmarking methodology used in the paper would be very helpful, as I am trying to reproduce the results as closely as possible.
Thank you!