Serious inconsistency between Roboflow training-report and test-page performance, plus multi-model inference anomalies
Hello everyone, I have run into a series of confusing problems while using the Roboflow platform for object detection model training and testing, and I would like to ask the community and the official team for advice:
1. Serious inconsistency between training and testing performance
When I train a model on the Roboflow platform, the training report shows a test-set mAP as high as 99%, with very high mAP for both classes.
However, when I run inference on the very same test-set images on the test page (Visualize/Deploy page), objects of one class are almost never detected, with confidence of only 1-2%, which is completely at odds with the high mAP in the training report.
My dataset has no duplicate images, and I used Roboflow's automatic train/valid/test split, so in theory there should be no data leakage.
Even if the model were overfitting, it should not perform this poorly on the test-set images.
My question: shouldn't the model perform the same on the test split scored in the training report as on those same images uploaded manually to the test page? Why is there such a large gap?
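For reference, this is the kind of script I have been meaning to use to re-check the same images against the hosted model outside the browser, with the confidence threshold lowered to 1% so that even weak detections are returned. It is only a minimal sketch using the Roboflow Python SDK; the API key, version number, and image path are placeholders:

```python
from roboflow import Roboflow

# Placeholders: API key and version number are assumptions for illustration.
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace().project("background-removal-object-detection-gfhnn")
model = project.version(1).model  # hypothetical version number

# Run hosted inference on one of the test-split images with a very low
# confidence threshold, so even weak detections are returned.
result = model.predict("test_image.jpg", confidence=1, overlap=30).json()

for pred in result["predictions"]:
    print(pred["class"], pred["confidence"])
```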
2. Preprocessing and inference process questions
I noticed that Roboflow automatically resizes images during training, but I am not sure whether the test page / model preview applies the same resizing.
If the test page does not apply the same preprocessing, could that cause the inference results to be inconsistent with the training report?
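If it helps with diagnosis, this is the check I am considering: manually resizing a test image to the training preprocessing size before sending it to the hosted model, and comparing the result against the un-resized image. The 640x640 stretch resize is an assumption based on my preprocessing settings, and `model` is the object from the snippet above:

```python
import cv2

# Assumption: the training preprocessing was "Resize: Stretch to 640x640".
img = cv2.imread("test_image.jpg")
resized = cv2.resize(img, (640, 640))
cv2.imwrite("test_image_640.jpg", resized)

# Compare hosted inference on the original vs. manually resized image.
original = model.predict("test_image.jpg", confidence=1).json()
preprocessed = model.predict("test_image_640.jpg", confidence=1).json()
print(len(original["predictions"]), len(preprocessed["predictions"]))
```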
3. Multi-model inference performance differences
I have trained and tested YOLOv12, YOLOv11, YOLO-NAS, and Roboflow 3.0 models on the same dataset.
Only the YOLOv11 model infers normally; the other models behave abnormally on the test page and fail to detect objects reliably.
The training-report metrics of all these models are very high, yet their actual inference results differ greatly.
Can you explain why only YOLOv11 performs normally while the others produce abnormal inference results? Are there compatibility or preprocessing differences on the platform when running inference with different model types?
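To compare the models outside the Visualize page, I am planning to run the same image through each trained version via the API, roughly as sketched below. The version numbers are placeholders for the YOLOv12 / YOLOv11 / YOLO-NAS / Roboflow 3.0 versions in my project, and `project` is the object from the first snippet:

```python
# Hypothetical mapping from model type to version number in my project.
versions = {"YOLOv12": 1, "YOLOv11": 2, "YOLO-NAS": 3, "Roboflow 3.0": 4}

for name, version_number in versions.items():
    model = project.version(version_number).model
    result = model.predict("test_image.jpg", confidence=1).json()
    detections = [(p["class"], round(p["confidence"], 3)) for p in result["predictions"]]
    print(name, detections)
```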
I hope the community or the Roboflow team can help answer these questions. Anyone who has run into similar issues is welcome to share their experience. Thank you very much!
- Project Type: Object Detection
- Operating System & Browser: macOS & Safari
- Project ID: background-removal-object-detection-gfhnn