Serious inconsistency between Roboflow training-report and test-page performance, plus multi-model inference anomalies
Hello everyone, I have run into a series of confusing problems while using the Roboflow platform for object detection model training and testing, and I would like to ask the community and the official team for advice:
1. Serious inconsistency between training and testing performance
When I train a model on the Roboflow platform, the training report shows a test-set mAP as high as 99%, with very high mAP for both classes.
However, when I run inference on the very same test-set images on the test page (Visualize/Deploy page), objects of one class are almost never detected, with confidence of only 1-2%, which is completely at odds with the high mAP in the training report.
My dataset has no duplicate images, and I used Roboflow's automatic train/valid/test split, so in theory there should be no data leakage.
Even if the model were overfitting, it should not perform this poorly on the test-set images.
My question: shouldn't the model perform the same on the test split scored in the training report as on those same images uploaded manually to the test page? Why is there such a large gap?
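For reference, this is the kind of script I have been meaning to use to re-check the same images against the hosted model outside the browser, with the confidence threshold lowered to 1% so that even weak detections are returned. It is only a minimal sketch using the Roboflow Python SDK; the API key, version number, and image path are placeholders:

```python
from roboflow import Roboflow

# Placeholders: API key and version number are assumptions for illustration.
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace().project("background-removal-object-detection-gfhnn")
model = project.version(1).model  # hypothetical version number

# Run hosted inference on one of the test-split images with a very low
# confidence threshold, so even weak detections are returned.
result = model.predict("test_image.jpg", confidence=1, overlap=30).json()

for pred in result["predictions"]:
    print(pred["class"], pred["confidence"])
```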
2. Preprocessing and inference process questions
I noticed that Roboflow automatically resizes images during training, but I am not sure whether the test page / model preview applies the same resizing.
If the test page does not apply the same preprocessing, could that cause the inference results to be inconsistent with the training report?
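If it helps with diagnosis, this is the check I am considering: manually resizing a test image to the training preprocessing size before sending it to the hosted model, and comparing the result against the un-resized image. The 640x640 stretch resize is an assumption based on my preprocessing settings, and `model` is the object from the snippet above:

```python
import cv2

# Assumption: the training preprocessing was "Resize: Stretch to 640x640".
img = cv2.imread("test_image.jpg")
resized = cv2.resize(img, (640, 640))
cv2.imwrite("test_image_640.jpg", resized)

# Compare hosted inference on the original vs. manually resized image.
original = model.predict("test_image.jpg", confidence=1).json()
preprocessed = model.predict("test_image_640.jpg", confidence=1).json()
print(len(original["predictions"]), len(preprocessed["predictions"]))
```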
3. Multi-model inference performance differences
I have trained and tested YOLOv12, YOLOv11, YOLO-NAS, and Roboflow 3.0 models on the same dataset.
Only the YOLOv11 model infers normally; the other models behave abnormally on the test page and fail to detect objects reliably.
The training-report metrics of all these models are very high, yet their actual inference results differ greatly.
Can you explain why only YOLOv11 performs normally while the others produce abnormal inference results? Are there compatibility or preprocessing differences on the platform when running inference with different model types?
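To compare the models outside the Visualize page, I am planning to run the same image through each trained version via the API, roughly as sketched below. The version numbers are placeholders for the YOLOv12 / YOLOv11 / YOLO-NAS / Roboflow 3.0 versions in my project, and `project` is the object from the first snippet:

```python
# Hypothetical mapping from model type to version number in my project.
versions = {"YOLOv12": 1, "YOLOv11": 2, "YOLO-NAS": 3, "Roboflow 3.0": 4}

for name, version_number in versions.items():
    model = project.version(version_number).model
    result = model.predict("test_image.jpg", confidence=1).json()
    detections = [(p["class"], round(p["confidence"], 3)) for p in result["predictions"]]
    print(name, detections)
```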
I hope the community or the Roboflow team can help answer these questions. Anyone who has run into similar issues is welcome to share their experience. Thank you very much!
- Project Type: Object Detection
- Operating System & Browser: macOS & Safari
- Project ID: background-removal-object-detection-gfhnn