Performance metrics calculated manually do not match metrics in UI

I am writing a script to calculate performance metrics on an eval dataset using a model I trained in the Roboflow UI. So far I cannot get my mAR and mAP metrics to match the values the UI reports after a model finishes training. I am using a YOLOv11 model and dataset. Does anyone have an idea what is causing the discrepancy?

My script reports:

MAP @ 50: 0.4156380482177502
MAR @ 100: 0.2680126076647888
F1score @ 50: 0.7939534061499741

My script:

import base64
import sys
from pathlib import Path

import cv2
import requests
import supervision as sv

# API_KEY, MODEL_ID, and MODEL_VERSION are defined elsewhere in my script.


def predict_image(image_path, visualize=True):
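    # POST a base64-encoded image to the locally hosted inference server
    # and wrap the JSON response in a supervision Detections object.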
    url = f"http://localhost:9001/{MODEL_ID}/{MODEL_VERSION}"
    headers = {"Content-Type": "application/json"}
    params = {
        "api_key": API_KEY,
        "confidence": 0.5,
    }
    with open(image_path, "rb") as image_file:
        encoded_image = base64.b64encode(image_file.read()).decode("utf-8")

    image = cv2.imread(image_path)
    response = requests.post(
        url, headers=headers, params=params, data=encoded_image
    ).json()
    detections = sv.Detections.from_inference(response)
    if visualize:
        box_annotator = sv.BoxAnnotator()
        label_annotator = sv.LabelAnnotator()
        labels = [
            f"{class_name} {det_confidence:0.2f}"
            for det_confidence, class_name in zip(detections.confidence, detections.data["class_name"])
        ]
        annotated_image = box_annotator.annotate(
            image, detections=detections
        )

        annotated_image = label_annotator.annotate(
            annotated_image, detections=detections, labels=labels
        )

        cv2.imshow("Annotated image", annotated_image)
        cv2.waitKey(0)
    return detections


def iter_dataset(path, dataset_dirs=["test", "valid"]):
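    # Load the YOLO-format splits with supervision and yield
    # (image path, image array, ground-truth detections) tuples.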
    path = Path(path)
    for ds_dir in dataset_dirs:
        ds_path = path / ds_dir
        ds = sv.DetectionDataset.from_yolo(
            images_directory_path=str(ds_path / "images"),
            annotations_directory_path=str(ds_path / "labels"),
            data_yaml_path=str(path / "data.yaml"),
        )
        for img_path, img_arr, detections in ds:
            yield img_path, img_arr, detections


def perform_prediction(dataset_path, visualize=False, count=None):
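    # Run the model on every image in the dataset splits and collect the
    # predictions alongside the ground-truth annotations.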
    all_predictions = []
    all_targets = []
    current_count = 0
    for image_file, img_arr, gt_detections in iter_dataset(dataset_path):
        # Optionally stop after `count` images (handy for quick debugging runs)
        if count is not None:
            if current_count == count:
                break
            current_count += 1
        detections = predict_image(image_file, visualize=visualize)
        all_predictions.append(detections)
        all_targets.append(gt_detections)
    return all_targets, all_predictions


def calc_metrics(targets, predictions):
    # Mean average precision (mAP)
    precision_metric = MeanAveragePrecision(metric_target=MetricTarget.MASKS)
    precision_result = precision_metric.update(predictions, targets).compute()
    print(f"MAP @ 50: {precision_result.map50_95}")

    # Mean average recall (mAR)
    recall_metric = MeanAverageRecall(metric_target=MetricTarget.MASKS)
    recall_result = recall_metric.update(predictions, targets).compute()
    print(f"MAR @ 100: {recall_result.mAR_at_100}")

    # Calculate F1 score
    f1_metric = F1Score(metric_target=MetricTarget.MASKS)
    f1_result = f1_metric.update(predictions, targets).compute()
    print(f"F1score @ 50: {f1_result.f1_50}")

    return precision_result, recall_result, f1_result




if __name__ == "__main__":
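    # Usage: python <this_script>.py /path/to/yolo/dataset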
    dataset_path = sys.argv[1]
    targets, predictions = perform_prediction(
        dataset_path,
        # count=10,
        # visualize=True,
    )

    precision, recall, f1_score = calc_metrics(targets, predictions)

There are two likely sources of a discrepancy in your code:

  1. We report ultralytics-based mAP, which does not return the same value as pycocotools-based mAP. Where did you source your mAP implementation? I don’t see an import statement.

  2. You’re reporting mAP@50:95, whereas we report mAP@50.
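For the second point, if your metrics come from a library whose result object exposes both values (Roboflow's supervision library does, for example), the difference is just which attribute you print. A minimal sketch, assuming `predictions` and `targets` are the same lists of detections your script already builds:

from supervision.metrics import MeanAveragePrecision, MetricTarget

# Sketch only: `predictions` and `targets` are lists of sv.Detections.
map_metric = MeanAveragePrecision(metric_target=MetricTarget.MASKS)
map_result = map_metric.update(predictions, targets).compute()

print(f"mAP@50:95: {map_result.map50_95}")  # averaged over IoU thresholds 0.50-0.95
print(f"mAP@50:    {map_result.map50}")     # IoU threshold 0.50 only (what the UI reports)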

Thanks for the reply!

  1. I am using Roboflow’s supervision library. My import statement:
from supervision.metrics import (
    MeanAverageRecall,
    MeanAveragePrecision,
    MetricTarget,
    F1Score,
)
  2. I updated the reported metric to map50, and it did increase a bit, to 0.64. I am also using masks as my metric target instead of bounding boxes, though I am unsure which the UI uses to calculate the metrics.
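For reference, here is a rough sketch of how I could compare the two metric targets on the same lists returned by perform_prediction (assuming supervision's MetricTarget.BOXES is the bounding-box counterpart of MetricTarget.MASKS; I have not verified which one the UI matches):

from supervision.metrics import MeanAveragePrecision, MetricTarget

# Sketch: compute mAP@50 with both metric targets to see how much the choice matters.
# `targets` and `predictions` are the lists returned by perform_prediction().
for metric_target in (MetricTarget.BOXES, MetricTarget.MASKS):
    metric = MeanAveragePrecision(metric_target=metric_target)
    result = metric.update(predictions, targets).compute()
    print(f"{metric_target}: mAP@50 = {result.map50:.4f}")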

If you’re training an instance segmentation model, then our metric should be using masks. What mAP does the UI report? I’m not seeing it in your post.

I posted a screenshot from the UI, where it reports precision to be 91%, which I assume is mAP since it's a multi-class model.
