TensorRT-converted weights not working with supervision

I am working on a computer vision project, and for it I trained a custom model on top of yolov8l-seg.
I am getting 6-7 FPS when I run inference, so to increase the FPS I converted the weights to TensorRT, which produced a best.engine file.
When I run inference using best.pt it works fine and I don't get any error, but I get the following error when I use best.engine.
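
For context, I created the engine with Ultralytics' built-in exporter, roughly like this (a minimal sketch; the exact arguments may differ from what I actually ran):

from ultralytics import YOLO

# Export the trained segmentation weights to a TensorRT engine.
# imgsz/half are illustrative here, not necessarily the flags I used.
model = YOLO("best.pt")
model.export(format="engine", imgsz=640, half=True)  # writes best.engine next to best.pt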

Here is the code

model = YOLO(".\best.engine")

logging.basicConfig(filename='output.log', level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

class CustomSink:
    def __init__(self, weights_path: str, zone_configuration_path: str, classes: List[int]):
        self._model = YOLO(weights_path)
        self.classes = classes
        self.tracker = sv.ByteTrack(minimum_matching_threshold=0.5)
        self.fps_monitor = sv.FPSMonitor()
        self.polygons = load_zones_config(file_path=zone_configuration_path)
        self.timers = [ClockBasedTimer() for _ in self.polygons]
        self.zones = [
            sv.PolygonZone(
                polygon=polygon,
                triggering_anchors=(sv.Position.CENTER,),
            )
            for polygon in self.polygons
        ]

    def infer(self, video_frames: List[VideoFrame]) -> List[Any]:
        # result must be returned as list of elements representing model prediction for single frame
        # with order unchanged.
        return self._model([v.image for v in video_frames])

    def on_prediction(self, result: dict, frame: VideoFrame) -> None:
        self.fps_monitor.tick()
        fps = self.fps_monitor.fps
        # modify the following code to adjust 
        detections = sv.Detections.from_ultralytics(result)
        detections = detections[find_in_list(detections.class_id, self.classes)]
        detections = self.tracker.update_with_detections(detections)

        annotated_frame = frame.image.copy()

        annotated_frame = sv.draw_text(
            scene=annotated_frame,
            text=f"{fps:.1f}",
            text_anchor=sv.Point(40, 30),
            background_color=sv.Color.from_hex("#A351FB"),
            text_color=sv.Color.from_hex("#000000"),
        )

        for idx, zone in enumerate(self.zones):
            annotated_frame = sv.draw_polygon(
                scene=annotated_frame, polygon=zone.polygon, color=COLORS.by_idx(idx)
            )

            detections_in_zone = detections[zone.trigger(detections)]
            time_in_zone = self.timers[idx].tick(detections_in_zone)
            custom_color_lookup = np.full(detections_in_zone.class_id.shape, idx)

            annotated_frame = COLOR_ANNOTATOR.annotate(
                scene=annotated_frame,
                detections=detections_in_zone,
                custom_color_lookup=custom_color_lookup,
            )
            labels = [
                f"#{tracker_id} {int(time // 60):02d}:{int(time % 60):02d}"
                for tracker_id, time in zip(detections_in_zone.tracker_id, time_in_zone)
            ]
            annotated_frame = LABEL_ANNOTATOR.annotate(
                scene=annotated_frame,
                detections=detections_in_zone,
                labels=labels,
                custom_color_lookup=custom_color_lookup,
            )
        cv2.imshow("Processed Video", annotated_frame)
        # cv2.waitKey(1)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            cv2.destroyAllWindows()
            raise SystemExit("Program terminated by user")
    

def main(
    weight_path: str,
    rtsp_url: str,
    zone_configuration_path: str,
    model_id: str,
    confidence: float,
    iou: float,
    classes: List[int],
) -> None:
    sink = CustomSink(weights_path=weight_path, zone_configuration_path=zone_configuration_path, classes=classes)

    pipeline = InferencePipeline.init_with_custom_logic(
        # pass custom model 
        video_reference=rtsp_url,
        on_video_frame=sink.infer,
        on_prediction=sink.on_prediction,
        # confidence=confidence,
        # iou_threshold=iou,
    )

    pipeline.start()

    try:
        pipeline.join()
    except KeyboardInterrupt:
        pipeline.terminate()

I am passing the rtsp_url, the path to the weights, etc. as command-line arguments.

What is the issue?
I have an NVIDIA RTX 4070 Laptop GPU
Operating system: Windows 11

Hello there,

The issue you encountered seems to be quite similar to the one reported here: Export to TensorRT KeyError · Issue #6471 · ultralytics/ultralytics · GitHub

It seems this was the solution: Export to TensorRT KeyError · Issue #6471 · ultralytics/ultralytics · GitHub

As you are probably using an object-detection model, I suggest trying:

self._model = YOLO(weights_path, task="detect")

and starting to debug the problem by getting the YOLO model to work on its own, then moving on to wrapping it with InferencePipeline.
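
Something along these lines should be enough to confirm the engine loads and runs on its own (a minimal sketch; the test image path is just a placeholder):

from ultralytics import YOLO

# Load the exported engine with the task stated explicitly and run a single
# prediction outside InferencePipeline.
model = YOLO("best.engine", task="detect")  # use task="segment" for a segmentation model
results = model("test_frame.jpg")           # placeholder image
print(results[0].boxes)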

Yes, this is the fix. I had to pass task="segment" since I am doing segmentation. But even after converting the model to TensorRT I am unable to improve the FPS; I am getting 4-5 FPS for segmentation on my NVIDIA RTX 4070 laptop GPU.
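
For anyone landing here later, the only change needed in my sink was:

self._model = YOLO(weights_path, task="segment")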

There are a few details I need to know to help you:

  • what is the model size?
  • what is the resolution of the footage?
  • what is the source of the footage: a video file, a USB camera, or an RTSP stream? If one of the last two, I need to know its parameters (FPS of the source).

  1. I trained the model on a custom dataset, using yolov8l-seg as the base model.
  2. I trained the model at the default resolution (640).
  3. I am getting the live camera feed through an RTSP URL; the video is 1920×1080 at around 17 FPS.

When I use an object-detection model I get around 25 FPS for the same stream.
I converted to TensorRT to get better results. What changes can I make to get better FPS?

I am not sure exactly how the TensorRT conversion works; I assume it requires the image to be 640px, so once you provide 1080p footage in the script it gets downsized - but this is something you could verify.
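
One way to verify this (a sketch, assuming the standard Ultralytics Results API): print the original frame shape next to the per-stage timings; the verbose output also shows the shape actually fed to the engine.

import cv2
from ultralytics import YOLO

model = YOLO("best.engine", task="segment")

frame = cv2.imread("test_frame.jpg")  # placeholder: any 1080p frame grabbed from the stream
results = model(frame, verbose=True)  # verbose output prints the "(1, 3, 640, 640)" input shape

print("original frame shape:", results[0].orig_shape)  # e.g. (1080, 1920)
print("timings (ms):", results[0].speed)               # {'preprocess': ..., 'inference': ..., 'postprocess': ...}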

I would start by checking what the bottleneck may be in this case.

Could you first grab some video frames from the camera and run a for-loop in a Python script measuring the time the model inference takes? By doing so you would mimic what is done here:

    def infer(self, video_frames: List[VideoFrame]) -> List[Any]:
        # result must be returned as list of elements representing model prediction for single frame
        # with order unchanged.
        return self._model([v.image for v in video_frames])

just without the InferencePipeline and real-time video processing. Please take a look at the actual input resolution for the model and the GPU utilisation.
If you see that results are produced at a fast pace, we will go deeper to see whether the bottleneck is in InferencePipeline.

I ran the following code

import cv2
import time
from ultralytics import YOLO
import numpy as np

# Initialize the YOLO model
model = YOLO("C:\\Users\\mubas\\OneDrive\\Desktop\\ultralytics\\segmentation\\seg-trained-model-weights\\best.engine", task='segment')

def capture_and_infer(rtsp_url: str, num_frames: int) -> None:
    cap = cv2.VideoCapture(rtsp_url)
    if not cap.isOpened():
        print("Error: Unable to open video stream")
        return

    for i in range(num_frames):
        ret, frame = cap.read()
        if not ret:
            print("Error: Unable to read frame from video stream")
            break

        # Start measuring time
        start_time = time.time()

        # Perform inference
        results = model([frame])

        # End measuring time
        inference_time = time.time() - start_time
        print(f"Frame {i+1}: Inference time = {inference_time:.4f} seconds")

        # Process results if needed (this is just an example)
        for result in results:
            pass

    cap.release()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    rtsp_url = "rtsp://admin:<password>@192.168.0.126:554/Streaming/Channels/2001"
    num_frames = 10  
    capture_and_infer(rtsp_url, num_frames)

and I am getting this output

[06/11/2024-11:09:40] [TRT] [I] Loaded engine size: 258 MiB
[06/11/2024-11:09:40] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +99, now: CPU 0, GPU 352 (MiB)
[06/11/2024-11:09:40] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading

0: 640x640 (no detections), 39.5ms
Speed: 29.4ms preprocess, 39.5ms inference, 621.8ms postprocess per image at shape (1, 3, 640, 640)
Frame 1: Inference time = 3.5867 seconds

0: 640x640 16 cars, 1 number_plate, 39.0ms
Speed: 3.0ms preprocess, 39.0ms inference, 1056.5ms postprocess per image at shape (1, 3, 640, 640)
Frame 2: Inference time = 1.1005 seconds

0: 640x640 16 cars, 1 number_plate, 40.0ms
Speed: 2.0ms preprocess, 40.0ms inference, 6.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 3: Inference time = 0.0490 seconds

0: 640x640 16 cars, 1 number_plate, 38.5ms
Speed: 3.0ms preprocess, 38.5ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 4: Inference time = 0.0465 seconds

0: 640x640 16 cars, 1 number_plate, 43.0ms
Speed: 2.0ms preprocess, 43.0ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 5: Inference time = 0.0490 seconds

0: 640x640 16 cars, 2 number_plates, 42.2ms
Speed: 2.0ms preprocess, 42.2ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 6: Inference time = 0.0502 seconds

0: 640x640 16 cars, 1 number_plate, 42.0ms
Speed: 3.0ms preprocess, 42.0ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 7: Inference time = 0.0500 seconds

0: 640x640 16 cars, 2 number_plates, 41.0ms
Speed: 2.0ms preprocess, 41.0ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 8: Inference time = 0.0490 seconds

0: 640x640 16 cars, 2 number_plates, 42.0ms
Speed: 3.0ms preprocess, 42.0ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 9: Inference time = 0.0530 seconds

0: 640x640 16 cars, 2 number_plates, 41.0ms
Speed: 3.0ms preprocess, 41.0ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 10: Inference time = 0.0490 seconds

OK, great,

so you should be able to reach about 20 FPS running raw TRT inference against a 640x640 image, at least in terms of GPU throughput (~45-50 ms per frame works out to roughly 20 FPS).

Let's do two things.

  1. Could you please measure the whole frame-processing time:
    for i in range(num_frames):
        frame_start = time.time()
        ret, frame = cap.read()
        if not ret:
            print("Error: Unable to read frame from video stream")
            break

        # Start measuring time
        start_time = time.time()

        # Perform inference
        results = model([frame])

        # End measuring time
        inference_time = time.time() - start_time
        print(f"Frame {i+1}: Inference time = {inference_time:.4f} seconds")

        # Process results if needed (this is just an example)
        for result in results:
            # Do something with the result if needed
            pass
        total_frame_time = time.time() - frame_start
        print(f"Frame {i+1}: total time = {total_frame_time:.4f} seconds")

That would make it clear how long it takes to (a) grab and decode a frame and (b) run inference.

  2. Having done that, I would compare it to InferencePipeline with a sink:
from datetime import datetime


def debug_on_prediction(result: dict, frame: VideoFrame):
    latency = (datetime.now() - frame.frame_timestamp).total_seconds()
    print(f"E2E latency inference pipeline: {round(latency, 4)}s")


# then change the sink in your original code
pipeline = InferencePipeline.init_with_custom_logic(
        # pass custom model 
        video_reference=rtsp_url,
        on_video_frame=sink.infer,
        on_prediction=debug_on_prediction,
)

The reason I ask for that is the following:

  1. We need to see how performant your setup is at processing a frame from start to end - it may happen that your GPU can process 20 frames a second, but grabbing and decoding frames is the limiting factor.
  2. The second exercise checks the same thing using InferencePipeline as the decoding platform, but without the sink logic from the original snippet (tracking + zone post-processing).

And if you can, please run more than 10 frames; 100 would probably provide a more stable estimate.
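
Something like this would do (a sketch of the measurement loop; skipping the first few frames avoids the warm-up spike visible in your logs):

import time
import cv2
from ultralytics import YOLO

model = YOLO("best.engine", task="segment")
cap = cv2.VideoCapture("rtsp://...")  # placeholder URL

NUM_FRAMES, WARMUP = 100, 5
inference_times, total_times = [], []

for i in range(NUM_FRAMES):
    frame_start = time.time()
    ret, frame = cap.read()
    if not ret:
        break
    infer_start = time.time()
    model([frame])
    if i >= WARMUP:  # ignore warm-up frames when averaging
        inference_times.append(time.time() - infer_start)
        total_times.append(time.time() - frame_start)

cap.release()
print(f"avg inference: {sum(inference_times) / len(inference_times):.4f}s")
print(f"avg total per frame: {sum(total_times) / len(total_times):.4f}s")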

0: 640x640 (no detections), 39.0ms
Speed: 30.6ms preprocess, 39.0ms inference, 630.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 1: Inference time = 4.6094 seconds

0: 640x640 18 cars, 40.2ms
Speed: 3.0ms preprocess, 40.2ms inference, 1077.1ms postprocess per image at shape (1, 3, 640, 640)
Frame 2: Inference time = 1.1223 seconds

0: 640x640 18 cars, 39.7ms
Speed: 2.0ms preprocess, 39.7ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 3: Inference time = 0.0476 seconds

0: 640x640 18 cars, 39.1ms
Speed: 2.0ms preprocess, 39.1ms inference, 3.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 4: Inference time = 0.0451 seconds

0: 640x640 18 cars, 41.9ms
Speed: 5.0ms preprocess, 41.9ms inference, 6.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 5: Inference time = 0.0549 seconds

0: 640x640 18 cars, 46.0ms
Speed: 6.0ms preprocess, 46.0ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 6: Inference time = 0.0550 seconds

0: 640x640 18 cars, 41.7ms
Speed: 3.0ms preprocess, 41.7ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 7: Inference time = 0.0477 seconds

0: 640x640 18 cars, 44.0ms
Speed: 2.0ms preprocess, 44.0ms inference, 4.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 8: Inference time = 0.0520 seconds

0: 640x640 18 cars, 43.7ms
Speed: 2.0ms preprocess, 43.7ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 9: Inference time = 0.0488 seconds

0: 640x640 19 cars, 44.3ms
Speed: 2.0ms preprocess, 44.3ms inference, 2.0ms postprocess per image at shape (1, 3, 640, 640)
Frame 10: Inference time = 0.0493 seconds
Frame 10: total time = 0.0538 seconds

For the second:

Speed: 2.5ms preprocess, 44.7ms inference, 1143.3ms postprocess per image at shape (1, 3, 640, 640)
Speed: 7.0ms preprocess, 43.1ms inference, 6.0ms postprocess per image at shape (1, 3, 640, 640)
Speed: 5.0ms preprocess, 44.7ms inference, 7.0ms postprocess per image at shape (1, 3, 640, 640)
E2E latency inference pipeline: 0.1069s

Also, I keep getting this warning, so it is very hard to read the output:
SupervisionWarnings: __call__ is deprecated: `FPSMonitor.__call__` is deprecated and will be removed in `supervision-0.22.0`. Use `FPSMonitor.fps` instead.
and I can't stop the program unless I kill the terminal. I tried Ctrl+C and Ctrl+X and still can't stop the program.

SupervisionWarnings: __call__ is deprecated: `FPSMonitor.__call__` is deprecated and will be removed in `supervision-0.22.0`. Use `FPSMonitor.fps` instead.
SupervisionWarnings: __call__ is deprecated: `FPSMonitor.__call__` is deprecated and will be removed in `supervision-0.22.0`. Use `FPSMonitor.fps` instead.
SupervisionWarnings: __call__ is deprecated: `FPSMonitor.__call__` is deprecated and will be removed in `supervision-0.22.0`. Use `FPSMonitor.fps` instead.
0: 640x640 19 cars, 43.0ms
Speed: 3.0ms preprocess, 43.0ms inference, 63.5ms postprocess per image at shape (1, 3, 640, 640)

Got this error

Error Code 1: Cuda Runtime (out of memory)
Error Code 1: Cuda Driver (out of memory)

Ok,

as this is multi-threaded code, terminating the InferencePipeline requires:

    pipeline.start()

    try:
        pipeline.join()
    except KeyboardInterrupt:
        pipeline.terminate()

To avoid CUDA errors from a process that was not terminated, you may need to kill the processes occupying VRAM. To do that, find the Python processes using nvidia-smi and kill them; that should de-allocate the memory.

Also, in the dump you provided for the first case,
Frame {i}: total time = ... seconds appears only once, not once per frame,

and for InferencePipeline you have a few dumps of
Speed: 2.5ms preprocess, 44.7ms inference, 1143.3ms postprocess per image at shape (1, 3, 640, 640)
with only one entry of E2E latency inference pipeline: 0.1069s.
Please run it longer so that we get insight into the process behaviour after everything starts and stabilises. I would say something is wrong with a 100ms E2E latency given the model takes ~50ms, but to evaluate the root cause I need to see whether that is only a temporary state at startup or something present throughout the whole run.

This is my main function; I have the keyboard interrupt set up properly.

def main(
    weight_path: str,
    rtsp_url: str,
    zone_configuration_path: str,
    model_id: str,
    confidence: float,
    iou: float,
    classes: List[int],
) -> None:
    sink = CustomSink(weights_path=weight_path, zone_configuration_path=zone_configuration_path, classes=classes)

    pipeline = InferencePipeline.init_with_custom_logic(
        # pass custom model 
        video_reference=rtsp_url,
        on_video_frame=sink.infer,
        on_prediction=debug_on_prediction
        # on_prediction=sink.on_prediction,
        # confidence=confidence,
        # iou_threshold=iou,
    )

    pipeline.start()

    try:
        pipeline.join()
    except KeyboardInterrupt:
        pipeline.terminate()


I keep getting the FPSMonitor.__call__ deprecation warning in my terminal, which covers all the logs.

see

OK, that is fine.
To remove the warnings: export SUPERVISON_DEPRECATION_WARNING=0

I see that there is latency introduced by the presence of InferencePipeline - I'm not sure at the moment why; maybe this reveals a weakness in the implementation that can be removed. I would need to take a closer look.

Signals are not handled properly, probably due to the execution of non-Python threads under the hood; this is also something I would need to take a closer look at.

Let's do one final thing, since in InferencePipeline latency != throughput due to the threading involved.

MONITOR = sv.FPSMonitor()

def debug_on_prediction(result: dict, frame: VideoFrame, monitor: sv.FPSMonitor = MONITOR):
    monitor.tick()
    latency = (datetime.now() - frame.frame_timestamp).total_seconds()
    print(f"E2E latency inference pipeline: {round(latency, 4)}s, throughput: {monitor.fps} fps")

This is to verify the throughput without the polygon-zone post-processing.

I bet this will show the value you were reporting - around 7-8 FPS.

If that is the case, I will have suggestions for tuning the performance of the inference pipeline, but most likely we also have something to improve on our side, so that we do not introduce such a large latency.

Alright, I added the export using

import os
os.environ['SUPERVISON_DEPRECATION_WARNING'] = '0'

in case someone comes here in the future.
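
One note for future readers (an assumption on my part, not something I verified in the supervision source): setting the variable before supervision is imported seems the safest order, so the flag is already in place when the warning machinery runs:

import os
os.environ['SUPERVISON_DEPRECATION_WARNING'] = '0'  # assumption: set before importing supervision

import supervision as sv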