How to count detected classes in a video using detectron2 and supervision

I have created a video object detection model using detectron2, and it runs inference on the video correctly.
But I would now like to count the number of detected classes as the video progresses.

I have tried exploring supervision, but I would like some help on how to implement this.
Thanks in advance

Hey @Kamal_Moha

Could you clarify where you are at the moment? If you have your detections in Supervision, you can simply take the length of the detections object:

# detections = sv.Detections()...
print(len(detections)) # Will print the number of detected objects
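
If you want per-class counts rather than just the total, `detections.class_id` is a NumPy array, so you can tally it directly. A small sketch, using a stand-in array in place of a real `Detections` object:

```python
import numpy as np

# Stand-in for detections.class_id from an sv.Detections object
class_id = np.array([0, 0, 2, 2, 2, 17])

total = len(class_id)  # equivalent to len(detections)

# Count how many detections belong to each class id
ids, counts = np.unique(class_id, return_counts=True)
per_class = dict(zip(ids.tolist(), counts.tolist()))

print(total)      # 6
print(per_class)  # {0: 2, 2: 3, 17: 1}
```

You can run this once per frame and accumulate the counts in a dictionary if you want running totals as the video progresses.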

Hi @leo, I have managed to print the length of detections for each frame.

import numpy as np
import supervision as sv

video_info = sv.VideoInfo.from_video_path(SUBWAY_VIDEO_PATH)
# line_zone = sv.LineZone(sv.Point(600, 0), sv.Point(600, 300))
# initiate polygon zone
polygon = np.array([
    [44, 22],
    [52, 710],
    [1288, 714],
    [1292, 22],
    [48, 18]
])

zone = sv.PolygonZone(polygon=polygon, frame_resolution_wh=video_info.resolution_wh)


# initiate annotators
box_annotator = sv.BoundingBoxAnnotator(thickness=4)
zone_annotator = sv.PolygonZoneAnnotator(zone=zone, color=sv.Color.white(), 
                                         thickness=6, text_thickness=6, text_scale=4)

def process_frame(frame: np.ndarray, i: int) -> np.ndarray:
    print('frame', i)
    # extract video frame
    generator = sv.get_video_frames_generator(SUBWAY_VIDEO_PATH)
    iterator = iter(generator)
    frame = next(iterator)

    # detect
    outputs = predictor(frame)
    all_detections = sv.Detections.from_detectron2(outputs)
    #filtering for 'person' detection
    detections = all_detections[all_detections.class_id == 0]
    zone.trigger(detections=detections)

    # annotate
    box_annotator = sv.BoxAnnotator(thickness=4, text_thickness=4, text_scale=2)
    frame = box_annotator.annotate(scene=frame, detections=detections, skip_label=True)
    frame = zone_annotator.annotate(scene=frame)
    return frame

sv.process_video(source_path=SUBWAY_VIDEO_PATH, target_path=f"new-result.mp4", 
                 callback=process_frame)

Running the above code in a Kaggle notebook produces the video frame below:

The video it produces has the issues below that I would like help with:

  1. sv.process_video() has only made predictions on the first frame, neglecting the other frames. By the way, the original video is 6 sec long with 181 frames. When I open the newly created video, it doesn’t show the other frames and it’s not a continuous video.
  2. How can I widen the polygon (coordinates) so that it covers the whole frame? As you can see from the pic above, the polygon is drawn in white. I want to enlarge it so that I can detect more objects in the frame.

Thanks, and I would appreciate your support on this.

Hey @Kamal_Moha

The Supervision process_video function passes the video frame and its frame number to the callback, in this case your process_frame function. Inside the callback you should not need to replace the frame variable, yet your code creates a new Supervision frame generator on every call and takes next() from it, which always yields the first frame of the video.

That means this portion of your code is unnecessary: it overwrites the frame variable with the first frame, regardless of which frame is actually being processed.

    # extract video frame
    generator = sv.get_video_frames_generator(SUBWAY_VIDEO_PATH)
    iterator = iter(generator)
    frame = next(iterator)
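
To make the callback contract concrete, here is a toy stand-in for process_video (not the real implementation): it hands each frame and its index to your callback, one call per frame, so the callback never has to read the video itself.

```python
import numpy as np

# Toy stand-in that mimics how sv.process_video drives a callback:
# it passes each frame and its index in, one call per frame.
def fake_process_video(frames, callback):
    return [callback(frame, i) for i, frame in enumerate(frames)]

seen = []

def callback(frame: np.ndarray, i: int) -> np.ndarray:
    seen.append(i)  # run inference / annotate here in real code
    return frame    # return the (annotated) frame

frames = [np.zeros((4, 4, 3), dtype=np.uint8) for _ in range(3)]
fake_process_video(frames, callback)
print(seen)  # [0, 1, 2] -- every frame is visited exactly once
```

So in your process_frame, just use the frame argument directly for inference and annotation, and delete the generator block quoted above.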

It’s not completely clear to me what you are trying to accomplish here, but you can change the polygon size where you declare polygon:

polygon = np.array([
    [44, 22], # Change these to different (x, y) coordinates of the frame
    [52, 710],
    [1288, 714],
    [1292, 22],
    [48, 18]
])
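
If the goal is a zone covering the whole frame, you can build the polygon from the video resolution instead of hard-coding coordinates. A minimal sketch, using a stand-in resolution in place of your video_info.resolution_wh (the zone mask is filled from the points, so four corners should be enough, with no need to repeat the first point):

```python
import numpy as np

# Stand-in for video_info.resolution_wh, which is (width, height)
width, height = 1280, 720

# Four corners of the frame, clockwise, in (x, y) pixel coordinates
polygon = np.array([
    [0, 0],
    [width - 1, 0],
    [width - 1, height - 1],
    [0, height - 1],
])

print(polygon.shape)  # (4, 2)
```

You would then pass this polygon to sv.PolygonZone as before; every detection inside the frame should now fall inside the zone.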