Trackers fail compared to frame-by-frame Keypoint Detection

How do I tune trackers for video inference?

My Keypoint Detection model reached the following metrics:

  • mAP@50: 96.6%
  • Precision: 100%
  • Recall: 90.9%
  • F1: 95.2%

And it shows good-quality keypoint detections on the sample frames I’ve provided.

But when it comes to video inference, the model seems to fall apart. The standard code from the documentation doesn’t include any tracking at all, so I’ve tried adding trackers:

  • OC-SORT doesn’t outperform any of the other trackers
  • SORTTracker also gives poor results
  • DetectionsSmoother barely changed the resulting bbox coordinates
  • ByteTrack often loses track of an object and produces hundreds of IDs (only ~2 are needed)

All of them often ‘lose’ objects in the back of the scene (probably because of an incorrect IoU threshold).
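To illustrate why I suspect the IoU threshold: between consecutive frames, a box that shrinks quickly (an object moving away from the camera) overlaps its previous position only weakly, so IoU-based association can reject the match. A minimal self-contained check (plain Python, not tied to any tracker library; the box values are made up for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

# Same object on two consecutive frames: it recedes and the box shrinks.
prev_box = (100, 100, 200, 200)  # frame t
next_box = (120, 120, 180, 180)  # frame t+1, object moved away
print(iou(prev_box, next_box))   # prints 0.36 -- a weak overlap
```

So even for a correct detection of the same object, the frame-to-frame IoU can sit far below a strict matching threshold, and the track gets dropped instead of extended.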

I’ve added hard poses and frames with missing vertices to the dataset, as the GPT assistant in the Roboflow documentation advised, but it brought almost no improvement compared to frame-by-frame inference.
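Since the frame-by-frame detections are already good, I’ve also been wondering whether simple temporal smoothing of the keypoints themselves (rather than bbox tracking) would be enough. A minimal exponential-moving-average sketch; `KeypointEMA` is a hypothetical helper I wrote for illustration, not part of supervision:

```python
class KeypointEMA:
    """Exponential moving average over per-frame keypoints.

    Hypothetical helper (not a supervision API): smooths jitter in
    (x, y) keypoints when object identity per frame is already known.
    """

    def __init__(self, alpha=0.4):
        self.alpha = alpha   # weight of the newest observation
        self.state = None    # last smoothed keypoints

    def update(self, keypoints):
        """keypoints: list of (x, y) tuples for one object, one frame."""
        if self.state is None:
            self.state = list(keypoints)
        else:
            self.state = [
                (self.alpha * x + (1 - self.alpha) * px,
                 self.alpha * y + (1 - self.alpha) * py)
                for (x, y), (px, py) in zip(keypoints, self.state)
            ]
        return self.state

# Usage: a jumpy keypoint gets pulled toward its recent history.
smoother = KeypointEMA(alpha=0.5)
smoother.update([(0.0, 0.0)])
print(smoother.update([(10.0, 10.0)]))  # prints [(5.0, 5.0)]
```

This obviously doesn’t solve identity assignment by itself, but it might be a cheaper fix for jitter than a full tracker.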

I’ve also tried to tune ByteTrack:

tracker = sv.ByteTrack(
    lost_track_buffer=75,
    track_activation_threshold=0.18,
    minimum_matching_threshold=0.7,
    minimum_consecutive_frames=1,
)

But these parameter values don’t seem to help either.
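As a fallback, since the scene only ever contains ~2 objects, I’m also considering skipping the general-purpose trackers and doing a tiny greedy association step of my own. A sketch under that assumption (`FixedCountTracker` is hypothetical, not a supervision class; it matches each existing track to the nearest detection by centroid distance and never invents extra IDs for matched objects):

```python
def centroid(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


class FixedCountTracker:
    """Greedy nearest-centroid association for a known, small object count.

    Hypothetical sketch, not a supervision API. Each existing track grabs
    the closest new detection; unmatched tracks keep their last box instead
    of dying, so IDs stay stable; leftover detections start new tracks.
    """

    def __init__(self, max_distance=100.0):
        self.tracks = {}          # track id -> last known box
        self.next_id = 0
        self.max_distance = max_distance

    def update(self, boxes):
        assigned = {}
        unmatched = list(boxes)
        for tid, last in self.tracks.items():
            if not unmatched:
                break
            cx, cy = centroid(last)
            best = min(
                unmatched,
                key=lambda b: (centroid(b)[0] - cx) ** 2 + (centroid(b)[1] - cy) ** 2,
            )
            bx, by = centroid(best)
            if ((bx - cx) ** 2 + (by - cy) ** 2) ** 0.5 <= self.max_distance:
                assigned[tid] = best
                unmatched.remove(best)
        # keep unmatched tracks alive at their last position (no ID churn)
        for tid, last in self.tracks.items():
            assigned.setdefault(tid, last)
        # brand-new tracks only for genuinely unexplained detections
        for b in unmatched:
            assigned[self.next_id] = b
            self.next_id += 1
        self.tracks = assigned
        return assigned
```

Usage: feed it per-frame bboxes; IDs 0 and 1 persist even as the boxes shrink or shift, which is exactly where ByteTrack was spawning hundreds of IDs for me.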

You can compare the inference quality in the image below:

How should I tune these trackers, or change the video inference code, to get better results?