Best tracker to pair with RF-DETR Seg for dense scenes?

Hi all!

I’m using RF-DETR Segmentation to track a dense group of sheep entering a barn (lots of occlusions + very similar-looking individuals) and I need stable per-animal trajectories with minimal ID switches.

Right now I’m running ByteTrack via Supervision, but after checking SkalskiP’s latest projects I noticed he often uses SAM/SAM2 tracking, and in practice it seems to work really well! (E.g. in his recent basketball project he uses RF-DETR for the initial detections and then SAM2 to handle the tracking.) I’ve also been reading that SAM/SAM2-based trackers can outperform more classic trackers in very dense situations.

In your experience, which approach works best with RF-DETR Seg when you care about identity-stable tracking in crowded scenes: ByteTrack or a SAM/SAM2-based tracker? And are there any key settings or reference pipelines you’d recommend?

Also, how does a SAM2-style tracker handle the case where not all objects are present in the first frame (e.g., some sheep enter later): do you periodically re-initialize with new detections, or is there a standard way to add new tracks on the fly?

Great question! This is a nuanced tradeoff, and as it usually goes in such cases, the “right” answer depends on your latency requirements and how severe the occlusions are.

Here is a breakdown of the main considerations for ByteTrack vs. SAM2-based tracking:

| Factor | ByteTrack | SAM2 Video Predictor |
|---|---|---|
| ID stability through occlusion | Moderate: relies on Kalman + IoU, struggles when sheep overlap heavily | Strong: memory bank maintains appearance features across occlusions |
| Visually similar objects | Weak: no appearance model by default | Better: learns per-object embeddings |
| Speed | Fast (~real-time) | Slower: memory propagation has overhead |
| New object handling | Native: just match new detections | Requires explicit re-initialization |

Given your scenario of sheep entering a barn (continuous entry, heavy occlusion at the doorway), I would suggest:

  1. Use SAM2 as primary tracker for ID stability

  2. Run RF-DETR detection frequently near the entry zone (maybe every 5-10 frames) to catch new sheep

  3. Run detection less frequently once sheep are inside and tracked

  4. Consider a spatial prior — if you know where the door is, only look for new tracks in that region
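To make steps 2 to 4 a bit more concrete, here is a minimal sketch of the scheduling and spatial-prior logic. It is not tied to any tracker library: boxes are plain `(x1, y1, x2, y2)` tuples, and the names `should_run_detection`, `entry_candidates`, the `zone_active` flag, and the interval values are all illustrative assumptions you would tune for your own footage.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def should_run_detection(frame_idx: int, zone_active: bool,
                         entry_interval: int = 5,
                         idle_interval: int = 30) -> bool:
    """Run the detector often while the doorway is busy, rarely otherwise.

    `zone_active` is a hypothetical flag, e.g. "something was seen in the
    entry zone recently". Both intervals are illustrative values.
    """
    interval = entry_interval if zone_active else idle_interval
    return frame_idx % interval == 0


def in_zone(box: Box, zone: Box) -> bool:
    """True if the centre of `box` lies inside the rectangular `zone`."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    zx1, zy1, zx2, zy2 = zone
    return zx1 <= cx <= zx2 and zy1 <= cy <= zy2


def entry_candidates(detections: List[Box], zone: Box) -> List[Box]:
    """Spatial prior: only detections near the door may start new tracks."""
    return [d for d in detections if in_zone(d, zone)]
```

On frames where `should_run_detection` returns `True`, you would run RF-DETR, pass its boxes through `entry_candidates`, and only consider the survivors as possible new tracks.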

If latency is critical and you need real-time, you could also look at BoT-SORT (ByteTrack + appearance features) as a middle ground — it adds ReID embeddings to ByteTrack without the full SAM2 overhead.
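For intuition on what "adding appearance features" means, here is a simplified sketch of an association cost that fuses IoU with a ReID embedding distance. It is loosely modeled on the gated fusion described for BoT-SORT, not taken from any implementation; the function names and the threshold values (`iou_gate`, `emb_gate`) are assumptions for illustration, and the embeddings would come from whatever ReID model you choose.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def fused_cost(track_box: Box, det_box: Box,
               track_emb: Sequence[float], det_emb: Sequence[float],
               iou_gate: float = 0.5, emb_gate: float = 0.25) -> float:
    """Lower is better. Appearance only counts when both cues are plausible."""
    d_iou = 1.0 - iou(track_box, det_box)
    d_cos = cosine_distance(track_emb, det_emb)
    if d_cos < emb_gate and d_iou < iou_gate:
        d_app = 0.5 * d_cos  # trusted appearance match
    else:
        d_app = 1.0          # appearance is ignored for this pair
    return min(d_iou, d_app)
```

The gating is the important part for sheep: because individuals look alike, appearance is only allowed to lower the cost when the boxes also overlap reasonably, so a similar-looking animal on the other side of the barn cannot steal an ID.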

Note that SAM2’s video predictor does not auto-discover new objects. It only propagates masks for objects you explicitly initialize. So yes, you need to periodically re-run detection and add new tracks manually.
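The "which detections are actually new?" part can be sketched independently of SAM2. The snippet below is a minimal example under stated assumptions: tracked objects and detections are represented by `(x1, y1, x2, y2)` boxes (for masks you would use mask IoU instead), and `find_new_objects` and the `iou_thresh` value are hypothetical names and numbers. The hand-off to the tracker is deliberately left out, since how new objects are registered mid-video depends on the SAM2 version or wrapper you use.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def find_new_objects(tracked: Dict[int, Box], detections: List[Box],
                     next_id: int,
                     iou_thresh: float = 0.3) -> Tuple[Dict[int, Box], int]:
    """Return detections that overlap no known object, with fresh IDs.

    `tracked` maps object ID -> current box from the tracker.
    A detection is treated as new if its IoU with every known object
    (including ones added earlier in this call) is below `iou_thresh`.
    """
    new_objects: Dict[int, Box] = {}
    for det in detections:
        known = list(tracked.values()) + list(new_objects.values())
        if all(iou(det, k) < iou_thresh for k in known):
            new_objects[next_id] = det
            next_id += 1
    return new_objects, next_id
```

Each entry in the returned dictionary is a candidate to initialize in the tracker as a new object prompt on the current frame. In very dense scenes a single IoU threshold will misfire at times, so combining it with an entry-zone filter is a reasonable safeguard.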

We have some nifty models in Roboflow Universe you may want to check out as well: https://universe.roboflow.com/riis/aerial-sheep

Let me know if this was helpful!

Bar Shimshon
