I’m using RF-DETR Segmentation to track a dense group of sheep entering a barn (lots of occlusions + very similar-looking individuals) and I need stable per-animal trajectories with minimal ID switches.
Right now I’m running ByteTrack via Supervision, but after checking SkalskiP’s latest projects I noticed he often uses SAM/SAM2 for tracking, and in practice it seems to work really well (e.g. in his recent basketball project he uses RF-DETR for the initial detections and then SAM2 to handle the tracking). I’ve also been reading that SAM/SAM2-based trackers can outperform more classic trackers in very dense scenes.
In your experience, which approach works best with RF-DETR Seg when you care about identity-stable tracking in crowded scenes: ByteTrack or a SAM/SAM2-based tracker? And are there any key settings or reference pipelines you’d recommend?
Also, how does a SAM2-style tracker handle the case where not all objects are present in the first frame (e.g., some sheep enter later): do you periodically re-initialize with new detections, or is there a standard way to add new tracks on the fly?
Great question! This is a nuanced tradeoff, and as usual in such cases, the “right” answer depends on your latency requirements and how severe the occlusions are.
Here is a breakdown of ByteTrack vs. SAM2-based tracking:
| Factor | ByteTrack | SAM2 Video Predictor |
| --- | --- | --- |
| ID stability through occlusion | Moderate: relies on Kalman + IoU, struggles when sheep overlap heavily | Strong: memory bank maintains appearance features across occlusions |
| Visually similar objects | Weak: no appearance model by default | Better: keeps per-object memory features from past frames |
| Speed | Fast (roughly real-time) | Slower: memory propagation has overhead |
| New object handling | Native: unmatched detections start new tracks | Requires explicit re-initialization |
For your setup (sheep entering a barn, with continuous entry and heavy occlusion at the doorway), I’d suggest:
- Use SAM2 as the primary tracker for ID stability.
- Run RF-DETR detection frequently near the entry zone (maybe every 5-10 frames) to catch new sheep.
- Run detection less frequently once sheep are inside and tracked.
- Consider a spatial prior: if you know where the door is, only look for new tracks in that region.
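The detection schedule and the spatial prior can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the entry-zone coordinates, the two intervals, and the names `detection_region` and `new_track_candidates` are hypothetical, and the actual RF-DETR and SAM2 calls are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (x1, y1, x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)

    def contains(self, point: Tuple[float, float]) -> bool:
        x, y = point
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2


# Hypothetical values: tune them for your camera, door position and frame rate.
ENTRY_ZONE = Box(0, 200, 300, 700)  # region around the barn door
ENTRY_INTERVAL = 5                  # run the detector on the entry zone every 5 frames
FULL_FRAME_INTERVAL = 30            # run it on the whole frame every 30 frames


def detection_region(frame_idx: int) -> Optional[str]:
    """Decide which part of the frame (if any) the detector should see."""
    if frame_idx % FULL_FRAME_INTERVAL == 0:
        return "full_frame"
    if frame_idx % ENTRY_INTERVAL == 0:
        return "entry_zone"
    return None  # tracker-only frame


def new_track_candidates(detections: List[Box]) -> List[Box]:
    """Spatial prior: only detections centered in the entry zone may start a track."""
    return [d for d in detections if ENTRY_ZONE.contains(d.center())]
```

The occasional full-frame pass is a safety net for animals that slipped past the entry-zone check; whether you need it depends on how reliable detection at the door is.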
If latency is critical and you need real-time performance, you could also look at BoT-SORT as a middle ground: it builds on ByteTrack with camera-motion compensation and optional ReID appearance embeddings, without the full SAM2 overhead.
Note that SAM2’s video predictor does not auto-discover new objects. It only propagates masks for objects you explicitly initialize. So yes, you need to periodically re-run detection and add new tracks manually.
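One simple way to decide which fresh detections are genuinely new objects is to compare them against the objects already being tracked and keep only the ones that overlap none of them. The sketch below does this with box IoU in plain Python; `find_new_objects`, the 0.3 threshold, and the ID counter are my assumptions, not part of SAM2 or Supervision.

```python
from itertools import count
from typing import Dict, Iterator, List, Tuple

BoxXYXY = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def box_iou(a: BoxXYXY, b: BoxXYXY) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def find_new_objects(
    tracks: Dict[int, BoxXYXY],
    detections: List[BoxXYXY],
    id_source: Iterator[int],
    iou_threshold: float = 0.3,  # assumed value; tune on your footage
) -> Dict[int, BoxXYXY]:
    """Return {new_id: box} for detections that match no existing track."""
    new_objects: Dict[int, BoxXYXY] = {}
    for det in detections:
        best = max((box_iou(det, t) for t in tracks.values()), default=0.0)
        if best < iou_threshold:
            # No tracked object explains this detection: treat it as a new animal.
            new_objects[next(id_source)] = det
    return new_objects


# Usage: two tracked sheep, one detection that matches track 0 and one newcomer.
tracks = {0: (0, 0, 100, 100), 1: (200, 0, 300, 100)}
detections = [(5, 5, 105, 105), (400, 0, 500, 100)]
new_objects = find_new_objects(tracks, detections, count(start=2))
```

Each entry in `new_objects` would then be registered with the tracker under its new ID at the current frame. In SAM2’s video predictor that means prompting with a mask or box for a new object ID (for example via `add_new_mask`); whether this is allowed after propagation has started depends on the SAM2 version, so check your release. With RF-DETR Seg you could compute the overlap on masks instead of boxes, which should be stricter when animals are packed together.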