Best Practices for Annotating Surgical Instrument Data for YOLOv12n + Inference Class Restriction

Hi Roboflow community,

I’m working on a computer vision project for surgical instrument detection during operations. I’ve uploaded and annotated a dataset here:
My Surgical Instrument Detection Dataset

I exported the dataset in YOLOv12 format and trained a YOLOv12n model on Kaggle for 100 epochs. However, during inference, I noticed that the model sometimes predicts classes like “person” or “car” that are part of the COCO dataset but don’t exist in mine.
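For reference, my inference setup looks roughly like this (Ultralytics Python API; the weights path is a placeholder for my Kaggle run):

```python
from ultralytics import YOLO

# Placeholder path to the fine-tuned weights from my Kaggle training run
model = YOLO("runs/detect/train/weights/best.pt")

# Sanity check: I'd expect only my surgical instrument classes here
print(model.names)

results = model.predict("sample_frame.jpg")
```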

Here are some questions I’m struggling with:

  • Annotation Strategy: Should surgical instruments be labeled individually (scalpel, forceps, etc.), or can they be grouped into one general class like “instrument”?
  • Annotation Considerations: Any specific tips for labeling surgical data where lighting, glare, motion blur, and occlusion are common?
  • Inference Issue: How can I restrict inference to only my custom classes and avoid default YOLO classes showing up?

Any advice or similar project experiences would be greatly appreciated.

Thanks!

Hi @Tixtor710,

Could you try training the model on Roboflow and using inference.roboflow.com to run detections, to see if you hit the same inference issue? We can’t debug models trained outside Roboflow.

Both annotation questions are pretty specific to the task you are trying to accomplish. Can you tell us more?

Hey Tixtor710. I dropped your model into a Roboflow workflow real quick. I couldn’t replicate the extra classes, but one nice thing about that platform is that you can see the output and check confidence levels. You might find a threshold where those irrelevant classes get filtered out. Here’s what the first output looked like:

Another feature in Workflows is that you can list the classes you want to generate predictions for, which should keep anything outside your dataset from showing up. But with some additional info from you, people will probably be able to offer more ideas/suggestions.
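If you end up running the model outside Roboflow as well, the Ultralytics API exposes the same two knobs: a `conf` threshold and a `classes` argument that restricts predictions to the class indices you pass in. A rough sketch (the weights path and class indices are placeholders, not from your project):

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # placeholder path

# Keep only detections above a confidence floor and within your own
# class indices (0..N-1 from your data.yaml); values here are examples.
results = model.predict(
    "sample_frame.jpg",
    conf=0.5,           # tune against your validation set
    classes=[0, 1, 2],  # e.g. scalpel, forceps, retractor
)

for r in results:
    for box in r.boxes:
        print(model.names[int(box.cls)], float(box.conf))
```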

In general I have better luck annotating items individually when there’s any real visual variation between them, since a single catch-all class forces the model to over-generalize. If you put everything into one “instrument” class, you might get false positives on similar-looking objects, especially in frames where you’re zoomed out in the operating room.

Blur is just tough. You can certainly use augmentations to create sample blurry images, but in the long run it’s better to capture higher-quality images for the model to evaluate.
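If you do try the augmentation route for blur, Albumentations has a MotionBlur transform that’s easy to run as an offline pass over your training images. A minimal sketch (file paths and parameters are placeholders to tune by eye against real blurry frames):

```python
import cv2
import albumentations as A

# Pixel-only transforms: blur and brightness don't move anything,
# so existing YOLO box labels stay valid for the augmented copies.
augment = A.Compose([
    A.MotionBlur(blur_limit=(3, 9), p=0.7),
    A.RandomBrightnessContrast(p=0.3),  # rough stand-in for glare/lighting shifts
])

image = cv2.imread("frames/sample_frame.jpg")  # placeholder path
blurred = augment(image=image)["image"]
cv2.imwrite("frames/sample_frame_blurred.jpg", blurred)
```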
