How large of a dataset is necessary to fine-tune something like SAM?

Of course it will vary depending on the scenario, but it would be very helpful to get some ballpark numbers. What order of magnitude have you seen be successful?

Hi @ansonkao

This is a good question; I had the same one after creating my first model.

You're already on the right track: it does in fact vary depending on the scenario. Most of the ready-to-use, high-accuracy models you'll find on Roboflow Universe tend to have a total dataset size of anywhere from 1,000 to 10,000 images, and I'll explain why.

Like you said, the number of images you train on depends on the scenario or task.

For small, not-so-complex projects

  • You may be able to get away with training on a few hundred well-labeled images and still get a good result.
  • An example of this could be distinct objects against a uniform background.

For moderately complex tasks

  • A thousand to several thousand images should be sufficient.
  • An example of this might be multiple types of objects under varied lighting conditions.

Finally, for highly complex tasks

  • Tens of thousands to hundreds of thousands of images.
  • Generally, consumer projects won't reach this size, but I think it should still be mentioned. Tasks requiring this much data might include cityscape segmentation: identifying pedestrians, buildings, trees, roads, signs, and more, all in the same scene.
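To make the tiers above easier to reuse, here is a minimal Python sketch that maps a rough complexity label to a ballpark image range. The function name, the tier labels, and the exact bounds are my own reading of "a few hundred", "thousands", and "tens to hundreds of thousands"; treat them as order-of-magnitude guides, not hard thresholds.

```python
# Ballpark dataset sizes per task complexity, as (low, high) image counts.
# The bounds are assumptions that approximate the tiers described above.
BALLPARK_RANGES = {
    "simple": (100, 1_000),          # distinct objects, uniform background
    "moderate": (1_000, 10_000),     # several object types, varied lighting
    "complex": (10_000, 1_000_000),  # e.g. full cityscape segmentation
}


def ballpark_image_range(complexity: str) -> tuple[int, int]:
    """Return a rough (low, high) image count for a complexity tier."""
    key = complexity.strip().lower()
    if key not in BALLPARK_RANGES:
        raise ValueError(
            f"unknown complexity {complexity!r}; "
            f"expected one of {sorted(BALLPARK_RANGES)}"
        )
    return BALLPARK_RANGES[key]


if __name__ == "__main__":
    low, high = ballpark_image_range("moderate")
    print(f"Aim for roughly {low:,} to {high:,} labeled images.")
```

Label quality still matters more than hitting a particular number, so these ranges are only a starting point for planning.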

Other important things to mention

  • Image quantity is not the only deciding factor. Labeling and annotation quality are crucial, because they are what separate a merely large dataset from a good one, and quality is arguably more important than quantity.

  • This documentation should help clear things up with SAM: Enhanced Smart Polygon with SAM - Roboflow Docs

  • With large, flexible models, it is also important to improve performance by adding augmented images: Generate Augmented Images - Roboflow Docs

  • Another thing to note is running data health checks, which give you a range of insights about your dataset: Health Check - Roboflow Docs
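As an illustration of the kind of insight a health check can give, here is a small Python sketch that measures class balance from a flat list of annotation labels. This is not how Roboflow's Health Check is implemented; the function names and the 10% threshold are assumptions chosen for the example.

```python
from collections import Counter


def class_balance(labels):
    """Map each class name to (annotation count, share of all annotations)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: (n, n / total) for name, n in counts.items()}


def underrepresented(labels, min_share=0.1):
    """Return class names whose share of annotations is below min_share.

    min_share is an arbitrary example threshold, not a recommended value.
    """
    balance = class_balance(labels)
    return sorted(name for name, (_, share) in balance.items() if share < min_share)


if __name__ == "__main__":
    # Hypothetical annotation labels collected from a dataset export.
    labels = ["car"] * 80 + ["pedestrian"] * 15 + ["sign"] * 5
    for name, (count, share) in class_balance(labels).items():
        print(f"{name}: {count} annotations ({share:.0%})")
    print("Consider collecting more examples of:", underrepresented(labels))
```

Classes flagged this way are good candidates for collecting more images or for targeted augmentation.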

Lastly, I think it is worth mentioning that with larger datasets, labeling and annotating can become quite cumbersome and repetitive. To do this more efficiently, you might consider training a model on a few hundred images first and then using it with Roboflow's Label Assist to speed up annotation of the remaining thousands. This documentation will help you there: Model-Assisted Labeling - Roboflow Docs

Hope this helps!