Need help understanding a few things (Large Data Sets)

So my issue is that I need to train a model on a large dataset of satellite images of roofs, similar to this: Building Image Dataset and Pre-Trained Model by Ayush Wattal, Ritu Bhamrah, Ritanjali Jena, and Sai Kumar Sannidhi.
However, I want to incorporate automation, as I’ll need thousands of images as well as shadow detection to account for overlapping buildings.

So for now, my only question for the community is whether anyone has tips and tricks for training on large datasets, or for collecting them. On my current plan, I only have 200 images to train a model on, correct? I’ve seen other users use thousands of images, and I’m sure I’ll have to do the same. How do I do that?

Or if you have any other tips that could be helpful to a beginner, all are welcome!

Thank you!

Hi @Alejandro_Bermudez

Sounds like an interesting project. Some things that I could suggest are:

  • Use Label Assist. It can significantly speed up annotation by using your trained model to pre-label other images.
    • I recommend you label a smaller batch, train a model, then use it to label your other images.
  • Sometimes, you don’t need that much data. As mentioned above, try training incrementally on growing portions of your dataset to check whether you reach your desired performance.
    • Check out this blog post from our team on how you can reduce your dataset size.
  • Invite your friends. The public plan allows you to invite up to three users so they can help you label data faster.
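
The incremental approach above can be sketched as a simple loop: train on progressively larger labeled subsets and stop once a target metric is reached. This is only a minimal sketch. `train_and_score` is a hypothetical callable standing in for whatever labeling, training, and evaluation workflow you use, and the sizes and target are illustrative assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def smallest_sufficient_size(
    sizes: Sequence[int],
    train_and_score: Callable[[int], float],
    target: float,
) -> Tuple[Optional[int], List[Tuple[int, float]]]:
    """Try increasing dataset sizes until the score reaches `target`.

    `train_and_score(n)` is a hypothetical hook: it should train on n
    labeled images and return a validation metric such as mAP.
    Returns the first sufficient size (or None) and the score history.
    """
    history: List[Tuple[int, float]] = []
    for n in sorted(sizes):
        score = train_and_score(n)
        history.append((n, score))
        if score >= target:
            return n, history
    return None, history
```

The returned history doubles as a rough learning curve: if the score stops improving between steps, more images of the same kind may not be what the model needs.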

If you’d like details or tips on specific aspects, or have any questions, feel free to follow up or create a new topic.

Thank you for the advice! I think my problem is going to need instance segmentation because I can’t use bounding boxes; I would need the exact dimensions of any given roof.

I read all the documentation you sent me, which is really cool, especially that I can use a model that’s already better trained and can be found in Roboflow Universe. But what if the model I need has to already be very good and isn’t on Roboflow Universe? For example, I think I will need instance segmentation because bounding boxes are not accurate for measuring area (more on that later), and I will likely need over 2500 pictures for a solid mAP, precision, and recall. I’ve built small Roboflow models of about 30 pictures and they perform as expected (very poorly). Any tips for this?

Determining area directly from a Roboflow model isn’t possible. However, if we use a standardized distance measurement from Google Maps, we can convert pixel counts to feet and from there get a rough estimate of the square footage.
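
A rough sketch of that conversion, assuming the segmentation output is a polygon of (x, y) pixel coordinates and that the feet-per-pixel scale is known. The `meters_per_pixel` helper uses the standard Web Mercator ground-resolution formula for 256-pixel tiles; whether that matches a given image depends on the zoom and scale it was requested at, so treat it as an assumption and verify it against a known distance.

```python
import math
from typing import Sequence, Tuple

FEET_PER_METER = 3.28084


def meters_per_pixel(lat_deg: float, zoom: int) -> float:
    """Approximate ground resolution for Web Mercator imagery.

    Assumption: 256-pixel tiles at scale 1.
    """
    return 156543.03392 * math.cos(math.radians(lat_deg)) / (2 ** zoom)


def polygon_area_px(points: Sequence[Tuple[float, float]]) -> float:
    """Area of a simple polygon in square pixels (shoelace formula)."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def roof_area_sqft(points: Sequence[Tuple[float, float]], feet_per_pixel: float) -> float:
    """Convert a pixel-space polygon to square feet.

    Area scales with the square of the linear scale.
    """
    return polygon_area_px(points) * feet_per_pixel ** 2
```

For example, `roof_area_sqft(points, meters_per_pixel(lat, zoom) * FEET_PER_METER)` gives an estimate of the roof's footprint as seen from above. It does not account for roof pitch.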

Hi @Alejandro_Bermudez

Do you mean you have an external model you’ve found that you’d like to use?

I think it might be good to look into solutions like Autodistill from our team, which uses large foundation models such as SAM, CLIP, and Grounding DINO to label data and train models.

Currently I have found no models that support my exact needs. Most available roof models are object detection, and my needs are instance segmentation.

So I’ll likely be making it myself. Instance segmentation, Label Assist, and SAM: any other tips you’d suggest? I’m going to need to create my own dataset if Google doesn’t have one whose format works well with Roboflow. I can think of a few ways to generate Google Maps static images in bulk, but that might get my API key suspended for unusual amounts of use. I’m sure there are better ways of creating datasets?
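
For reference, a minimal sketch of the bulk approach: build Static Maps request URLs over a grid of coordinates, so the requests can then be paced however the API limits require. The grid helper, step size, and default zoom are illustrative assumptions, and no requests are sent here.

```python
from typing import List, Tuple
from urllib.parse import urlencode

STATIC_MAPS_ENDPOINT = "https://maps.googleapis.com/maps/api/staticmap"


def grid_centers(
    lat0: float, lng0: float, rows: int, cols: int, step_deg: float
) -> List[Tuple[float, float]]:
    """Hypothetical helper: evenly spaced image centers starting at (lat0, lng0)."""
    return [
        (lat0 + r * step_deg, lng0 + c * step_deg)
        for r in range(rows)
        for c in range(cols)
    ]


def static_map_url(
    lat: float, lng: float, api_key: str, zoom: int = 20, size: str = "640x640"
) -> str:
    """Build one satellite-image request URL (nothing is fetched)."""
    params = {
        "center": f"{lat:.6f},{lng:.6f}",
        "zoom": zoom,
        "size": size,
        "maptype": "satellite",
        "key": api_key,
    }
    return f"{STATIC_MAPS_ENDPOINT}?{urlencode(params)}"
```

Generating the URL list up front makes it easy to count how many requests a region would take before spending any quota.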


Hi @Alejandro_Bermudez

We’re looking to improve our Universe search, but if you search for instance segmentation roof by metadata on Universe, there are a couple of datasets that might be good for your use case.

That is true, these two might be very useful:

Especially the second one. They may not be tailored to my use case, but the second one has a large dataset of properly annotated images. I won’t need one of the two classes, but the roof class would be useful. Is there a way you know of to run the model on images while only looking for a single class, rather than simply pulling that class out of the returned predictions?

I appreciate the help thus far!

Hi @Alejandro_Bermudez

Yes, there is a way to do that. You can filter the classes returned from predictions, depending on how you use your model. On the hosted inference API, you can set a classes query parameter, which takes a comma-separated list of the only classes you want returned. Learn more in the hosted inference docs.
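
As a rough illustration, here is a sketch of both options: adding the classes parameter to a request URL, and filtering a response on the client side. The base URL, model ID, and response layout (a list of prediction dictionaries with a "class" key) are assumptions for illustration. Check the hosted inference docs for the exact endpoint and response format for your model type.

```python
from typing import Any, Dict, Iterable, List
from urllib.parse import urlencode


def build_inference_url(
    base_url: str, model_id: str, version: int, api_key: str, classes: Iterable[str]
) -> str:
    """Build a request URL with a comma-separated `classes` filter.

    `base_url` and `model_id` are placeholders supplied by the caller.
    """
    query = urlencode({"api_key": api_key, "classes": ",".join(classes)})
    return f"{base_url}/{model_id}/{version}?{query}"


def keep_classes(
    predictions: List[Dict[str, Any]], wanted: Iterable[str]
) -> List[Dict[str, Any]]:
    """Client-side fallback: keep only predictions whose class is wanted.

    Assumes each prediction is a dict with a "class" key.
    """
    wanted_set = set(wanted)
    return [p for p in predictions if p.get("class") in wanted_set]
```

Filtering on the server keeps responses smaller. The client-side version is useful when the same response also feeds other steps that need every class.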