Preprocessing can change the number of images and classes per version. In other words, filtering by a tag can drop images entirely, and modifying classes can drop a class entirely. When the train/test split is performed before preprocessing, this can shift the balance fairly dramatically. For example, what started as a 15% validation split may end up at only 7% if the dropped images happen to come disproportionately from the validation bucket. The same could happen in the other direction.
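To make the arithmetic concrete, here is a minimal sketch with made-up numbers (1,000 images split 85/15, with a filter that happens to hit the validation bucket harder). The function name and all counts are hypothetical, chosen only to reproduce the 15% to 7% shift described above.

```python
def validation_share(train_count: int, valid_count: int) -> float:
    """Fraction of the images in both splits that sit in the validation split."""
    total = train_count + valid_count
    return valid_count / total if total else 0.0


# Hypothetical dataset: 1,000 images split 85/15 before any filtering.
train_before, valid_before = 850, 150

# Assumption for illustration: a tag filter drops 53 training images
# but 90 validation images, purely by chance.
train_after = train_before - 53   # 797 images remain
valid_after = valid_before - 90   # 60 images remain

share_before = validation_share(train_before, valid_before)
share_after = validation_share(train_after, valid_after)
print(f"validation share before: {share_before:.1%}, after: {share_after:.1%}")
```

Nothing about the split setting changed here. Only the filter did, and the validation share roughly halved.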
Compounding this, individual classes can end up skewed across the splits, both because some classes simply have more items and because tag or class filters may, by chance, drop more images of a given class from one split than from another. Even if the split continues to occur before preprocessing, I think it would be better to split by class than by sheer number of images.
As an example, one dataset I have started at ~7,300 images. After preprocessing, ~3,000 images were dropped, leaving ~4,300. The original validation split was set to 14%, and after preprocessing, I had 10% of images in the validation set. The table below gives a better breakdown, though. Some classes, by virtue of their quantity, have well over 10%. Others have nearly nothing (and did have nothing in a previous release, before I increased the Train/Test split). I don’t feel like this makes a very good validation set and would rather Roboflow aim to get X% of each class, rather than X% of all images.
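For what it's worth, here is a rough sketch of what "X% of each class" could look like. This is not how Roboflow splits data; `split_by_class` and its inputs are hypothetical. Since an image can contain several classes, the sketch assumes each image is grouped under its rarest class, which is only one of several possible ways to handle multi-class images.

```python
import random
from collections import Counter, defaultdict


def split_by_class(image_classes, valid_fraction, seed=0):
    """Split image ids into (train, valid), taking roughly valid_fraction per class.

    image_classes: dict mapping image id (str) -> set of class names in that image.
    Assumption: each image is grouped under its rarest class.
    """
    rng = random.Random(seed)
    # How many images contain each class across the whole dataset.
    counts = Counter(c for classes in image_classes.values() for c in classes)

    groups = defaultdict(list)
    for image_id, classes in image_classes.items():
        # Images without annotations go into their own group (key None).
        key = min(classes, key=lambda c: (counts[c], c)) if classes else None
        groups[key].append(image_id)

    train, valid = [], []
    for key in sorted(groups, key=str):
        ids = sorted(groups[key])
        rng.shuffle(ids)
        n_valid = round(len(ids) * valid_fraction)
        if len(ids) > 1:
            # Keep at least one image on each side for every group.
            n_valid = min(max(n_valid, 1), len(ids) - 1)
        valid.extend(ids[:n_valid])
        train.extend(ids[n_valid:])
    return train, valid


# Hypothetical dataset: 100 images of a common class, 10 of a rare one.
data = {f"common_{i}": {"common"} for i in range(100)}
data.update({f"rare_{i}": {"rare"} for i in range(10)})
train_ids, valid_ids = split_by_class(data, valid_fraction=0.10)
```

With a purely random 10% sample, the rare class could easily end up with zero validation images. Here each group contributes its own share.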
Thanks for bringing up this issue! I’ve escalated this feature request to our team.
It sounds like either splitting by classes or allowing rebalancing after preprocessing would address this problem. If we were to allow splitting by classes, where would you expect to set this?
It sounds like either splitting by classes or allowing rebalancing after preprocessing would address this problem
I think both. Splitting by classes would make for better test/validation sets either way. Rebalancing (or just the initial balancing) after preprocessing would ensure that the percentages assigned to each split remain more consistent. In other words, swap steps 2 and 3, with the numbers in 3 (now Train/Test Split) reflecting any filters applied in 2.
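As a small illustration of why the order matters, here is a sketch of "filter first, then split". The function, the image records, and the tag names are all hypothetical. The point is only that the percentage is applied to whatever survives the filter.

```python
import random


def filter_then_split(images, keep, valid_fraction, seed=0):
    """Apply the filter first, then take valid_fraction of the survivors.

    images: list of image records (any type).
    keep: predicate returning True for images that survive preprocessing.
    Returns (train, valid).
    """
    kept = [img for img in images if keep(img)]
    rng = random.Random(seed)
    rng.shuffle(kept)
    n_valid = round(len(kept) * valid_fraction)
    return kept[n_valid:], kept[:n_valid]


# Hypothetical records: half the images carry the tag "day", half "night".
images = [{"id": i, "tag": "day" if i % 2 == 0 else "night"} for i in range(200)]

train, valid = filter_then_split(images, lambda img: img["tag"] == "day", 0.15)
print(len(train), len(valid))  # sizes are relative to the 100 surviving images
```

Because the split sees only the filtered images, the validation share stays at the requested 15% regardless of how many images the filter removes.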
If we were to allow splitting by classes, where would you expect to set this?
My first thought would be to add a checkbox, checked by default, to the Train/Test Split step that says something like “Split by Class”. Unchecking it would revert to just selecting a random X% of the images, not taking class into account. There might still be a use case for sampling by images if you want a truly random test group, thus the option. A truly random sample might be more representative of whatever dataset someone is working on, but I don’t think it makes for a good validation set when you want at least some representatives of even low-count classes. I’d have to defer to someone with more experience training models than me. If there isn’t a reason to keep that as an option, then maybe nothing has to change except the back-end behavior. Users would still see the same rebalance slider, only it would take classes into account now.
I have a similar issue. I am a paying customer, and am trying to train models on Roboflow with a dataset that is similarly strongly skewed, with classes landing almost fully in either the training or the validation set. In my case, the rebalancing (for whatever reason) decides to create a validation set that is almost completely one class. And that class is one of the less common classes (of the 7 total classes I am using). So I am training a model, then evaluating it on images that exclusively contain a class the model has never even seen. This means that my mAP is very low, and I am not sure that it creates a model I can even use. I want to train a model to use for autolabeling, but if this broken, unchangeable rebalancing doesn’t allow me to train a model for autolabeling, I won’t be able to continue to use Roboflow as my company scales, which means I may have to switch to a different platform altogether. I really like Roboflow in general, but this is a major obstacle to scaling my company’s labeling efforts.
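A quick way to confirm this kind of skew on an exported dataset is to count classes per split. The sketch below assumes you have already parsed the annotations into per-image lists of class names; the function name and the example labels are hypothetical, and no Roboflow API is involved.

```python
from collections import Counter


def class_coverage(train_labels, valid_labels):
    """Count class occurrences per split and flag classes missing from training.

    Each argument is a list with one entry per image: the list of class names
    annotated in that image.
    Returns (report, unseen), where unseen lists classes with zero training examples.
    """
    train_counts = Counter(c for labels in train_labels for c in labels)
    valid_counts = Counter(c for labels in valid_labels for c in labels)

    report = {}
    for cls in sorted(set(train_counts) | set(valid_counts)):
        t, v = train_counts[cls], valid_counts[cls]
        report[cls] = {"train": t, "valid": v, "valid_share": v / (t + v)}

    unseen = [cls for cls, row in report.items() if row["train"] == 0]
    return report, unseen


# Hypothetical labels: class "c" appears only in the validation split.
report, unseen = class_coverage(
    train_labels=[["a"], ["a", "b"]],
    valid_labels=[["c"], ["a"]],
)
for cls, row in report.items():
    print(cls, row)
print("classes never seen in training:", unseen)
```

Any class listed as unseen will drag mAP down no matter how well the model learns the rest, which matches the symptom described above.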