After merging several object datasets (already uploaded to Roboflow) into one, I am unable to get a fairly balanced split between classes when generating a new version. I reviewed similar topics, but the suggestion to manually move more images from one class into the “VAL” split is not feasible with ~17,500 total images.
Currently, I have to export the entire “Merged” dataset, properly shuffle and split (e.g. 75/20/5) all objects into new Train/Val/Test splits… THEN upload the properly balanced dataset to Roboflow AGAIN while keeping the existing splits. I have no idea what happens when you add augmentations, or if for some reason I needed to adjust the balance between Train/Val/Test.
I previously asked about the same issue and don’t believe I ever received any further update after this response:
If understood correctly, the desired outcome is a merged dataset with classes randomly shuffled among training, validation, and testing. Is that correct?
Yes, almost. There is a bunch of terminology for what we are trying to ensure, but generally speaking we want a random “sampling” of each class so that the objects are balanced across the respective Train, Val, and Test splits. Those splits would then be shuffled so there is no perceived order among classes.
E.G.
5 classes of objects in a merged Roboflow dataset - each with 100 images per class totaling 500 total images.
In each of those balanced splits, all of the images would be shuffled.
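To make the example concrete, here is a rough Python sketch of the kind of per-class "shuffle-sample" split I mean. It is not anything Roboflow provides: `stratified_split` and the `labels` mapping are hypothetical names, and it assumes each image can be assigned to exactly one class, which is a simplification for object detection images that contain several classes.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.75, 0.20, 0.05), seed=0):
    """Split images so each class is sampled into train/valid/test by the same ratios.

    labels: dict mapping image name -> class name (assumes one class per image).
    """
    rng = random.Random(seed)

    # Group image names by class.
    by_class = defaultdict(list)
    for image, cls in labels.items():
        by_class[cls].append(image)

    splits = {"train": [], "valid": [], "test": []}
    for cls in sorted(by_class):
        images = sorted(by_class[cls])
        rng.shuffle(images)  # random sampling within the class
        n_train = round(len(images) * ratios[0])
        n_valid = round(len(images) * ratios[1])
        splits["train"] += images[:n_train]
        splits["valid"] += images[n_train:n_train + n_valid]
        splits["test"] += images[n_train + n_valid:]  # remainder goes to test

    # Shuffle each split so there is no class ordering left.
    for name in splits:
        rng.shuffle(splits[name])
    return splits

# The example above: 5 classes, 100 images each, 500 images total.
labels = {f"class{c}_{i:03d}.jpg": f"class{c}" for c in range(5) for i in range(100)}
splits = stratified_split(labels)
print({name: len(images) for name, images in splits.items()})
# {'train': 375, 'valid': 100, 'test': 25}
```

With a 75/20/5 split, every class ends up with 75 train, 20 valid, and 5 test images.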
Hope this makes sense. As you can see from the exported stats above, the splits are not balanced at all and even exclude entire object classes in the Val split.
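For anyone who wants to check their own export the same way, a minimal sketch of a per-split class tally is below. The function name `split_report` and the toy data are made up for illustration, and it again assumes one class label per image.

```python
from collections import Counter

def split_report(splits, labels):
    """Count images per class in each split and list classes missing from a split.

    splits: dict mapping split name -> list of image names.
    labels: dict mapping image name -> class name (assumes one class per image).
    """
    all_classes = set(labels.values())
    counts = {
        name: Counter(labels[image] for image in images)
        for name, images in splits.items()
    }
    missing = {name: sorted(all_classes - set(c)) for name, c in counts.items()}
    return counts, missing

# Hypothetical toy data: "dog" never makes it into the valid split.
labels = {"a1.jpg": "cat", "a2.jpg": "cat", "a3.jpg": "cat",
          "b1.jpg": "dog", "b2.jpg": "dog"}
splits = {"train": ["a1.jpg", "a2.jpg", "b1.jpg", "b2.jpg"],
          "valid": ["a3.jpg"],
          "test": []}

counts, missing = split_report(splits, labels)
for name in splits:
    print(name, dict(counts[name]), "missing:", missing[name])
```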
You can accomplish this by uploading each class separately so you can choose the split per class. (Or if you upload them already organized in train, valid, and test folders, we’ll parse those from the file path and let you keep those choices.)
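If you go the folder route, a rough sketch of arranging a local export into split folders before uploading might look like this. The folder names `train`, `valid`, and `test` follow the description above; the helper `copy_into_split_folders` and the file names are hypothetical, and annotation files would need to be copied alongside the images in the same way.

```python
import shutil
import tempfile
from pathlib import Path

def copy_into_split_folders(splits, source_dir, output_dir):
    """Copy each image into output_dir/<split>/ so the split is part of the file path.

    splits: dict mapping split name -> list of image file names found in source_dir.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    copied = []
    for split_name, image_names in splits.items():
        target = output_dir / split_name
        target.mkdir(parents=True, exist_ok=True)
        for name in image_names:
            dest = target / name
            shutil.copy2(source_dir / name, dest)  # copy, leaving the export untouched
            copied.append(dest)
    return copied

# Small demo using a temporary directory and empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "export"
    src.mkdir()
    for name in ("a.jpg", "b.jpg", "c.jpg"):
        (src / name).write_bytes(b"")

    demo_splits = {"train": ["a.jpg"], "valid": ["b.jpg"], "test": ["c.jpg"]}
    upload_root = Path(tmp) / "upload"
    copied = copy_into_split_folders(demo_splits, src, upload_root)
    relative = sorted(p.relative_to(upload_root).as_posix() for p in copied)
    print(relative)
```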
Thanks Brad. We generally upload each class to Roboflow by itself and put everything into train so we can re-balance later upon export, depending on the split we need. It is pretty cumbersome to export already uploaded datasets to a local machine, shuffle-split, reload back to Roboflow, merge, and finally generate a new export… don’t you think? It kinda makes your “re-balance” feature in Generate obsolete. Any chance you can incorporate a “shuffle-sample” feature into re-balance dataset? Either automatically behind the scenes, which I think everyone would want, or as a button or slider to equally balance classes when choosing the re-balance function?
I agree - will this be fixed? The train val split feature is somewhat obsolete right now due to it being so imbalanced. Also, is there a way to re-split images after labeling is finished?
As far as I can tell, they haven’t done anything to address this. Really not sure why the rebalancing wouldn’t be random by default; seems like a silly design choice.