After Merging Datasets, Re-balancing (Train/Val/Test) excludes multiple classes in VAL split

After merging several object datasets (already uploaded to Roboflow) into one, I am unable to get a reasonably balanced split between classes when generating a new version. Similar topics suggest manually moving more images of a class into the “VAL” split, but that is not feasible with ~17,500 total images.

Currently, I have to export the entire “Merged” dataset, properly shuffle and split (e.g. 75/20/5) all objects into new Train/Val/Test splits, THEN upload the properly balanced dataset to Roboflow AGAIN while keeping the existing splits. I have no idea what happens when you add augmentations, or if for some reason I need to adjust the balance between Train/Val/Test.

I previously asked about the same issue and don’t believe I ever received any further update after this response:

Please see below export stats and screenshots:

Train Object Stats:
5347 handgun
4648 rifle
4748 knife
1437 hammer

Val Object Stats:
0 handgun
0 rifle
995 knife
2587 hammer
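
To put numbers on the imbalance, here is a small sketch that computes what share of each class's objects ended up in Val, using only the Train and Val counts listed above (no Test counts are given, so they are left out). The function name `val_share` is just illustrative. With these counts, handgun and rifle come out at 0%, knife at about 17%, and hammer at about 64%.

```python
# Object counts copied from the export stats above.
train = {"handgun": 5347, "rifle": 4648, "knife": 4748, "hammer": 1437}
val = {"handgun": 0, "rifle": 0, "knife": 995, "hammer": 2587}


def val_share(train_counts, val_counts):
    """Return, per class, the fraction of Train+Val objects that sit in Val."""
    shares = {}
    for cls, n_train in train_counts.items():
        n_val = val_counts.get(cls, 0)
        total = n_train + n_val
        shares[cls] = n_val / total if total else 0.0
    return shares


shares = val_share(train, val)
for cls, share in shares.items():
    print(f"{cls:8s} {share:6.1%} of Train+Val objects are in Val")
```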



Thanks for the note here.

If understood correctly, the desired outcome is a merged dataset with classes randomly shuffled among training, validation, and testing. Is that correct?

Yes, almost. There is a bunch of terminology for what we are trying to ensure, but generally speaking we want a random “sampling” of each class so that objects are balanced across the respective Train, Val, and Test splits. Each split would then be shuffled so there is no perceived order among classes.

E.G.

5 classes of objects in a merged Roboflow dataset, with 100 images per class, totaling 500 images.

Tiger - 100 images
Lion - 100 images
Elephant - 100 images
Zebra - 100 images
Monkey - 100 images

Seek to split 500 images into Train/Val/Test with 80/15/5 ratio.

OUTCOME WOULD BE:

Train:
Tiger 80
Lion 80
Elephant 80
Zebra 80
Monkey 80

Val:
Tiger 15
Lion 15
Elephant 15
Zebra 15
Monkey 15

Test:
Tiger 5
Lion 5
Elephant 5
Zebra 5
Monkey 5

In each of those balanced splits, all of the images would be shuffled.
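
For anyone doing this locally in the meantime, the outcome described above is usually called a stratified split. Below is a minimal sketch in plain Python, assuming each image belongs to exactly one class (as in the animal example). The function name `stratified_split`, the file names, and the `labels` mapping are made up for illustration and are not part of Roboflow. For detection datasets where one image contains several classes, per-image stratification like this would only approximate a per-object balance.

```python
import random
from collections import Counter


def stratified_split(labels, ratios=(0.80, 0.15, 0.05), seed=0):
    """Split image ids into train/val/test so each class keeps the same ratio.

    labels: dict mapping image id -> class name (assumes one class per image).
    """
    rng = random.Random(seed)

    # Group image ids by class.
    by_class = {}
    for image_id, cls in labels.items():
        by_class.setdefault(cls, []).append(image_id)

    splits = {"train": [], "val": [], "test": []}
    for cls, ids in by_class.items():
        ids = sorted(ids)          # fixed starting order for reproducibility
        rng.shuffle(ids)           # random sample within the class
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test

    # Shuffle each split so classes are not grouped together.
    for ids in splits.values():
        rng.shuffle(ids)
    return splits


# Hypothetical data matching the example: 5 classes x 100 images.
animals = ["tiger", "lion", "elephant", "zebra", "monkey"]
labels = {f"{a}_{i:03d}.jpg": a for a in animals for i in range(100)}

splits = stratified_split(labels)
for name, ids in splits.items():
    print(name, dict(Counter(labels[i] for i in ids)))
```

Rounding means small classes may not hit the ratio exactly; any leftover images fall into the test split.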

Hope this makes sense. As you can see from the exported stats above, the splits are not balanced at all, and the Val split even excludes entire object classes.

Thanks.

Any update?

You can accomplish this by uploading each class separately so you can choose the split per class. (Or, if you upload them already organized in train, valid, and test folders, we’ll parse those from the file path and let you keep those choices.)
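
As a rough sketch of the folder option, the snippet below copies files into `train`, `valid`, and `test` folders before upload. The folder names come from the reply above; everything else is an assumption for illustration: the helper name `write_split_folders`, the example file names, and the idea that each image has an annotation file with the same stem sitting next to it. How the folders are parsed on upload is as described in the reply, not something this code verifies.

```python
import shutil
import tempfile
from pathlib import Path


def write_split_folders(splits, src_dir, out_dir):
    """Copy each image (and any sibling file with the same stem) into
    out_dir/<split>/. Returns the number of files copied.

    splits: dict like {"train": [...], "valid": [...], "test": [...]}
            holding image file names found in src_dir.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)

    # Index source files by stem so image + annotation travel together.
    by_stem = {}
    for f in src_dir.iterdir():
        if f.is_file():
            by_stem.setdefault(f.stem, []).append(f)

    copied = 0
    for split_name, filenames in splits.items():
        dest = out_dir / split_name
        dest.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            for f in by_stem.get(Path(name).stem, []):
                shutil.copy2(f, dest / f.name)
                copied += 1
    return copied


# Tiny demo in a temporary directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    src, out = Path(tmp) / "export", Path(tmp) / "balanced"
    src.mkdir()
    for stem in ("a", "b"):
        (src / f"{stem}.jpg").write_bytes(b"")   # placeholder image
        (src / f"{stem}.txt").write_text("")     # placeholder annotation

    demo_splits = {"train": ["a.jpg"], "valid": ["b.jpg"], "test": []}
    copied = write_split_folders(demo_splits, src, out)
    layout = sorted(p.relative_to(out).as_posix()
                    for p in out.rglob("*") if p.is_file())
    print(copied, layout)
```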

Thanks Brad. We generally upload each class to Roboflow by itself and put everything into train so we can re-balance later upon export, depending on the split we need. It’s pretty cumbersome to export already-uploaded datasets to a local machine, shuffle-split, reload them to Roboflow, merge, and finally generate a new export, don’t you think? It kind of makes your “re-balance” feature in Generate obsolete. Any chance you can incorporate a “shuffle-sample” feature into re-balance dataset? Either automatically behind the scenes, which I think everyone would want, or as a button or slider that balances classes equally when choosing the re-balance function?

Please advise and thanks.
Sean

I agree. Will this be fixed? The train/val split feature is somewhat obsolete right now because the resulting splits are so imbalanced. Also, is there a way to re-split images after labeling is finished?

As far as I can tell, they haven’t done anything to address this. Really not sure why the rebalancing wouldn’t be random by default; seems like a silly design choice.