What method does Roboflow use to perform the train, valid, and test split on a dataset?

I can't find any mention of what method Roboflow uses to perform the train, valid, and test split on a dataset. Does anyone have an idea?

Hi,

Your train/valid/test split is set when assigning the images for labeling during the initial dataset upload (you can toggle the settings to set your split during upload).

You can also change your split when generating a new dataset version. Just note that augmenting your images will rebalance the train/valid/test split, as augmented images will be generated in your training set.
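To make the rebalancing concrete, here is a minimal sketch of the arithmetic. It assumes each training image produces a fixed number of output images and that valid and test images are left untouched; the `multiplier` parameter and the function name are hypothetical and not Roboflow setting names.

```python
def effective_split(train, valid, test, multiplier):
    """Return the (train, valid, test) percentages after augmentation.

    Assumption: only the training set grows, and every training image
    yields `multiplier` images in the generated version.
    """
    augmented_train = train * multiplier
    total = augmented_train + valid + test
    return tuple(round(100 * n / total, 1) for n in (augmented_train, valid, test))


# 100 source images split 70/20/10, with 3 outputs per training image
print(effective_split(70, 20, 10, multiplier=3))
```

Under these assumptions the 70 training images become 210 out of 240 total, so the share held by train goes up while the number of valid and test images stays the same.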

I mean, if we split the dataset manually, there are methods such as a random split or a stratified split. What about Roboflow? What method does it use to split the dataset? How do I know whether it splits randomly, so that no bias is introduced?

I can say that our system is written to split randomly.

Where did you guys write this statement? I did search for it, though maybe not thoroughly enough.

I’m not sure that it’s written explicitly, but I can attest that it is randomized in our system and duplicate image uploads are also left out to reduce the chances of train/test bleed.
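As an illustration of why dropping duplicates helps, here is a minimal sketch that removes byte-identical images before any split is made, so the same image cannot land in both train and test. How Roboflow actually detects duplicates is not stated in this thread; hashing the file contents with SHA-256 is an assumption, and `drop_exact_duplicates` is a hypothetical helper.

```python
import hashlib


def drop_exact_duplicates(images):
    """Keep the first occurrence of each distinct file content.

    `images` maps a file name to its raw bytes. Two files count as
    duplicates only if their bytes are identical (an assumption).
    """
    seen_digests = set()
    unique = {}
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            continue  # same content already kept under another name
        seen_digests.add(digest)
        unique[name] = data
    return unique


uploads = {"cat.jpg": b"\x01\x02", "dog.jpg": b"\x03\x04", "cat_copy.jpg": b"\x01\x02"}
print(list(drop_exact_duplicates(uploads)))
```

Note that this only catches exact copies; resized or re-encoded versions of the same picture would need a perceptual comparison instead.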


Hi, after talking to the team and digging deeper to confirm how the system works, this is what I learned:

Which images we move is still random, but we try to do it deterministically, so that if you go from 70/20/10 to 70/10/20 and then back to 70/20/10, you end up with the same 70/20/10 split you started with.
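One common way to get a split that is random yet reproducible is to give every image a stable pseudo-random score derived from a hash of its ID, sort by that score, and cut the sorted list at the requested ratios. The sketch below shows that idea only; it is an assumption about how such behavior could be achieved, not Roboflow's actual implementation, and `stable_score`, `split`, and the seed string are hypothetical names.

```python
import hashlib


def stable_score(image_id, seed="demo-seed"):
    """Map an image ID to a repeatable number in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{image_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def split(image_ids, train=0.7, valid=0.2):
    """Cut the score-ordered images into train/valid/test.

    Test receives whatever remains after train and valid are taken.
    """
    ordered = sorted(image_ids, key=stable_score)
    n_train = round(len(ordered) * train)
    n_valid = round(len(ordered) * valid)
    return {
        "train": ordered[:n_train],
        "valid": ordered[n_train:n_train + n_valid],
        "test": ordered[n_train + n_valid:],
    }


ids = [f"img_{i}.jpg" for i in range(100)]
first = split(ids, train=0.7, valid=0.2)   # 70/20/10
other = split(ids, train=0.7, valid=0.1)   # 70/10/20
again = split(ids, train=0.7, valid=0.2)   # back to 70/20/10
print(first == again)
```

Because the ordering depends only on the image IDs and the seed, changing the ratios and changing them back reproduces the original assignment, and the upload order does not matter.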

Notably, you can set your own train test split for specific batches of images on upload too: How to Create a Train Test Split