How do I output the class label distribution if I export my dataset in YOLOv5 PyTorch format?

I want to output the class label distribution for each of the train/validation/test sets. But with the YOLOv5 PyTorch format, I have no idea how to code it.

Hi @mheadhero, you’d need to write a script to do this count. I probably wouldn’t use the YOLOv5 PyTorch format because then you have to deal with mapping numeric class identifiers back to strings via the labelmap.
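If you do want to stay with the YOLOv5 PyTorch export, here is a rough sketch of what that mapping step could look like. It assumes each label line starts with the numeric class ID (the usual `class x_center y_center width height` layout) and that you have copied the class names out of the export into a plain-text file, one name per line, with line 1 being class 0. The function name `count_yolo_classes` and both paths are hypothetical, not part of any Roboflow tooling.

```shell
# Sketch: count annotations per class for a YOLOv5 PyTorch label directory.
# $1 = directory of label .txt files (hypothetical path, e.g. train/labels)
# $2 = text file with one class name per line, line 1 = class 0 (an assumption;
#      you would build this yourself from the export's class list)
count_yolo_classes() {
  label_dir="$1"
  names_file="$2"
  # Concatenate every label file, then tally the first column (the class ID).
  find "$label_dir" -maxdepth 1 -type f -name '*.txt' -exec cat {} + |
    awk -v names="$names_file" '
      # Load the names file into an array indexed from 0.
      BEGIN { n = 0; while ((getline line < names) > 0) name[n++] = line }
      # Only count lines that look like a full annotation.
      NF >= 5 { count[$1]++ }
      END {
        for (id in count) {
          # Fall back to "id_<n>" when the names file has no entry for an ID.
          label = (id in name) ? name[id] : ("id_" id)
          print count[id], label
        }
      }' |
    # Largest count first, ties broken by class name.
    sort -k1,1nr -k2,2
}
```

You would call it as something like `count_yolo_classes train/labels names.txt`, and it prints one `count name` line per class.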

The YOLOv5 Oriented Bounding Boxes format is similar but includes the labels in the annotations directly so it makes this task easier.

Here’s an example of doing it using bash with the playing cards dataset from Roboflow Universe.

From the Playing Cards.v1-v1.yolov5-obb/train/labelTxt directory, running this command:

ls | grep '\.txt$' | xargs cat | cut -d" " -f9 | sort | uniq -c | sort -nr

Does the following:

  1. List all the files
  2. Find all the .txt files
  3. Output their contents
  4. Split by space & grab the 9th column (the class name)
  5. Sort them so all the same classes are next to each other
  6. Collapse each run of identical lines into one line prefixed with its count
  7. Sort the counts in descending order so the class with the most examples comes first

So for the train set this gives me the following output:

1374 QS
1367 5C
1335 4D
1329 7S
1320 4C
1308 6S
1296 4H
1295 9D
1286 5D
1281 QD
1278 3C
1274 5S
1271 10S
1269 10H
1268 AD
1266 9H
1262 JC
1261 KS
1261 3S
1260 7H
1259 KD
1257 6D
1253 AS
1232 7C
1227 AC
1225 10D
1221 8D
1217 2D
1215 QC
1214 2S
1211 KH
1203 KC
1196 QH
1188 8H
1182 2H
1180 3H
1174 9S
1173 4S
1166 AH
1162 10C
1157 JH
1149 2C
1145 8C
1143 JS
1143 7D
1140 5H
1134 3D
1121 6H
1121 6C
1112 JD
1108 9C
1091 8S

Meaning there are 1374 Queen of Spades annotations, 1367 Five of Clubs annotations, etc.

You can do this in the valid and test label directories as well to get counts for those.
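To avoid changing directories by hand, the same pipeline could be wrapped in a loop over the splits. This is only a sketch: it assumes the valid and test folders use the same `labelTxt` subdirectory as train, and the function name `count_split_classes` and the dataset root argument are made up for illustration.

```shell
# Sketch: print per-class annotation counts for each split of a YOLOv5 OBB export.
# $1 = dataset root (hypothetical path) containing train/, valid/ and test/,
#      each assumed to hold a labelTxt directory like the train set above.
count_split_classes() {
  dataset_root="$1"
  for split in train valid test; do
    dir="$dataset_root/$split/labelTxt"
    # Skip splits that are not present in this export.
    [ -d "$dir" ] || continue
    echo "== $split =="
    # Same idea as the one-liner: take the 9th column (class name) and count it.
    find "$dir" -maxdepth 1 -type f -name '*.txt' -exec cat {} + |
      cut -d" " -f9 | sort | uniq -c | sort -nr
  done
}
```

Running `count_split_classes "Playing Cards.v1-v1.yolov5-obb"` from the parent directory would then print one block of counts per split.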
