I am building a detection model that detects birds, specifically distant, small ones (the smallest being around 5x5 px). I trained one model on a dataset of 12k images: after 11 epochs it reached an F1 score of 0.35 at a confidence threshold of 0.06, and after 100 epochs an F1 score of 0.46 at 0.036. I also have a second dataset of 9k images; a model trained on it for 11 epochs reached an F1 score of 0.91 at a confidence threshold of 0.337. From what I have observed, the model trained on the 12k dataset sometimes detects birds that are much farther away than the model trained on the 9k dataset does.
Which metric should I focus on to determine which model is better for my use case?
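
For context, this is how I understand the "F1 at confidence" numbers above: detections scoring below a confidence threshold are dropped, and F1 is computed from the remaining true positives, false positives, and missed birds. Below is a minimal sketch of that calculation. The function names and the toy data are hypothetical, and it assumes each detection has already been matched (or not) to a ground-truth box, for example by an IoU check.

```python
from typing import Sequence, Tuple


def f1_at_threshold(
    confidences: Sequence[float],
    matched: Sequence[bool],
    num_ground_truth: int,
    threshold: float,
) -> float:
    """F1 score when only detections with confidence >= threshold are kept.

    matched[i] is True if detection i was matched to a ground-truth box
    (assumed to be computed beforehand).
    """
    kept = [m for c, m in zip(confidences, matched) if c >= threshold]
    tp = sum(kept)               # kept detections that hit a real bird
    fp = len(kept) - tp          # kept detections that hit nothing
    fn = num_ground_truth - tp   # birds with no kept detection
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1(
    confidences: Sequence[float],
    matched: Sequence[bool],
    num_ground_truth: int,
    thresholds: Sequence[float],
) -> Tuple[float, float]:
    """Return (best F1, threshold at which it occurs) over the given thresholds."""
    return max(
        (f1_at_threshold(confidences, matched, num_ground_truth, t), t)
        for t in thresholds
    )


# Hypothetical toy data: 5 detections, 4 ground-truth birds.
confidences = [0.9, 0.8, 0.6, 0.4, 0.2]
matched = [True, True, False, True, False]

print(best_f1(confidences, matched, 4, [0.1, 0.3, 0.5, 0.7]))
```

With this toy data, lowering the threshold from 0.5 to 0.3 recovers one more true detection, which is why the best F1 lands at the lower threshold.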