Are these metrics ever a good indicator of bias and variance? Or would they essentially always give falsely high results if the model's datasets are unrepresentative?
The metrics are reported on your validation and testing sets, so they are only as good as your dataset.
If those sets are representative of what your model will see in the wild, the metrics will be a good indicator of your model’s performance. But if they’re not representative, you’ll be flying blind, and your model may perform worse on images unlike those it has been trained on.
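
To make that concrete, here is a minimal sketch with made-up numbers (not measurements from any real model). It assumes a hypothetical classifier that does well on daylight images and poorly on night images, and a validation set that happens to contain only daylight images. The names `daylight`, `night`, `validation_set`, and `deployment_mix` are purely illustrative.

```python
def accuracy(results):
    """Fraction of correct predictions in a list of booleans."""
    return sum(results) / len(results)


# Hypothetical per-image outcomes (True = correct prediction), grouped by a
# made-up capture condition. These values are illustrative assumptions only.
daylight = [True] * 19 + [False] * 1   # 19 of 20 correct
night = [True] * 12 + [False] * 8      # 12 of 20 correct

# Assumption: the validation set contains only daylight images, while the
# model sees an even mix of both conditions once deployed.
validation_set = daylight
deployment_mix = daylight + night

val_acc = accuracy(validation_set)
wild_acc = accuracy(deployment_mix)

print(f"validation accuracy: {val_acc:.3f}")
print(f"deployment accuracy: {wild_acc:.3f}")
print(f"gap hidden by the unrepresentative set: {val_acc - wild_acc:.3f}")
```

The validation number here is not wrong, it just answers a narrower question than the one you care about. Breaking your metrics down by condition (lighting, camera, location, and so on) is one way to spot this kind of gap, provided you have examples of each condition to evaluate on.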