I have found resources for running zero-shot inference with Grounding DINO on most objects.
How do I fine-tune this model? For instance, I want to use Grounding DINO to look at an image of a shelf, count how many racks it contains, and identify which products are on each rack. Out of the box, Grounding DINO either has not been trained to recognize these specific products (clothes, toys, individual SKUs, etc.) or performs poorly on them in most cases.
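To make the shelf task concrete, here is a minimal sketch of the post-processing step I have in mind once detections are available. It assumes (hypothetically) that each detection is a dict with a `label` string and a `box` in `(x1, y1, x2, y2)` pixel coordinates, and that a product belongs to the first rack whose box contains the product's center. The function names and the `"rack"` label are illustrative, not part of Grounding DINO.

```python
def box_center(box):
    """Return the (x, y) center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def contains_point(box, point):
    """True if the point lies inside the box (edges inclusive)."""
    x1, y1, x2, y2 = box
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def products_per_rack(detections, rack_label="rack"):
    """Group product labels by the index of the rack that contains them.

    Assumption: detections is a list of {"label": str, "box": (x1, y1, x2, y2)}.
    Products whose center falls in no rack are ignored.
    """
    racks = [d for d in detections if d["label"] == rack_label]
    products = [d for d in detections if d["label"] != rack_label]
    grouped = {i: [] for i in range(len(racks))}
    for product in products:
        center = box_center(product["box"])
        for i, rack in enumerate(racks):
            if contains_point(rack["box"], center):
                grouped[i].append(product["label"])
                break  # assign each product to at most one rack
    return grouped


# Hypothetical detections for a shelf with two racks.
example_detections = [
    {"label": "rack", "box": (0, 0, 100, 50)},
    {"label": "rack", "box": (0, 50, 100, 100)},
    {"label": "toy", "box": (10, 10, 20, 20)},
    {"label": "shirt", "box": (30, 10, 50, 40)},
    {"label": "toy", "box": (10, 60, 20, 80)},
]
example_result = products_per_rack(example_detections)
```

The number of racks is then `len(example_result)`, and each entry lists the products found on that rack. This only works if the fine-tuned model detects both racks and products reliably, which is the part I need help with.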
I am looking for a script to fine-tune the base Grounding DINO model so that it can recognize these specific objects. I can provide a labelled dataset in whatever format the training requires.
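As an illustration of the kind of labelled data I could supply, the sketch below converts per-image box labels into a COCO-style annotation dictionary. The COCO layout is only an assumption on my part; the input structure (`file_name`, `width`, `height`, `boxes`) and the function name `to_coco` are hypothetical, and I can switch to another format if the fine-tuning script needs one.

```python
import json


def to_coco(samples, categories):
    """Build a COCO-style dict from simple per-image labels.

    Assumption: each sample is
      {"file_name": str, "width": int, "height": int,
       "boxes": [(label, (x1, y1, x2, y2)), ...]}
    COCO stores boxes as [x, y, width, height], so corners are converted.
    """
    cat_ids = {name: i + 1 for i, name in enumerate(categories)}
    images, annotations = [], []
    for image_id, sample in enumerate(samples, start=1):
        images.append({
            "id": image_id,
            "file_name": sample["file_name"],
            "width": sample["width"],
            "height": sample["height"],
        })
        for label, (x1, y1, x2, y2) in sample["boxes"]:
            w, h = x2 - x1, y2 - y1
            annotations.append({
                "id": len(annotations) + 1,  # unique across the dataset
                "image_id": image_id,
                "category_id": cat_ids[label],
                "bbox": [x1, y1, w, h],
                "area": w * h,
                "iscrowd": 0,
            })
    return {
        "images": images,
        "annotations": annotations,
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }


# Hypothetical example: one shelf image with a rack and a toy on it.
example_samples = [{
    "file_name": "shelf_001.jpg",
    "width": 640,
    "height": 480,
    "boxes": [("rack", (0, 0, 640, 240)), ("toy", (100, 50, 180, 200))],
}]
example_coco = to_coco(example_samples, ["rack", "toy"])
example_json = json.dumps(example_coco, indent=2)
```

Writing `example_json` to a file would give one annotation file per split (train, validation).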