Hello,
so I am currently fine-tuning GPT-4o with vision. Regarding this how-to blog (How to Fine-Tune GPT-4o for Object Detection), it is not clear why this approach is used. Specifically, I do not understand the multiplication by 1024. Unless I missed it, it is explained neither in the video nor in the blog. My assumption would be that the images are internally resized to 1024x1024? Are there any sources on this?

Also, regarding the JSONL: wouldn't we have to describe the steps we are taking (normalization followed by multiplying by 1024) in order for the model to learn them, or is it able to deduce these steps on its own? This is quite unclear to me, and I am fine-tuning for my thesis. The Roboflow app is very helpful; however, I need a scientific basis for why we take these steps.
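For reference, this is the transform as I currently understand it from the blog — a minimal sketch under my own assumption that boxes start as absolute pixel coordinates and are rescaled as if the image were 1024x1024 (the function name and box layout here are mine, not from the blog):

```python
def to_1024_space(box, img_w, img_h):
    """Normalize an absolute-pixel box (x_min, y_min, x_max, y_max)
    to [0, 1] relative to the image size, then scale by 1024 —
    i.e. express the box as if the image were 1024x1024."""
    x_min, y_min, x_max, y_max = box
    return (
        round(x_min / img_w * 1024),
        round(y_min / img_h * 1024),
        round(x_max / img_w * 1024),
        round(y_max / img_h * 1024),
    )

# Example: a centered box in a 1920x1080 image
print(to_1024_space((960, 540, 1440, 810), 1920, 1080))
# → (512, 512, 768, 768)
```

Is this interpretation correct, and if so, is the 1024x1024 target documented anywhere officially?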