Strategy for fine-tuning on a large dataset

Hello everyone, I am struggling with training my car detection model.

The dataset contains about 60 crossroads (many with fewer than 10 pictures), with cars labeled in COCO format. Initially there were 11 classes, but in the new annotations we added 2 extra classes, so my training is split in half (one part with the old annotations, one with the new).
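As a side note, when the class list grows mid-project like this (11 to 13 classes), it is worth verifying that the `categories` sections of the old and new COCO annotation files agree on the shared IDs, since a silently remapped ID will corrupt training labels. A minimal sketch of such a check (the class names here are hypothetical; in practice you would load the dicts with `json.load`):

```python
def diff_categories(old_coco, new_coco):
    """Return (added, removed, renamed) category IDs between two COCO dicts."""
    old = {c["id"]: c["name"] for c in old_coco["categories"]}
    new = {c["id"]: c["name"] for c in new_coco["categories"]}
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # IDs present in both files but whose name changed (a silent remap).
    renamed = sorted(i for i in set(old) & set(new) if old[i] != new[i])
    return added, removed, renamed

# Toy example with hypothetical class names:
old_ann = {"categories": [{"id": 1, "name": "car"}, {"id": 2, "name": "bus"}]}
new_ann = {"categories": [{"id": 1, "name": "car"}, {"id": 2, "name": "bus"},
                          {"id": 3, "name": "truck"}]}
print(diff_categories(old_ann, new_ann))  # ([3], [], [])
```

If `renamed` is non-empty, the two halves of the dataset are not label-compatible and should be remapped before any joint training.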

First, I trained an RFDETRMedium model on half of my dataset (full dataset: 32k images, half: 16k), starting from the stock pretrained weights. I used the default parameters (lr=1e-4, resolution 576x576, etc.).
I trained in two stages: first with 11 classes (150 epochs), then after adding the 2 extra classes (~100 epochs).
It trained well: mAP50 ≈ 0.87 on 13 classes, train loss ≈ 3.87, test loss ≈ 4.38.
I trained this part on an A100 GPU (batch_size=16, grad_accum_steps=1).

mAP50 graphic - https://ibb.co/mFJJGGX1

Then I wanted to train on the full dataset, so I saved the images/labels from my validation set into a new dataset and added the second half of the full dataset. I set these parameters

import os
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '5678'
os.environ['WORLD_SIZE'] = '1'
os.environ['RANK'] = '0'
os.environ['LOCAL_RANK'] = '0'

from rfdetr import RFDETRMedium

model = RFDETRMedium(pretrain_weights='inference/checkpoint_best_total.pth')

model.train(
    dataset_dir='data/new_car_ds',
    epochs=50,
    batch_size=4,
    resolution=768,
    grad_accum_steps=4,
    lr=2e-5,
    output_dir='finetune_04.03.2026',
    checkpoint_interval=1,
)

and the best checkpoint from my last training. For this part of training I used an A30 GPU. But now the metrics are not as good.

mAP50 graphic:

Test loss graphic - hosted at ImgBB

So I trained my model for 23 epochs, and it doesn't learn like before. I also tried training with lr=1e-4, but that wasn't good either.

I think the problem is that the first dataset was very easy (few small objects), while the new version adds many hard crossroads. Maybe I need to configure my hyperparameters more carefully? Is there any advice you can give me on this task?

First, I need you to read this summary and tell me whether it reflects your problem correctly or not:

If yes, then I can explain why your model's performance degrades.
I also need to know whether you have the basics/fundamentals of machine learning and deep learning, please.

Yes, this is an accurate summary of my problem. And yes, I have an intermediate understanding of ML and DL.


Since you have experience and knowledge in machine learning and deep learning, you should first check whether there is a difference between the data distribution in the first experiment and the second experiment. The configurations used in the two experiments were clearly different.

Let's highlight the differences. I do not remember every detail of the first and second experiments, but they were written in the summary you shared earlier. Comparing the two side by side, you will notice several differences: the batch size, the learning rate, and the number of epochs. As far as I remember, the first experiment was trained for around 150–250 epochs, while the second was trained for only 50. So we are clearly dealing with very different training configurations.
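One detail worth checking when comparing the two configurations is the effective batch size (per-step batch size times gradient accumulation steps), since that, rather than the raw batch size, is what usually drives learning-rate scaling. A minimal sketch using the numbers from the two runs in this thread:

```python
# Effective batch size = per-step batch size x gradient accumulation steps.
def effective_batch(batch_size, grad_accum_steps):
    return batch_size * grad_accum_steps

run1 = effective_batch(16, 1)  # A100 run: batch_size=16, grad_accum_steps=1
run2 = effective_batch(4, 4)   # A30 run:  batch_size=4,  grad_accum_steps=4
print(run1, run2)  # both are 16

# Linear LR scaling rule of thumb: lr_new = lr_base * (eb_new / eb_base).
# With equal effective batch sizes, no scaling is implied, so the drop
# from lr=1e-4 to lr=2e-5 is a separate 5x reduction to weigh on its own.
lr_scaled = 1e-4 * (run2 / run1)
print(lr_scaled)  # 1e-4
```

This is only a rule of thumb, but it helps separate which configuration changes actually alter the optimization dynamics.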

The second important point is related to best practices. When we use a foundation model or a pre-trained model, it has already been trained on very large and diverse datasets containing many types of samples and objects. As a result, the learned weights are generally capable of adapting well to smaller datasets for specific downstream tasks.

However, what you did is not considered a good practice. You trained the model and generated a checkpoint—let’s assume it was the best checkpoint obtained after training for around 150–250 epochs. Then, after adding new classes and modifying the original dataset, you took that checkpoint and tried to fine-tune it again using different settings, such as a different learning rate, batch size, and other parameters.

Instead, you should start again from the original pre-trained model with its initial weights, and then train or fine-tune it directly on the second dataset. This approach allows the model to better generalize to the new task.

You can think about this process similarly to knowledge distillation: it is like taking a scientist who has a broad PhD-level knowledge in general science and specializing them further in a very specific topic related to your task. When done correctly, you will likely see better results.

If you want to further optimize the training process, you can use hyperparameter optimization algorithms. Some well-known approaches include:

Genetic algorithms

Bayesian optimization

If you are using frameworks such as PyTorch or Ultralytics/RT-DETR, you can integrate external libraries for hyperparameter tuning. One popular framework is Ray Tune, which supports Bayesian optimization and other search strategies. You should check how to integrate Ray Tune with your model training pipeline.

For example, if you have a good virtual machine or a server with a powerful GPU—and sufficient time and budget (especially if you are using cloud infrastructure)—you can run multiple hyperparameter trials.

To summarize the recommended workflow:

1. Start from the original pre-trained model weights.

2. Apply a hyperparameter optimization algorithm such as genetic algorithms or Bayesian optimization.

3. You can implement Bayesian optimization using a library such as Ray Tune.

4. Run multiple trials (for example, 30 trials), where each trial trains the model for a limited number of epochs (for example, 30 epochs).

5. Define the search space by specifying:

Initial parameter values

Maximum values

Number of trials

Number of iterations

6. Let the framework search for the best hyperparameter configuration.

7. After finding the optimal configuration, perform a focused training run using those parameters.

8. Finally, compare the results with both the first and second experiments.
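The search loop in steps 2–6 can be sketched with a simple random search over the learning rate and batch size. This is a stdlib stand-in for what Ray Tune or a Bayesian optimizer would automate, and `train_and_eval` here is a hypothetical placeholder for a short (e.g. 30-epoch) RF-DETR training run that returns validation mAP50:

```python
import math
import random

random.seed(0)  # reproducible trials for this toy example

def train_and_eval(config):
    # Placeholder objective standing in for a real short training run.
    # Toy shape: score degrades as lr moves away from ~5e-5 on a log scale.
    return 1.0 - abs(math.log10(config["lr"]) + 4.3) * 0.1

def random_search(n_trials=30):
    best_cfg, best_map = None, float("-inf")
    for _ in range(n_trials):
        config = {
            # Sample lr log-uniformly between 1e-5 and 1e-3.
            "lr": 10 ** random.uniform(-5, -3),
            "batch_size": random.choice([4, 8, 16]),
        }
        score = train_and_eval(config)
        if score > best_map:
            best_cfg, best_map = config, score
    return best_cfg, best_map

best_cfg, best_map = random_search()
print(best_cfg, best_map)
```

A Bayesian optimizer (e.g. via Ray Tune) replaces the uniform sampling with a model of past trial results, so it typically needs fewer trials to find a good region, but the overall loop structure is the same.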

This pipeline will allow you to properly evaluate whether the difference in results is due to data distribution changes or training configuration differences.

If you need any help implementing this process or integrating the optimization framework, feel free to contact me after this message.

So this is the main problem you were talking about: the difference between the two data distributions is making your model give low results in the second experiment. Apply the pipeline I described above, and you should find better results.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.