RF-DETR Training Error

I tried training RF-DETR multiple times and keep running into the same issue. “This training job did not complete successfully. This can happen for a few reasons but often means that the chosen model dimension (which corresponds to image size) was too large to fit into GPU memory.” I have tried multiple input sizes (including 640 and 560) and nothing seems to be working.

Hey @mz0g

Thanks for trying RF-DETR. It’s a pretty new model, so we’re actively working to improve the experience. We’ve released several fixes that should prevent this issue from recurring and have refunded credits to those who were charged for crashed training runs.

If you have any other feedback or issues, please let us know!

Hello!
I have faced a similar issue with training RF-DETR. I have been searching for relevant help and finally found this thread. I have attached an image as proof of the issue. I tried to train it sometime last week on a small plant dataset. The credits for image augmentation, version generation, and model training have been consumed, but the error message says the model stalled, and there is no result. I am currently on the Student Research plan, which I use for my Honors Project. My credits for April are all used up, and I’m not able to make further progress on my project. It is affecting my work. Please, kindly respond.

Hey Leo, thanks for the update on this, I really appreciate it! How can I check if my credits were refunded? I think also a good feature to incorporate would be a list of model runs that occurred with the number of credits that were consumed. I tried training RF-DETR 3 times, but I can’t really see how it impacted my overall training credits.

Hey @mz0g and @MYAKAM_ANIRUDH_IIITK

We manually refunded credits for people who had stalled trainings - training jobs that didn’t finish. I miscommunicated in my earlier post, and I’m sorry about that.

For failed trainings, our billing system doesn’t deduct credits in the first place, so there should be nothing to refund.

“I think also a good feature to incorporate would be a list of model runs that occurred with the number of credits that were consumed.”

That’s good feedback. We’re always working on making usage clearer for our users, so I’ll pass that to the team working on that front.

Hey @MYAKAM_ANIRUDH_IIITK, I also wanted to let you know that your run crashed due to numerical instability on our side. We have pushed a fix for it. Please let us know if the issue recurs!
