Training never finishes, is stuck forever

I’ve got a training job that is stuck forever. I’m not sure if it’s actually wasting GPUs. Does anyone know how to fix this? If you work at Roboflow, the job is in the cam-xwerh account and it’s v11 of the only project.

Based on the information provided in the extracts, you can cancel a training job if it appears to be stuck. According to Roboflow’s 12 Days of Shipmas: 2023 blog post, if you cancel a training job within 30 minutes of starting it, your training credit will be refunded.

However, if the training job has been running for a while and appears to be stalled, the model may be compiling final statistics, such as performance on the validation and test sets. This can make the job look stalled even though it is still working.

If the issue persists, it would be best to reach out to Roboflow support. You can share your workspace with the Roboflow Support team as suggested in this forum post. Here is the documentation on how to share a workspace with Roboflow support.

If you believe there might be an issue with your dataset or model, you could also consider reviewing the annotations, labeling, or any potential areas for bugs, and reinitiate the training from the checkpoint as suggested in this forum post. Here is the documentation on how to train from a Universe Checkpoint.

Please note that this advice is based on the information available in the provided extracts and may not fully resolve your issue. If the problem persists, I recommend reaching out to Roboflow support directly for further assistance.

@Cam Thanks for the report, and apologies for the trouble. We’re taking a look. Is this a training job you canceled?

