I’m using the latest Python SDK, v1.2.11. I got an error while following the docs instructions: Train a Model | Developer Reference | Roboflow Docs (which are also broken, as generate_version needs a settings wrapper). The idea is to generate a new version and trigger the model training process on Roboflow GPUs. The API key, workspace, and project are correct, and the version is generated successfully. When I run the whole flow shown in the docs above, I get roboflow.adapters.rfapi.RoboflowError: {"error": "Unknown error"} in the console, and in the UI it just hangs on "initializing". I also checked the version via the REST API, and it reports the initializing status too. Then I tried calling the REST API directly (start_version_training, taken from the Python SDK code) and got a 400 status code with this body: {"speed": "fast", "modelType": "yolov11n", "epochs": 10}. I get the same error with rfdetr.
Hi elij!
Thanks for the detailed report. Based on your description, there are a few issues to address:
The version being stuck in “initializing” status indicates that version generation didn’t complete successfully before training was triggered. Training cannot start until the version is fully generated, and I believe this is also why the 400 error is occurring.
To fix this, you will need to wait until the status changes to generated. You can check the version details in the UI to see whether there are any errors with the generation. If it does get stuck, I recommend regenerating that version.
You’re correct that the docs may be outdated: generate_version does need a settings wrapper. I will get the correct info and update those docs as soon as I can. Thanks for pointing that out.
Let me know if this works out for you, happy Roboflowing!
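The "wait until the status changes to generated" step above can be sketched as a small polling helper. This is only a sketch under assumptions: `fetch_status` is a hypothetical callable that you would implement yourself (for example, by reading the version's status from the REST API), and the status strings are the ones quoted in this thread, not a documented list.

```python
import time
from typing import Callable


def wait_until_generated(
    fetch_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns "generated" or the timeout expires.

    fetch_status is a hypothetical hook: it should return the current
    status string of the version (e.g. "initializing" or "generated").
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "generated":
            return status
        if time.monotonic() >= deadline:
            # Give up instead of starting training on an unfinished version.
            raise TimeoutError(f"version still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Training would then be triggered only after this helper returns. If it raises `TimeoutError`, regenerating the version is probably the better next step.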
Thank you for your response. I really appreciate your prompt reply and suggestions.
I triggered the job via the Python SDK yesterday and got the error in the console that I described earlier (“Unknown error”).
What I see in the UI:
Training in Progress
We are fitting a custom model to your data. We estimate Calculating… for this run once it starts. We will send you an email when it is finished.
In the meantime, read our Inference docs to learn how you will be able to deploy your trained model.
Training machine starting…
This has been going on for a whole day (almost 24 hours now) with no progress.
I can also see the dataset and model type, with all settings shown correctly, and I can download the dataset without any issues.
Based on the code, version generation is awaited in the train stage: roboflow-python/roboflow/core/version.py at main · roboflow/roboflow-python · GitHub
and then training is triggered: roboflow-python/roboflow/core/version.py at main · roboflow/roboflow-python · GitHub
AFAIU, version creation via project.generate_version is async, but when you use it in combination with train, you don’t need to wait for version generation to complete yourself, right?
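For readers following along, the generate-then-train flow discussed here might look roughly like the sketch below. It assumes the SDK's `project.generate_version(settings=...)` returns the new version number, that `project.version(n)` returns an object with a `train(...)` method, and that the settings wrapper needs `"augmentation"` and `"preprocessing"` keys. `generate_and_train` itself is a hypothetical helper, not part of the SDK.

```python
def generate_and_train(project, settings, **train_kwargs):
    """Generate a new dataset version, then start training on it.

    `project` is expected to behave like a Roboflow SDK project object
    (assumption: it exposes generate_version() and version()).
    """
    # Assumption: the settings wrapper must carry both keys, even if empty.
    missing = {"augmentation", "preprocessing"} - settings.keys()
    if missing:
        raise ValueError(f"settings is missing keys: {sorted(missing)}")

    # Assumption: generate_version returns the new version number.
    version_number = project.generate_version(settings=settings)
    version = project.version(version_number)

    # train_kwargs are passed through unchanged, e.g. speed="fast".
    return version.train(**train_kwargs)
```

With a real project object this would be called as `generate_and_train(project, {"augmentation": {}, "preprocessing": {}}, speed="fast")`. Whether `train` itself waits for generation to finish is exactly the open question in this post.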
I also tried to trigger training again for the same version that is stuck in the “initializing” state, but got this error:
roboflow.adapters.rfapi.RoboflowError: {"error": {"message": "A training job is already running for this version."}}
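Note that the two errors quoted in this thread have different shapes: `{"error": "Unknown error"}` carries a plain string, while the one above nests a `message` field. A minimal sketch for reading either shape, assuming the error text is valid JSON like the examples quoted here (`error_message` is a hypothetical helper, not an SDK function):

```python
import json


def error_message(raw: str) -> str:
    """Extract a readable message from either error shape seen in this thread."""
    payload = json.loads(raw)
    err = payload.get("error")
    if isinstance(err, dict):
        # Nested shape: {"error": {"message": "..."}}
        return str(err.get("message", err))
    # Flat shape: {"error": "..."}
    return str(err)
```

This makes it easier to branch on the "already running" case in a script instead of matching against the raw exception text.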
I triggered new versions via the API a few times, and even if I go to the UI and click “Stop Training Early”, it just spins with the text “Stopping” for the whole day. When I try to cancel it, I get the message “Error cancelling training, please try again.”
But there’s good news too: I can trigger training via the UI with no issues. My use case doesn’t allow me to use the UI, though, which is why I need the API or Python SDK to make this happen.