Training Failed

Hello, we are testing your platform and got the following error during training:

CUDA out of memory. Tried to allocate 7.96 GiB (GPU 0; 21.99 GiB total capacity; 12.54 GiB already allocated; 4.94 GiB free; 16.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/app/run_and_catch_error.py", line 11, in <module>
    runpy._run_module_as_main(args.module)
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/app/yolov8_object_detection_config.py", line 10, in <module>
    main()
  File "/app/yolov8_object_detection_config.py", line 6, in main
    trainer.monitored_train()
  File "/app/src/abstract_monitored_trainer.py", line 34, in monitored_train
    raise self.exc
  File "/app/src/abstract_monitored_trainer.py", line 40, in monitor_train
    self.train()
  File "/app/src/yolov8/base.py", line 286, in train
    self.model.train(
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/engine/model.py", line 373, in train
    self.trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/engine/trainer.py", line 192, in train
    self._do_train(world_size)
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/engine/trainer.py", line 332, in _do_train
    self.loss, self.loss_items = self.model(batch)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/nn/tasks.py", line 44, in forward
    return self.loss(x, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/nn/tasks.py", line 215, in loss
    return self.criterion(preds, batch)
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/utils/loss.py", line 179, in __call__
    _, target_bboxes, target_scores, fg_mask, _ = self.assigner(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/utils/tal.py", line 112, in forward
    mask_pos, align_metric, overlaps = self.get_pos_mask(pd_scores, pd_bboxes, gt_labels, gt_bboxes, anc_points,
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/utils/tal.py", line 131, in get_pos_mask
    mask_in_gts = select_candidates_in_gts(anc_points, gt_bboxes)
  File "/usr/local/lib/python3.8/dist-packages/ultralytics/yolo/utils/tal.py", line 24, in select_candidates_in_gts
    bbox_deltas = torch.cat((xy_centers[None] - lt, rb - xy_centers[None]), dim=2).view(bs, n_boxes, n_anchors, -1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.96 GiB (GPU 0; 21.99 GiB total capacity; 12.54 GiB already allocated; 4.94 GiB free; 16.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


The error message you’re seeing indicates that your GPU ran out of memory during training. This can happen if the model you’re training is too large or the batch size is too high for the available GPU memory.

Here are a few suggestions to resolve this issue:

  1. Reduce the batch size: Lowering the batch size reduces the memory needed per training step. However, it can also affect the model’s performance and increase training time. See the first sketch after this list.

  2. Use gradient accumulation: This technique lets you keep a large effective batch size while holding only a small batch in GPU memory at a time. It is not directly supported by all training scripts and may require some code modification; see the second sketch after this list.

  3. Use a GPU with more memory: If possible, you can switch to a GPU with more memory. This would allow you to train larger models and use larger batch sizes.

  4. Optimize your model: If your model is too large, consider a smaller variant (for YOLOv8, e.g. yolov8n or yolov8s instead of yolov8l or yolov8x) or apply model optimization techniques. The first sketch below also illustrates this.
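
As a minimal sketch of points 1 and 4, assuming you are launching training yourself with the Ultralytics Python API (the same library shown in the traceback above), you could pick a smaller YOLOv8 variant and pass a lower batch value. The dataset path and the specific numbers below are placeholders:

```python
from ultralytics import YOLO

# Smaller variant: yolov8n/yolov8s need far less GPU memory than yolov8l/yolov8x.
model = YOLO("yolov8s.pt")

# "data.yaml" is a placeholder for your dataset config; batch=8 and imgsz=640
# are example values -- lower batch (or imgsz) further if you still hit OOM.
model.train(data="data.yaml", epochs=100, imgsz=640, batch=8)
```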
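
For point 2, here is a generic PyTorch sketch of gradient accumulation (outside the Ultralytics trainer, with a toy model and dataloader standing in for your detection setup):

```python
import torch
import torch.nn as nn

# Toy stand-ins: in practice these would be your detection model, optimizer and dataloader.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]  # small per-step batches

accum_steps = 4  # effective batch = 8 * 4 = 32, at the memory cost of a batch of 8

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets)
    (loss / accum_steps).backward()  # scale so accumulated gradients match one large batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```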

Unfortunately, the extracts provided do not contain specific instructions or code to resolve this issue in the context of Roboflow. For more detailed guidance, you might want to refer to the PyTorch documentation on Memory Management and PYTORCH_CUDA_ALLOC_CONF.
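
If memory fragmentation is contributing (the error message mentions max_split_size_mb) and you control the training environment, you can set the allocator option it refers to via the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch; the value 128 is just an example, and it must be set before the first CUDA allocation:

```python
import os

# Must be set before the first CUDA allocation (ideally before importing any code that touches CUDA).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```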

If the problem persists, it would be helpful to share more details about your training setup, such as the model you’re using, the batch size, and the specifications of your GPU.


Hi! I’m Roboflow’s AI powered bot. This is an automated response powered using AI with knowledge from our docs, blog, knowledge base, and previous forum responses. If you find this to be a helpful answer, please mark it as the solution using the checkbox icon below. If this doesn’t solve your issue, please follow up with what you’ve tried/why, and the community will continue to chime in as usual.

Hello, we are evaluating your platform for an upcoming service contract, generating prediction models from 2024 images of size 640x640, and training fails with out-of-memory errors on your GPU. Is there any possibility you could help us with training our dataset?