Error running the demo fine-tune-sam-2.1.ipynb Colab notebook by James Gallagher (@rfjames ?)

I tried running the “fine-tune-sam-2.1.ipynb” notebook provided by Roboflow with only one change—I used my own dataset. My dataset has two classes, and I resized the images by stretching them to 1024x1024. The dataset is for an image segmentation task and is exported as SAM2 JSON. Other than that, I didn’t make any modifications.

When I ran the Python training command, I got an error. Since the dataset is the only thing I changed, I think the problem might be related to it. Could it be because the masks of the two classes in my dataset overlap?

I’ve attached a screenshot of a sample annotated image from the dataset.

Link to my Roboflow dataset

The only cell changed in the notebook.

!pip install roboflow
from roboflow import Roboflow
import os

rf = Roboflow(api_key="")
project = rf.workspace("toolbox-and-equipment-tracker").project("car-parts-pgo19-vyprg")
version = project.version(1)
dataset = version.download("sam2")
                
# rf = Roboflow(api_key="W8Wh3vwPre13GJ9ArQue")
# project = rf.workspace("brad-dwyer").project("car-parts-pgo19")
# version = project.version(6)
# dataset = version.download("sam2")

# rename dataset.location to "data"
os.rename(dataset.location, "/content/data")

Error msg
INFO 2024-12-30 16:37:15,888 trainer.py: 417: Loading pretrained checkpoint from {'_partial_': True, '_target_': 'training.utils.checkpoint_utils.load_state_dict_into_model', 'strict': True, 'ignore_unexpected_keys': None, 'ignore_missing_keys': None, 'state_dict': {'_target_': 'training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels', 'checkpoint_path': './checkpoints/sam2.1_hiera_base_plus.pt', 'ckpt_state_dict_keys': ['model']}}
ERROR 2024-12-30 16:37:21,624 sam2_datasets.py: 63: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/sam2/training/dataset/vos_dataset.py", line 132, in __getitem__
    return self._get_datapoint(idx)
  File "/content/sam2/training/dataset/vos_dataset.py", line 74, in _get_datapoint
    datapoint = self.construct(video, sampled_frms_and_objs, segment_loader)
  File "/content/sam2/training/dataset/vos_dataset.py", line 108, in construct
    segments[obj_id] is not None
  File "/content/sam2/training/dataset/vos_segment_loader.py", line 247, in __getitem__
    mask = torch.from_numpy(mask_utils.decode([rle])).permute(2, 0, 1)[0]
  File "/usr/local/lib/python3.10/dist-packages/pycocotools/mask.py", line 89, in decode
    return _mask.decode(rleObjs)
  File "pycocotools/_mask.pyx", line 149, in pycocotools._mask.decode
ValueError: Invalid RLE mask representation

[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/sam2/training/train.py", line 270, in <module>
[rank0]:     main(args)
[rank0]:   File "/content/sam2/training/train.py", line 240, in main
[rank0]:     single_node_runner(cfg, main_port)
[rank0]:   File "/content/sam2/training/train.py", line 53, in single_node_runner
[rank0]:     single_proc_run(local_rank=0, main_port=main_port, cfg=cfg, world_size=num_proc)
[rank0]:   File "/content/sam2/training/train.py", line 41, in single_proc_run
[rank0]:     trainer.run()
[rank0]:   File "/content/sam2/training/trainer.py", line 515, in run
[rank0]:     self.run_train()
[rank0]:   File "/content/sam2/training/trainer.py", line 532, in run_train
[rank0]:     outs = self.train_epoch(dataloader)
[rank0]:   File "/content/sam2/training/trainer.py", line 740, in train_epoch
[rank0]:     for data_iter, batch in enumerate(train_loader):
[rank0]:   File "/content/sam2/training/dataset/sam2_datasets.py", line 64, in __next__
[rank0]:     raise e
[rank0]:   File "/content/sam2/training/dataset/sam2_datasets.py", line 56, in __next__
[rank0]:     item = next(self._iter_dls[dataset_idx])
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 715, in reraise
[rank0]:     raise exception
[rank0]: ValueError: Caught ValueError in DataLoader worker process 0.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/content/sam2/training/dataset/vos_dataset.py", line 132, in __getitem__
[rank0]:     return self._get_datapoint(idx)
[rank0]:   File "/content/sam2/training/dataset/vos_dataset.py", line 74, in _get_datapoint
[rank0]:     datapoint = self.construct(video, sampled_frms_and_objs, segment_loader)
[rank0]:   File "/content/sam2/training/dataset/vos_dataset.py", line 108, in construct
[rank0]:     segments[obj_id] is not None
[rank0]:   File "/content/sam2/training/dataset/vos_segment_loader.py", line 247, in __getitem__
[rank0]:     mask = torch.from_numpy(mask_utils.decode([rle])).permute(2, 0, 1)[0]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pycocotools/mask.py", line 89, in decode
[rank0]:     return _mask.decode(rleObjs)
[rank0]:   File "pycocotools/_mask.pyx", line 149, in pycocotools._mask.decode
[rank0]: ValueError: Invalid RLE mask representation

[rank0]:[W1230 16:37:22.752255826 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
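
Since the traceback ends in pycocotools' mask decoding, one way to narrow this down might be to try decoding every annotation in the exported dataset and list the ones that fail. The snippet below is only a rough sketch of that idea. It assumes each exported JSON file has an "annotations" list whose entries carry a "segmentation" RLE dict (with "size" and "counts"), which I have not confirmed for this export. The function name find_invalid_masks and its decode parameter are made up for illustration; by default it uses pycocotools' mask.decode, the same call that appears in the traceback.

```python
import json
from pathlib import Path


def find_invalid_masks(data_dir, decode=None):
    """Return (file name, annotation index, error message) for each mask that fails to decode."""
    if decode is None:
        # Same decoder that vos_segment_loader.py calls in the traceback.
        from pycocotools import mask as mask_utils
        decode = mask_utils.decode

    bad = []
    for json_path in sorted(Path(data_dir).rglob("*.json")):
        with open(json_path) as f:
            record = json.load(f)
        if not isinstance(record, dict):
            continue  # not the per-image layout this sketch assumes
        # Assumed layout: {"annotations": [{"segmentation": {"size": [h, w], "counts": ...}}, ...]}
        for i, ann in enumerate(record.get("annotations", [])):
            rle = ann.get("segmentation")
            try:
                decode([rle])
            except Exception as exc:
                bad.append((json_path.name, i, str(exc)))
    return bad
```

Running find_invalid_masks("/content/data") in a notebook cell after the download step and printing the result should show whether the failures are limited to a few annotations or affect the whole export.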