How to annotate scanned document images to train the PaddleOCR model

insinfo · August 21, 2024, 3:42pm

How do I annotate scanned document images to train a PaddleOCR model? I can run PaddleOCR training on my machine with the Total-Text dataset, but now I need to assemble my own dataset. I already have the document images and have uploaded them to Roboflow, but I haven’t found any tool that helps annotate scanned documents. Any suggestions or solutions for this?

Another problem is that I can’t type special characters in class names. Am I doing something wrong, or am I missing something?

leo · August 22, 2024, 3:36pm

Hey @insinfo

Roboflow’s object detection and instance segmentation project types are designed for class labeling, not free-text annotation, so class names only support alphanumeric characters.

It might be worth taking a look at the captioning (text-image pair) dataset type, but that does not allow for localization or object-level annotation.
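
For context on why class names are a poor fit here, below is a rough sketch of what a text-detection label line could look like, assuming the ICDAR2015-style format described in the PaddleOCR docs (image path, a tab, then a JSON list of regions with `transcription` and `points`). The function name `make_det_label_line`, the file path, and the sample text are made up for illustration, so please check the format against the PaddleOCR version you are training with.

```python
import json


def make_det_label_line(image_path, regions):
    """Build one label line: image path, a tab, then a JSON list of regions.

    regions is a list of (text, points) pairs, where points holds the four
    [x, y] corners of the text box. The layout is an assumption based on the
    ICDAR2015-style labels used in the PaddleOCR docs.
    """
    annotations = [
        {"transcription": text, "points": [[int(x), int(y)] for x, y in points]}
        for text, points in regions
    ]
    # ensure_ascii=False keeps accents and symbols readable in the label file
    return f"{image_path}\t{json.dumps(annotations, ensure_ascii=False)}"


# Hypothetical scanned page with one text region containing special characters
line = make_det_label_line(
    "scans/page_001.jpg",
    [("Número: 42/2024", [[10, 20], [200, 20], [200, 50], [10, 50]])],
)
print(line)
```

The text lives in the `transcription` field rather than in a class name, so special characters are not restricted the way class labels are.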
