How to Convert a VOC Dataset for Yolo5
A step-by-step introduction with cervical cells.
Our cervical cell dataset is in Pascal VOC format. Yolo5 doesn't support Pascal VOC, so we have to convert the dataset from Pascal VOC to Yolo5 format:
1. Download the conversion tool code from GitHub. With the Pascal VOC dataset linked in, the directory tree looks like this:
(base) $ tree . -L 1
.
├── Format.py
├── README.md
├── __pycache__
├── cevical_voc -> /Volumes/Bo500G32MCache/baidu_disk/0729/
│   ├── img
│   ├── xml
│   └── names.txt
├── compare_file_images_vs_lables.py
├── example
├── example.py
├── images
├── label_visualization.py
├── manifest.txt
├── msgLogInfo.py
├── output
├── requirements.txt
├── tmp1.py
├── yolo5_cer
└── yolo5_cer.tgz
xml: labels
img: images
file list: names.txt
- Each image has its original image file, such as .tiff, .png, or .jpg.
- Each image also has a label file, which has the same file name but a different suffix.
- The class file includes:
1) the class name,
2) the class index, counted from 0 (0, 1, 2…).
2. Run the following command to convert VOC to YOLO5:
python3 example.py --datasets VOC --img_path ./cevical_voc/img/ --label ./cevical_voc/xml --convert_output_path ./output --img_type ".jpg" --cls_list_file ./cevical_voc/names.txt
After running the above command, the ./output folder will have the converted labels:
output
├── 1903610_b20200727_m0208360849_IMG001x010.txt
├── 1903610_b20200727_m0208360849_IMG001x010_jpg_13x.txt
├── 1903610_b20200727_m0208360849_IMG001x010_jpg_13xy.txt
├── 1903610_b20200727_m0208360849_IMG001x010_jpg_13y.txt
├── 1903610_b20200727_m0208360849_IMG001x014.txt
......
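For readers who want to see what the conversion does to each box, here is a minimal Python sketch. It is not the code of the conversion tool used above. It assumes the usual Pascal VOC XML layout (a size element with width and height, and one bndbox with xmin, ymin, xmax, ymax per object) and the usual Yolo label line of class index followed by the normalized box center, width, and height. The function names and the sample annotation are made up for illustration.

```python
import xml.etree.ElementTree as ET


def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert absolute VOC corners to normalized (x_center, y_center, w, h)."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def voc_xml_to_yolo_lines(xml_text, class_names):
    """Turn one VOC annotation (given as a string) into Yolo label lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)
    lines = []
    for obj in root.iter("object"):
        # The class index is the position of the class name in the class list.
        cls_id = class_names.index(obj.find("name").text.strip())
        box = obj.find("bndbox")
        coords = [float(box.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax")]
        yolo = voc_box_to_yolo(*coords, img_w, img_h)
        lines.append(f"{cls_id} " + " ".join(f"{v:.6f}" for v in yolo))
    return lines


# Hypothetical annotation, only for illustration (not taken from the dataset).
SAMPLE_XML = """<annotation>
  <size><width>200</width><height>100</height><depth>3</depth></size>
  <object>
    <name>kernel</name>
    <bndbox><xmin>50</xmin><ymin>20</ymin><xmax>150</xmax><ymax>60</ymax></bndbox>
  </object>
</annotation>"""

if __name__ == "__main__":
    print(voc_xml_to_yolo_lines(SAMPLE_XML, ["kernel"]))
    # prints ['0 0.500000 0.400000 0.500000 0.400000']
```

Each .txt file in ./output should contain one such line per annotated object, which is a quick way to spot-check the converted labels.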
PS: if you are not familiar with the Pascal VOC format, note that the conversion above is based on names.txt; here we have chosen train.txt:
(base) bos-MacBook-Pro:convert2Yolo $ cat cevical_voc/train.txt |more
1903610_b20200727_m0208360849_IMG013x028_jpg_11y
1903676_b20200727_m0215460808_IMG016x029_jpg_12y
1903676_b20200727_m0215460808_IMG027x003_jpg_5
1903610_b20200727_m0208360849_IMG019x026_jpg_7y
......
3. Create a new directory for Yolo5 and copy 2000 annotated files into it. You can choose more, but don't choose too few: I first used 400 files, and the trained model could not recognize anything correctly.
mkdir -p ./yolo5_cer/train/images
find ./cevical_voc/img/ -type f | head -2000 | xargs gcp -t ./yolo5_cer/train/images
mkdir -p ./yolo5_cer/train/labels
find ./cevical_voc/xml -type f | xargs gcp -t ./yolo5_cer/train/labels
mkdir -p ./yolo5_cer/valid/images
find ./cevical_voc/img/ -type f | head -2000 | xargs gcp -t ./yolo5_cer/valid/images
mkdir -p ./yolo5_cer/valid/labels
find ./cevical_voc/xml -type f | xargs gcp -t ./yolo5_cer/valid/labels
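The commands above draw the training and validation images from the same head -2000 listing, so the two sets overlap. If disjoint sets are wanted, the copy step could be sketched in Python as below. This is only a sketch under assumptions: the function name split_dataset is hypothetical, the converted labels are assumed to be .txt files sharing the image's base name (as in ./output above), and the split is a simple seeded shuffle.

```python
import random
import shutil
from pathlib import Path


def split_dataset(img_dir, label_dir, out_dir, n_train, n_valid,
                  img_ext=".jpg", seed=0):
    """Copy disjoint train/valid subsets of image+label pairs into out_dir."""
    img_dir, label_dir, out_dir = Path(img_dir), Path(label_dir), Path(out_dir)
    # Keep only images that have a label file with the same base name.
    stems = sorted(
        p.stem for p in img_dir.glob(f"*{img_ext}")
        if (label_dir / f"{p.stem}.txt").is_file()
    )
    random.Random(seed).shuffle(stems)
    splits = {
        "train": stems[:n_train],
        "valid": stems[n_train:n_train + n_valid],
    }
    for split, names in splits.items():
        (out_dir / split / "images").mkdir(parents=True, exist_ok=True)
        (out_dir / split / "labels").mkdir(parents=True, exist_ok=True)
        for stem in names:
            shutil.copy2(img_dir / f"{stem}{img_ext}", out_dir / split / "images")
            shutil.copy2(label_dir / f"{stem}.txt", out_dir / split / "labels")
    # Report how many pairs ended up in each split.
    return {split: len(names) for split, names in splits.items()}
```

With the paths from this post, a call might look like split_dataset("./cevical_voc/img", "./output", "./yolo5_cer", 2000, 500), where the validation count of 500 is only a placeholder.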
4. The data.yaml file describes which directories are used, together with the number of classes (nc) and their names:
train: ./train/images
val: ./valid/images
nc: 1
names: ['kernel']
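The nc value has to agree with the number of entries in names. A small Python sketch that writes the file and derives nc from the list, so the two cannot drift apart, could look like this. The helper name write_data_yaml is hypothetical, and the file is written as plain text rather than through a YAML library.

```python
from pathlib import Path


def write_data_yaml(path, train_dir, val_dir, names):
    """Write a minimal data.yaml; nc is computed from the names list."""
    quoted = ", ".join(f"'{n}'" for n in names)
    text = (
        f"train: {train_dir}\n"
        f"val: {val_dir}\n"
        f"nc: {len(names)}\n"
        f"names: [{quoted}]\n"
    )
    Path(path).write_text(text)
    return text
```

Calling write_data_yaml("./yolo5_cer/data.yaml", "./train/images", "./valid/images", ["kernel"]) reproduces the four lines shown above.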
5. The last step is to make sure the image files and label files are in sync; otherwise Yolo training will fail. Run the following file in Visual Studio.
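The check script itself is not reproduced in this post. As a stand-in, here is a minimal sketch of what such a check could do, assuming that an image and its label share the same base name and that labels are .txt files. The function name find_unmatched and the list of image suffixes are assumptions.

```python
from pathlib import Path

# Assumed set of image suffixes; adjust to the dataset at hand.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def find_unmatched(images_dir, labels_dir):
    """Return (images without a label, labels without an image) as sorted lists."""
    image_stems = {
        p.stem for p in Path(images_dir).iterdir()
        if p.suffix.lower() in IMAGE_EXTS
    }
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

If both returned lists are empty, every image has a label and every label has an image; any names that do show up are the files to fix or remove before training.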
Now the Yolo5-training-ready directory looks like this:
yolo5_cer
├── data.yaml
├── images
└── labels