New VOC Dataset for Anno-Robot, Step by Step

Paul Xiong
2 min read · Jun 24, 2022

How to create a new dataset for Anno-Robot

Clone the TensorFlow Datasets source code. Don't worry, we are not going to modify it; we only need to copy voc.py from it.

$ git clone https://github.com/tensorflow/datasets.git

Use tfds to create a new, empty dataset template.

$ tfds new anno_dataset

Copy voc.py over the generated anno_dataset.py:

$ cd anno_dataset
$ cp ../datasets/tensorflow_datasets/object_detection/voc.py ./anno_dataset.py

In anno_dataset.py, modify the configs as follows:

Comment out the first two of the three VocConfig entries and set the last one to …year="2022". In practice, only the filenames={} argument matters.
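As a rough illustration, the edited config list might end up looking like the sketch below. The `VocConfig` class here is a simplified stand-in so the snippet runs on its own (the real class in voc.py is a TFDS builder config), and the `trainval`/`test` keys and the archive names are assumptions; use the keys your copy of voc.py expects and the archives you actually have.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VocConfig:
    """Simplified stand-in for the VocConfig class defined in voc.py."""
    year: str
    filenames: Dict[str, str] = field(default_factory=dict)


BUILDER_CONFIGS = [
    # VocConfig(year="2007", ...),  # commented out
    # VocConfig(year="2012", ...),  # commented out
    VocConfig(
        year="2022",
        # Assumed keys and archive names; this mapping is the part that matters.
        filenames={
            "trainval": "VOCOtrain.tar",
            "test": "VOCOtest.tar",
        },
    ),
]
```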

Inside the Docker container, start an HTTP server (default port 8000):

$ cd /
$ python3 -m http.server
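Presumably the server is there so that the dataset builder can download the tar files from a local URL. Before building, it can help to confirm that a file is actually reachable. The sketch below is one way to do that from Python; `serve_directory` and `is_reachable` are hypothetical helper names, and the first one is only a programmatic equivalent of the command above.

```python
import functools
import http.server
import threading
import urllib.request


def serve_directory(directory: str, port: int = 8000) -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP in a background thread."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bound to localhost only; the command-line server listens on all interfaces.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request for `url` answers with status 200."""
    request = urllib.request.Request(url, method="HEAD")
    # Ignore proxy settings, since the server is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False
```

With the server from the command above running, `is_reachable("http://localhost:8000/<path to your tar>")` should return True for a file that exists under `/`.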

To test the dataset:

  • The first time, when the dataset has never been built successfully:
# cd anno_dataset
# tfds build
# tfds build --register_checksums
  • Every time after that:
# tfds build --overwrite
# tfds build --register_checksums

Please note: without --overwrite, “tfds build” will appear to pass without actually rebuilding the dataset.
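If you rebuild often, the two cases above can be wrapped in a small script so that --overwrite is never forgotten. This is only a sketch: `build_commands` and `run_build` are hypothetical helpers, and the commands are the same ones listed above.

```python
import subprocess
from typing import List


def build_commands(first_build: bool) -> List[List[str]]:
    """Return the tfds commands for a first build or a rebuild."""
    build = ["tfds", "build"]
    if not first_build:
        # Without this flag an existing build is reused instead of rebuilt.
        build.append("--overwrite")
    return [build, ["tfds", "build", "--register_checksums"]]


def run_build(dataset_dir: str, first_build: bool) -> None:
    """Run the commands inside the dataset folder, stopping on the first failure."""
    for command in build_commands(first_build):
        subprocess.run(command, cwd=dataset_dir, check=True)
```

For example, `run_build("anno_dataset", first_build=False)` runs the rebuild sequence from inside the anno_dataset folder.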

How to make a smaller dataset for Pascal format

`-- VOCdevkit
    `-- VOC2012
        |-- Annotations
        |-- Annotations1
        |-- ImageSets
        |   |-- Action
        |   |-- Layout
        |   |-- Main
        |   `-- Segmentation
        |-- JPEGImages
        |-- JPEGImages1
        |-- SegmentationClass
        `-- SegmentationObject

When you run tfds build --register_checksums, the data is split into up to three datasets according to the following files:

Main
|--train.txt, test.txt, val.txt

Make train.txt and val.txt from the images in ‘/mnt/anno_dataset/data/tmp_test/train/VOCdevkit/VOC2012/JPEGImages’:

ImageSets/Main# python3 run_make_traintxt.py
# cp train.txt val.txt
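The script run_make_traintxt.py is not shown in this post. A minimal sketch of what such a script could do, assuming the usual Pascal VOC convention of one image ID (the file name without its extension) per line, is below; `write_image_list` is a hypothetical name.

```python
from pathlib import Path
from typing import List


def write_image_list(jpeg_dir: str, out_file: str) -> List[str]:
    """Write the sorted image IDs found in `jpeg_dir` to `out_file`, one per line."""
    stems = sorted(
        path.stem
        for path in Path(jpeg_dir).iterdir()
        if path.suffix.lower() in {".jpg", ".jpeg"}
    )
    Path(out_file).write_text("".join(f"{stem}\n" for stem in stems))
    return stems
```

Called from ImageSets/Main as `write_image_list("../../JPEGImages", "train.txt")`, it would produce the train.txt that is then copied to val.txt.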

Make a new tar:

$ cd /mnt/anno_dataset/data/tmp_test/train
$ tar -cvf VOCdevkit.tar ./VOCdevkit
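The same archive can also be produced from Python, which is handy if the dataset is regenerated by a script. This is a sketch using the standard tarfile module; `make_tar` is a hypothetical helper, and unlike the tar command above it stores members as `VOCdevkit/...` rather than `./VOCdevkit/...`.

```python
import tarfile
from pathlib import Path


def make_tar(parent_dir: str, folder: str = "VOCdevkit") -> Path:
    """Pack `parent_dir/folder` into an uncompressed `parent_dir/folder.tar`."""
    parent = Path(parent_dir)
    tar_path = parent / f"{folder}.tar"
    # Mode "w" writes a plain tar without compression, like `tar -cf`.
    with tarfile.open(str(tar_path), "w") as archive:
        archive.add(str(parent / folder), arcname=folder)
    return tar_path
```

For the layout above, `make_tar("/mnt/anno_dataset/data/tmp_test/train")` would write VOCdevkit.tar next to the VOCdevkit folder.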

How to modify the code to point to a new dataset for Anno-Robot

What files are needed:

  • VOCOtrain.tar
  • VOCOtest.tar
  • config.json

What file to modify (anno_dataset.py)

Register the above files with TFDS:

# tfds build --overwrite
# tfds build --register_checksums

To train the model

