CTW Dataset¶
Data Downloading¶
The CTW images dataset Official Website | Download Link
Note: Please fill in a form to download this dataset.
The images are in 26 batches, i.e. 26 different .tar archived files of the format images-trainval/ctw-trainval*.tar
. All 26 batches need to be downloaded.
The CTW annotations (in JSON Lines format i.e. .jsonl
) can be downloaded from this download link. The annotations archived file is named ctw-annotations.tar.gz
.
After downloading the zipped images, unzip the batches and collect all the images into a single folder e.g. train_val/
.
After downloading the zipped annotations, unzip them.
Finally, the directory structure should look like this (ignoring the archive files):
CTW
|--- train_val
| |--- 0000172.jpg
| |--- 0000174.jpg
| |--- ...
|--- train.jsonl
|--- val.jsonl
|--- test_cls.jsonl
|--- info.json
Data Preparation¶
For Detection Task¶
To prepare the data for text detection, you can run the following commands:
python tools/dataset_converters/convert.py \
--dataset_name ctw --task det \
--image_dir path/to/CTW/train_val/ \
--label_dir path/to/CTW/train.jsonl \
--output_path path/to/CTW/det_gt.txt
The generated standard annotation file det_gt.txt
will now be placed under the folder CTW/
.
Note that the label_dir
flag may be altered to prepare the validation data.