Chinese-Text-Recognition¶
Data Downloading¶
Following the setup in Benchmarking-Chinese-Text-Recognition, we use the same training, validation, and evaluation data as described in the Datasets section.
Please download the following LMDB files, as introduced in the Downloads section:
- scene: the union of RCTW, ReCTS, LSVT, ArT, and CTW
- web: MTWI
- document: generated with Text Render
- handwriting: SCUT-HCCDoc
Data Structure¶
After downloading the files, put all training files under a single folder named `training`, all validation data under a folder named `validation`, and all evaluation data under a folder named `evaluation`.
The directory structure should look like this:
chinese-text-recognition/
├── evaluation
│   ├── document_test
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── handwriting_test
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── scene_test
│   │   ├── data.mdb
│   │   └── lock.mdb
│   └── web_test
│       ├── data.mdb
│       └── lock.mdb
├── training
│   ├── document_train
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── handwriting_train
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── scene_train
│   │   ├── data.mdb
│   │   └── lock.mdb
│   └── web_train
│       ├── data.mdb
│       └── lock.mdb
└── validation
    ├── document_val
    │   ├── data.mdb
    │   └── lock.mdb
    ├── handwriting_val
    │   ├── data.mdb
    │   └── lock.mdb
    ├── scene_val
    │   ├── data.mdb
    │   └── lock.mdb
    └── web_val
        ├── data.mdb
        └── lock.mdb
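Before training, it can help to confirm that the folders match the layout above. The following is a minimal sketch rather than part of the toolkit: the helper name `find_missing_lmdb_files` is hypothetical, and the split and domain names are simply copied from the tree above.

```python
from pathlib import Path

# Split folder -> suffix used by its sub-folders, as in the tree above.
SPLITS = {"training": "train", "validation": "val", "evaluation": "test"}
DOMAINS = ("document", "handwriting", "scene", "web")
LMDB_FILES = ("data.mdb", "lock.mdb")


def find_missing_lmdb_files(root):
    """Return the expected LMDB files that do not exist under `root`.

    Hypothetical helper for illustration; it only checks that the files
    exist and does not open or validate the LMDB databases themselves.
    """
    root = Path(root)
    missing = []
    for split, suffix in SPLITS.items():
        for domain in DOMAINS:
            for name in LMDB_FILES:
                path = root / split / f"{domain}_{suffix}" / name
                if not path.is_file():
                    missing.append(path)
    return missing
```

Calling `find_missing_lmdb_files("dir/to/chinese-text-recognition")` returns an empty list when every expected `data.mdb` and `lock.mdb` is in place, and otherwise lists the paths that still need to be downloaded or moved.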
Data Configuration¶
To use the datasets, specify them in the configuration file as follows.
Model Training¶
...
train:
  ...
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/chinese-text-recognition/  # Root dir of training dataset
    data_dir: training/                             # Dir of training dataset, concatenated with `dataset_root` to be the complete dir of training dataset
  ...
eval:
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/chinese-text-recognition/  # Root dir of validation dataset
    data_dir: validation/                           # Dir of validation dataset, concatenated with `dataset_root` to be the complete dir of validation dataset
  ...
Model Evaluation¶
...
train:
  # NO NEED TO CHANGE ANYTHING IN TRAIN SINCE IT IS NOT USED
  ...
eval:
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/chinese-text-recognition/  # Root dir of evaluation dataset
    data_dir: evaluation/                           # Dir of evaluation dataset, concatenated with `dataset_root` to be the complete dir of evaluation dataset
  ...
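The comments in both configurations say that `data_dir` is concatenated with `dataset_root` to form the complete dataset directory. A rough sketch of that resolution is given here for illustration only. It is not the toolkit's actual loader code: `resolve_data_dir` and `list_lmdb_dirs` are hypothetical helpers, and the recursive scan for `data.mdb` is an assumption about how the sub-folders are discovered.

```python
import os


def resolve_data_dir(dataset_root, data_dir):
    """Join `dataset_root` and `data_dir` into the full dataset path."""
    return os.path.normpath(os.path.join(dataset_root, data_dir))


def list_lmdb_dirs(full_dir):
    """Return the sorted folders under `full_dir` that contain a data.mdb file.

    Assumption: every folder holding a data.mdb is treated as one LMDB dataset.
    """
    found = []
    for current, _subdirs, files in os.walk(full_dir):
        if "data.mdb" in files:
            found.append(current)
    return sorted(found)
```

With the evaluation configuration above, `resolve_data_dir("dir/to/chinese-text-recognition/", "evaluation/")` gives the `evaluation` folder, and `list_lmdb_dirs` on that path would list `document_test`, `handwriting_test`, `scene_test`, and `web_test` if the layout from the Data Structure section is in place.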