Distributed parallel training
This document provides a tutorial on distributed parallel training.
There are two ways to run distributed training on the Ascend AI processor: running scripts with OpenMPI, or configuring a RANK_TABLE_FILE.
Please ensure that the distribute parameter in the yaml file is set to True before running the following commands for distributed training.
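The exact location of this flag depends on the config layout. As a minimal sketch (assuming, for illustration, that the flag lives under a top-level system section; adjust the key path to match your own config), the setting can be checked programmatically before launching:

# Minimal sketch: check that distributed training is enabled before launching.
# Assumes the flag sits under a top-level "system" section, which may differ
# from your config layout.
import yaml

with open("configs/det/dbnet/db_r50_icdar15.yaml") as f:
    cfg = yaml.safe_load(f)

if not cfg.get("system", {}).get("distribute", False):
    raise SystemExit("Set distribute: True in the yaml file before running distributed training.")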
1. Ascend
Notes:
On the Ascend platform, some common restrictions on using the distributed service are as follows:
- In a single-node system, a cluster of 1, 2, 4, or 8 devices is supported. In a multi-node system, a cluster of 8 x N devices is supported.
- Each host has four devices numbered 0 to 3 and four devices numbered 4 to 7, deployed on two different networks. When training with 2 or 4 devices, the devices must be interconnected, and clusters cannot be created across networks. This means that when training with 4 devices, only {0, 1, 2, 3} and {4, 5, 6, 7} are available, and when training with 2 devices, cross-network pairs such as {0, 4} are not allowed, whereas pairs within the same network, such as {0, 1} or {1, 2}, are allowed (see the illustrative check below).
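For illustration only, the grouping rule above can be expressed as a small check; this is not part of any Ascend or MindSpore API, just a sketch of the restriction:

# Illustrative sketch of the single-node device-grouping rule described above;
# not part of any Ascend/MindSpore API.
def is_valid_device_group(devices):
    devices = sorted(set(devices))
    if len(devices) not in (1, 2, 4, 8):
        return False                      # only 1/2/4/8-device clusters are supported
    if len(devices) == 8:
        return devices == list(range(8))  # all eight devices
    # 2- and 4-device clusters must stay within one network: {0, 1, 2, 3} or {4, 5, 6, 7}
    networks = ({0, 1, 2, 3}, {4, 5, 6, 7})
    return any(set(devices) <= net for net in networks)

print(is_valid_device_group([0, 1, 2, 3]))  # True
print(is_valid_device_group([0, 4]))        # False: crosses networks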
1.1 Run scripts with OpenMPI
On the Ascend hardware platform, users can use OpenMPI's mpirun to run distributed training with n devices. For example, in the DBNet readme, the following command is used to train the model on devices 0 and 1:
# n is the number of NPUs used in training
mpirun --allow-run-as-root -n 2 python tools/train.py --config configs/det/dbnet/db_r50_icdar15.yaml
Note that mpirun will run training on sequential devices starting from device 0. For example, mpirun -n 4 python-command will run training on the four devices {0, 1, 2, 3}.
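Under the hood, a script launched this way typically initializes MindSpore's communication layer to discover its rank and group size from the MPI environment. The following is a minimal sketch of that initialization, not the actual tools/train.py; it follows standard MindSpore data-parallel setup and details may differ between MindSpore versions:

# Minimal sketch of distributed initialization for a script launched with mpirun.
# Not the actual tools/train.py; standard MindSpore data-parallel setup.
import mindspore as ms
from mindspore.communication import init, get_rank, get_group_size

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")
init()  # with OpenMPI, rank and group size are taken from the MPI environment

rank_id = get_rank()
rank_size = get_group_size()
ms.set_auto_parallel_context(
    parallel_mode="data_parallel",  # replicate the model, split the data
    gradients_mean=True,
    device_num=rank_size,
)
print(f"rank {rank_id}/{rank_size} initialized")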
1.2 Configure RANK_TABLE_FILE for training
1.2.1 Running on Eight (All) Devices
Before using this method for distributed training, it is necessary to create an HCCL configuration file in JSON format, i.e., to generate the RANK_TABLE_FILE. The following command generates the configuration file for 8 devices (for more information, please refer to HCCL tools):
python hccl_tools.py --device_num "[0,8)"
This command produces the following output file:
hccl_8p_01234567_127.0.0.1.json
An example of the content of hccl_8p_01234567_127.0.0.1.json:
{
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "127.0.0.1",
            "device": [
                {
                    "device_id": "0",
                    "device_ip": "192.168.100.101",
                    "rank_id": "0"
                },
                {
                    "device_id": "1",
                    "device_ip": "192.168.101.101",
                    "rank_id": "1"
                },
                {
                    "device_id": "2",
                    "device_ip": "192.168.102.101",
                    "rank_id": "2"
                },
                {
                    "device_id": "3",
                    "device_ip": "192.168.103.101",
                    "rank_id": "3"
                },
                {
                    "device_id": "4",
                    "device_ip": "192.168.100.100",
                    "rank_id": "4"
                },
                {
                    "device_id": "5",
                    "device_ip": "192.168.101.100",
                    "rank_id": "5"
                },
                {
                    "device_id": "6",
                    "device_ip": "192.168.102.100",
                    "rank_id": "6"
                },
                {
                    "device_id": "7",
                    "device_ip": "192.168.103.100",
                    "rank_id": "7"
                }
            ],
            "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
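If startup fails, it can help to double-check the device-to-rank mapping in the generated file. Below is a hypothetical helper (not part of hccl_tools.py) that prints the mapping:

# Hypothetical helper (not part of hccl_tools.py): print the device-to-rank
# mapping from a generated rank table file.
import json

with open("hccl_8p_01234567_127.0.0.1.json") as f:
    table = json.load(f)

for server in table["server_list"]:
    for dev in server["device"]:
        print(f'server {server["server_id"]}: device_id {dev["device_id"]} -> rank_id {dev["rank_id"]}')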
Then start the training by running the following command:
bash ascend8p.sh
Please ensure that the distribute parameter in the yaml file is set to True before running the command.
Here is an example of the ascend8p.sh script for CRNN training:
#!/bin/bash
export DEVICE_NUM=8
export RANK_SIZE=8
export RANK_TABLE_FILE="./hccl_8p_01234567_127.0.0.1.json"

for ((i = 0; i < ${RANK_SIZE}; i++)); do
    export DEVICE_ID=$i
    export RANK_ID=$i
    echo "Launching rank: ${RANK_ID}, device: ${DEVICE_ID}"
    if [ $i -eq 0 ]; then
        echo 'i am 0'
        python -u tools/train.py --config configs/rec/crnn/crnn_resnet34_zh.yaml &> ./train.log &
    else
        echo 'not 0'
        python -u tools/train.py --config configs/rec/crnn/crnn_resnet34_zh.yaml &> /dev/null &
    fi
done
To train a different model, replace the config file in the script with the corresponding path/to/model_config.yaml.
After the training has started, you can find the training log train.log in the project root directory.
1.2.2 Running on Four (Partial) Devices
To run training on four devices, for example {4, 5, 6, 7}, the RANK_TABLE_FILE and the run script are different from those for running on eight devices.
The rank table file is created by running the following command:
python hccl_tools.py --device_num "[4,8)"
This command produces the following output file:
hccl_4p_4567_127.0.0.1.json
An example of the content of hccl_4p_4567_127.0.0.1.json:
{
    "version": "1.0",
    "server_count": "1",
    "server_list": [
        {
            "server_id": "127.0.0.1",
            "device": [
                {
                    "device_id": "4",
                    "device_ip": "192.168.100.100",
                    "rank_id": "0"
                },
                {
                    "device_id": "5",
                    "device_ip": "192.168.101.100",
                    "rank_id": "1"
                },
                {
                    "device_id": "6",
                    "device_ip": "192.168.102.100",
                    "rank_id": "2"
                },
                {
                    "device_id": "7",
                    "device_ip": "192.168.103.100",
                    "rank_id": "3"
                }
            ],
            "host_nic_ip": "reserve"
        }
    ],
    "status": "completed"
}
Then start the training by running the following command:
bash ascend4p.sh
Here is an example of the ascend4p.sh script for CRNN training:
#!/bin/bash
export DEVICE_NUM=4
export RANK_SIZE=4
export RANK_TABLE_FILE="./hccl_4p_4567_127.0.0.1.json"

for ((i = 0; i < ${RANK_SIZE}; i++)); do
    export DEVICE_ID=$((i+4))
    export RANK_ID=$i
    echo "Launching rank: ${RANK_ID}, device: ${DEVICE_ID}"
    if [ $i -eq 0 ]; then
        echo 'i am 0'
        python -u tools/train.py --config configs/rec/crnn/crnn_resnet34_zh.yaml &> ./train.log &
    else
        echo 'not 0'
        python -u tools/train.py --config configs/rec/crnn/crnn_resnet34_zh.yaml &> /dev/null &
    fi
done
Note that DEVICE_ID and RANK_ID should match the entries in hccl_4p_4567_127.0.0.1.json.
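As a sanity check, this mapping can also be verified programmatically. The snippet below is a hypothetical helper (not part of the training scripts) that confirms the DEVICE_ID / RANK_ID pair exported by ascend4p.sh agrees with the rank table:

# Hypothetical sanity check (not part of the training scripts): verify that the
# exported DEVICE_ID / RANK_ID pair matches the generated rank table.
import json
import os

with open(os.environ["RANK_TABLE_FILE"]) as f:
    table = json.load(f)

# Build a device_id -> rank_id mapping from the rank table.
mapping = {
    dev["device_id"]: dev["rank_id"]
    for server in table["server_list"]
    for dev in server["device"]
}

device_id = os.environ["DEVICE_ID"]
rank_id = os.environ["RANK_ID"]
expected = mapping.get(device_id)
if expected != rank_id:
    raise SystemExit(f"DEVICE_ID {device_id} maps to rank_id {expected}, not {rank_id}")
print("DEVICE_ID and RANK_ID are consistent with the rank table")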