Distributed training with mixed precision and ZeRO parallelism
The Trainer supports distributed training and mixed precision, and both can be used from a script; see examples/transformers/llama/finetune_with_mindspore_trainer.py for more detail. To enable both of these features:
- Add the `is_distribute` argument to enable distributed training.
- Add the `fp16` or `bf16` argument to enable mixed precision.
- Add the `zero_stage` argument to enable optimizer parallelism with the ZeRO algorithm.
- Set the number of global/local NPUs to use with the `worker_num`/`local_worker_num` arguments.

Then launch the example script with msrun; a sketch of how the script might consume these flags follows the command.
```shell
msrun --bind_core=True --worker_num=8 --local_worker_num=8 --master_port=9000 --log_dir=outputs/parallel_logs \
    python finetune_with_mindspore_trainer.py \
    --model_path $local_path/meta-llama/Meta-Llama-3-8B \
    --dataset_path $local_path/yelp_review_full \
    --output_dir ./outputs \
    --bf16 \
    --zero_stage 2 \
    --is_distribute True
```
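For orientation, here is a minimal sketch of how the flags above might be wired into the trainer inside a fine-tuning script. It is not the example script itself: it assumes mindone.transformers mirrors the Hugging Face Trainer/TrainingArguments API and that TrainingArguments accepts the `zero_stage` and `is_distribute` fields from the list above, and `build_model_and_dataset` is a hypothetical placeholder for the model/tokenizer/dataset setup.

```python
# Minimal sketch, not the actual example script. Assumptions: mindone.transformers
# mirrors the Hugging Face Trainer/TrainingArguments API, TrainingArguments accepts
# the zero_stage / is_distribute fields listed above, and build_model_and_dataset
# is a hypothetical helper standing in for model/tokenizer/dataset loading.
import argparse

from mindone.transformers import Trainer, TrainingArguments


def str2bool(v: str) -> bool:
    # msrun passes `--is_distribute True` as a string, so parse it explicitly.
    return v.lower() in ("true", "1", "yes")


def build_model_and_dataset(model_path, dataset_path):
    # Hypothetical placeholder: the real example loads Meta-Llama-3-8B and the
    # tokenized Yelp review dataset here.
    raise NotImplementedError


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", required=True)
    parser.add_argument("--dataset_path", required=True)
    parser.add_argument("--output_dir", default="./outputs")
    parser.add_argument("--fp16", action="store_true")                    # float16 mixed precision
    parser.add_argument("--bf16", action="store_true")                    # bfloat16 mixed precision
    parser.add_argument("--zero_stage", type=int, default=0)              # ZeRO optimizer parallelism
    parser.add_argument("--is_distribute", type=str2bool, default=False)  # multi-NPU training
    args = parser.parse_args()

    # The command-line flags map one-to-one onto the training arguments.
    training_args = TrainingArguments(
        output_dir=args.output_dir,
        fp16=args.fp16,
        bf16=args.bf16,
        zero_stage=args.zero_stage,        # assumed field, per the list above
        is_distribute=args.is_distribute,  # assumed field, per the list above
    )

    model, train_dataset = build_model_and_dataset(args.model_path, args.dataset_path)

    trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
    trainer.train()


if __name__ == "__main__":
    main()
```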
For another example implemented with native MindSpore, see examples/transformers/llama/finetune_in_native_mindspore.py for more detail. A sketch of the native training step it relies on follows the launch command below.
```shell
msrun --bind_core=True --worker_num=8 --local_worker_num=8 --master_port=9000 --log_dir=outputs/parallel_logs \
    python finetune_in_native_mindspore.py \
    --model_path meta-llama/Meta-Llama-3-8B \
    --dataset_path Yelp/yelp_review_full \
    --bf16 \
    --zero_stage 2 \
    --is_distribute True
```
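To make the native route more concrete, here is a minimal sketch of the pieces such a script typically wires up by hand: communication init for distributed training, a data-parallel gradient reducer, bfloat16 compute, and a jit-compiled train step. It is not the contents of finetune_in_native_mindspore.py: the single Dense layer and random batch stand in for the Llama model and Yelp dataset, and ZeRO optimizer-state sharding is omitted.

```python
# Minimal sketch of a native-MindSpore distributed train step; the Dense layer and
# random batch are stand-ins for the Llama model and Yelp dataset, and ZeRO
# optimizer-state sharding is not shown. Run under msrun as in the command above.
import numpy as np
import mindspore as ms
from mindspore import Tensor, nn
from mindspore.communication import get_group_size, init
from mindspore.context import ParallelMode

# msrun starts one process per NPU; init() joins them into a communication group.
init()
ms.set_auto_parallel_context(
    parallel_mode=ParallelMode.DATA_PARALLEL,
    gradients_mean=True,
    device_num=get_group_size(),
)

# Stand-in network, loss, and optimizer.
net = nn.Dense(128, 5)
loss_fn = nn.CrossEntropyLoss()
optimizer = nn.AdamWeightDecay(net.trainable_params(), learning_rate=1e-4)

# bf16 mixed precision: run the network in bfloat16 (requires a build/device with
# bfloat16 support) and keep the loss in float32.
net.to_float(ms.bfloat16)

# Data-parallel gradient synchronization across NPUs.
grad_reducer = nn.DistributedGradReducer(optimizer.parameters)


def forward_fn(inputs, labels):
    logits = net(inputs)
    return loss_fn(logits.astype(ms.float32), labels)


grad_fn = ms.value_and_grad(forward_fn, None, optimizer.parameters)


@ms.jit
def train_step(inputs, labels):
    loss, grads = grad_fn(inputs, labels)
    grads = grad_reducer(grads)  # all-reduce gradients across the workers
    optimizer(grads)
    return loss


# One toy step on random data standing in for a real batch.
inputs = Tensor(np.random.randn(8, 128), ms.float32)
labels = Tensor(np.random.randint(0, 5, (8,)), ms.int32)
print(train_step(inputs, labels))
```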