add offload profile docs and refine exit msg (#34)
1 parent 737a092, commit 0df2ea6
Showing 4 changed files with 57 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# Offload | ||
|
||
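As models grow larger, we can use Offload techniques to reduce GPU memory usage during training and make the most of limited resources; the freed memory allows a larger batch size, which improves overall training efficiency. | ||
ChatLearn currently supports Optimizer State Offload; offloading of more parameters will be supported in the future. | ||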
|
||
## Optimizer State Offload | ||
Users can set the model's `offload_optimizer_states` parameter (bool, default `False`) to specify whether Optimizer State Offload is enabled. | ||
If `offload_optimizer_states == True`, the Optimizer State is onloaded to the GPU before the model executes and offloaded to the CPU after the model finishes executing. | ||
|
||
In the following example, we enable Optimizer State Offload for the `ppo_policy` model. | ||
|
||
```yaml | ||
ppo_policy: | ||
model_config_file: ppo_policy.yaml | ||
num_device: 8 | ||
trainable: True | ||
offload_optimizer_states: True | ||
``` | ||
For a complete example, see the [llama2 configuration](../../../examples/megatron/step3_rlhf/configs/llama2/rlhf.yaml). |
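
To make the onload/offload behavior described above concrete, here is a minimal Python sketch of the pattern. It is not ChatLearn's implementation: the helper names `move_optimizer_states` and `optimizer_states_onloaded` are hypothetical, and the code only assumes an optimizer object whose `state` attribute maps each parameter to a dict of state values, which is the layout used by `torch.optim.Optimizer`. The `FakeTensor` and `FakeOptimizer` stand-ins exist solely so the sketch runs without PyTorch or a GPU.

```python
from contextlib import contextmanager


def move_optimizer_states(optimizer, device):
    """Move every per-parameter state value that supports `.to()` to `device`."""
    for param_state in optimizer.state.values():
        for name, value in param_state.items():
            # Tensor-like values are moved; plain ints/floats are left alone.
            if hasattr(value, "to"):
                param_state[name] = value.to(device)


@contextmanager
def optimizer_states_onloaded(optimizer, enabled, gpu="cuda", cpu="cpu"):
    """Onload states before the wrapped step and offload them afterwards."""
    if enabled:
        move_optimizer_states(optimizer, gpu)
    try:
        yield optimizer
    finally:
        # Offload even if the wrapped step raises, so GPU memory is released.
        if enabled:
            move_optimizer_states(optimizer, cpu)


# Stand-ins so the sketch runs without PyTorch or a GPU (illustration only).
class FakeTensor:
    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


class FakeOptimizer:
    def __init__(self):
        self.state = {"w": {"exp_avg": FakeTensor(), "step": 0}}


opt = FakeOptimizer()
with optimizer_states_onloaded(opt, enabled=True):
    device_during_step = opt.state["w"]["exp_avg"].device
device_after_step = opt.state["w"]["exp_avg"].device
```

In this sketch the `enabled` flag plays the role of `offload_optimizer_states`: when it is false the context manager does nothing, so the states stay wherever they already are.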
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# Profile | ||
|
||
ChatLearn provides two ways to profile: | ||
1. torch profiler | ||
2. nsys | ||
|
||
Note: for large models, the profiling output can be very large; we recommend reducing the model size when profiling. | ||
|
||
## Torch Profiler | ||
|
||
Users can enable the Torch profiler by setting `profiler_dir: path_to_profile_dir` in the rlhf configuration of the system's main configuration file. | ||
|
||
```yaml | ||
profiler_dir: path_to_profile_dir | ||
``` | ||
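
As a rough illustration of what a setting like `profiler_dir` typically controls, the Python sketch below enables `torch.profiler` only when a trace directory is given. This is an assumption about the general pattern, not ChatLearn's actual code: `make_profiler` is a hypothetical helper, and PyTorch is imported lazily so the disabled path runs without it.

```python
from contextlib import nullcontext


def make_profiler(profiler_dir=None):
    """Return a torch profiler writing to `profiler_dir`, or a no-op context."""
    if not profiler_dir:
        # Profiling disabled: `with make_profiler(None) as prof` yields None.
        return nullcontext()
    # Imported lazily so the disabled path has no PyTorch dependency.
    import torch.profiler as tp

    return tp.profile(
        # Write traces in a TensorBoard-readable format into profiler_dir.
        on_trace_ready=tp.tensorboard_trace_handler(profiler_dir),
        record_shapes=True,
    )


# Disabled path: no directory configured, so nothing is profiled.
with make_profiler(None) as prof:
    profiling_enabled = prof is not None
```

With a real directory, the same `with make_profiler("path_to_profile_dir"):` block would wrap the code to be profiled, and the trace files should appear in that directory once the block exits.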
## nsys | ||
Users can enable the nsys profiler by setting `nsys: True` in the rlhf configuration of the system's main configuration file. | ||
|
||
```yaml | ||
rlhf: | ||
nsys: True | ||
``` | ||
|
||
When launching the program, prefix the launch command with the nsys arguments; see the following command for reference: | ||
|
||
```bash | ||
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none --capture-range=cudaProfilerApi --capture-range-end=stop-shutdown --cudabacktrace=true -x true --force-overwrite true -o my_profile \ | ||
python train_rlhf.py XXX | ||
``` | ||
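
Because the command above passes `--capture-range=cudaProfilerApi`, nsys records only between the program's calls to the CUDA profiler start and stop APIs. The Python sketch below shows one way a training script could delimit that region with `torch.cuda.profiler.start()` and `torch.cuda.profiler.stop()`, plus an NVTX range for labeling. The `nsys_capture` helper and its `label` argument are hypothetical, and this is not necessarily how ChatLearn wires up its `nsys: True` option.

```python
from contextlib import contextmanager


@contextmanager
def nsys_capture(enabled, label="train_step"):
    """Mark the region nsys should record when run with cudaProfilerApi capture."""
    if not enabled:
        yield
        return
    # Imported lazily so the disabled path has no PyTorch dependency.
    import torch

    torch.cuda.profiler.start()          # nsys capture begins here
    torch.cuda.nvtx.range_push(label)    # named range shown on the timeline
    try:
        yield
    finally:
        torch.cuda.nvtx.range_pop()
        torch.cuda.profiler.stop()       # nsys capture ends here


# Disabled path: the wrapped code runs unchanged and nothing is captured.
steps_run = 0
with nsys_capture(enabled=False):
    steps_run += 1
```

Keeping the captured region to a few training steps is one way to follow the earlier advice about keeping profile output small.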
|