Commit: add offload profile docs and refine exit msg (#34)
SeaOfOcean authored Dec 21, 2023
1 parent 737a092 commit 0df2ea6
Showing 4 changed files with 57 additions and 2 deletions.
4 changes: 2 additions & 2 deletions chatlearn/launcher/dlc_utils.py
@@ -210,8 +210,8 @@ def __init__(self):
     def stop(self):
         self.quit_event.set()
         self.log_monitor_thread.join(2)
-        _, msg = execute("ray stop")
-        logger.info(msg)
+        ray.shutdown()
+        logger.info("Execute ray.shutdown before the program exits. Done ...")

     def start_exit_listener(self):
         atexit.register(self.stop)
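The hunk above replaces shelling out to `ray stop` with an in-process `ray.shutdown()` call triggered from an `atexit`-registered `stop()`. A minimal, framework-free sketch of that exit-listener pattern, with a `shutdown_called` flag standing in for the real Ray call (hypothetical names, not ChatLearn code):

```python
import atexit
import threading

class ExitListener:
    """Sketch of an atexit-driven cleanup mirroring the stop() pattern above."""

    def __init__(self):
        self.quit_event = threading.Event()
        self.shutdown_called = False
        # Daemon thread that idles until quit_event is set, so a forgotten
        # stop() cannot block interpreter exit.
        self.log_monitor_thread = threading.Thread(
            target=self.quit_event.wait, daemon=True
        )
        self.log_monitor_thread.start()

    def stop(self):
        self.quit_event.set()            # unblock the monitor thread
        self.log_monitor_thread.join(2)  # wait up to 2s for it to finish
        self.shutdown_called = True      # stands in for ray.shutdown()

    def start_exit_listener(self):
        atexit.register(self.stop)       # run stop() at interpreter exit

listener = ExitListener()
listener.start_exit_listener()  # registers stop() with atexit
listener.stop()                 # safe to call early; atexit may call it again
```

Calling `ray.shutdown()` in-process avoids spawning a `ray stop` subprocess and lets the driver tear down its own Ray session cleanly.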
2 changes: 2 additions & 0 deletions docs/zh/index.rst
@@ -28,6 +28,8 @@ ChatLearn Documentation
 tutorial/tutorial_bloom
 tutorial/continue_train
 tutorial/custom_model_flow
+tutorial/offload
+tutorial/profile

|
|
21 changes: 21 additions & 0 deletions docs/zh/tutorial/offload.md
@@ -0,0 +1,21 @@
# Offload

As models grow larger, offloading can reduce GPU memory usage during training, making the most of limited resources: the freed memory allows a larger batch size and thus better overall training efficiency.
ChatLearn currently supports Optimizer State Offload; offloading of more parameters will be supported in the future.

## Optimizer State Offload
Users can set a model's `offload_optimizer_states` parameter (bool, default False) to control whether Optimizer State Offload is enabled.
If `offload_optimizer_states == True`, the optimizer state is onloaded to the GPU before the model executes and offloaded back to the CPU once execution completes.

In the following example, we enable Optimizer State Offload for the `ppo_policy` model.

```yaml
ppo_policy:
model_config_file: ppo_policy.yaml
num_device: 8
trainable: True
offload_optimizer_states: True
```

For a complete example, see the [llama2 configuration](../../../examples/megatron/step3_rlhf/configs/llama2/rlhf.yaml).
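Conceptually, the scheduler onloads a trainable model's optimizer state right before it runs and offloads it right after. A schematic, framework-free sketch of that lifecycle, with a `state_device` string standing in for actual tensor transfers (hypothetical names, not ChatLearn's implementation):

```python
class OptimizerStateOffloader:
    """Schematic of the onload/run/offload cycle gated by offload_optimizer_states."""

    def __init__(self, offload_optimizer_states: bool = False):
        self.offload_optimizer_states = offload_optimizer_states
        # Pretend the optimizer state starts resident on the CPU.
        self.state_device = "cpu"

    def onload(self):
        self.state_device = "gpu"   # copy optimizer state host -> device

    def offload(self):
        self.state_device = "cpu"   # copy optimizer state device -> host

    def train_step(self):
        if self.offload_optimizer_states:
            self.onload()           # bring state to GPU before the model executes
        # ... forward/backward pass and the optimizer update would run here ...
        if self.offload_optimizer_states:
            self.offload()          # free GPU memory once the step is done

trainer = OptimizerStateOffloader(offload_optimizer_states=True)
trainer.train_step()
print(trainer.state_device)  # "cpu": state occupies the GPU only while the step runs
```

The GPU memory freed between steps is what makes room for a larger batch size.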
32 changes: 32 additions & 0 deletions docs/zh/tutorial/profile.md
@@ -0,0 +1,32 @@
# Profile

ChatLearn provides two ways to profile:
1. torch profiler
2. nsys

Note: for large models the profiling results can be very large, so we recommend reducing the model size when profiling.

## Torch Profiler

Users can enable the Torch profiler by setting `profiler_dir: path_to_profile_dir` in the rlhf section of the system's main configuration file.

```yaml
rlhf:
  profiler_dir: path_to_profile_dir
```

## nsys

Users can enable nsys profiling by setting `nsys: True` in the rlhf section of the system's main configuration file.

```yaml
rlhf:
nsys: True
```

When launching the program, prepend the nsys launch options to the execution command, for example:

```bash
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none --capture-range=cudaProfilerApi --capture-range-end=stop-shutdown --cudabacktrace=true -x true --force-overwrite true -o my_profile \
python train_rlhf.py XXX
```
