Commit: add offload profile docs and refine exit msg (#34)
SeaOfOcean authored Dec 21, 2023
1 parent 737a092 commit 0df2ea6
Showing 4 changed files with 57 additions and 2 deletions.
4 changes: 2 additions & 2 deletions chatlearn/launcher/dlc_utils.py
@@ -210,8 +210,8 @@ def __init__(self):
     def stop(self):
         self.quit_event.set()
         self.log_monitor_thread.join(2)
-        _, msg = execute("ray stop")
-        logger.info(msg)
+        ray.shutdown()
+        logger.info("Execute ray.shutdown before the program exits. Done ...")

     def start_exit_listener(self):
         atexit.register(self.stop)
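The hunk above replaces shelling out to `ray stop` with an in-process `ray.shutdown()` call triggered from an `atexit`-registered `stop()`. A minimal, framework-free sketch of that exit-listener pattern, with a `shutdown_called` flag standing in for the real Ray call (hypothetical names, not ChatLearn code):

```python
import atexit
import threading

class ExitListener:
    """Sketch of an atexit-driven cleanup mirroring the stop() pattern above."""

    def __init__(self):
        self.quit_event = threading.Event()
        self.shutdown_called = False
        # Daemon thread that idles until quit_event is set, so a forgotten
        # stop() cannot block interpreter exit.
        self.log_monitor_thread = threading.Thread(
            target=self.quit_event.wait, daemon=True
        )
        self.log_monitor_thread.start()

    def stop(self):
        self.quit_event.set()            # unblock the monitor thread
        self.log_monitor_thread.join(2)  # wait up to 2s for it to finish
        self.shutdown_called = True      # stands in for ray.shutdown()

    def start_exit_listener(self):
        atexit.register(self.stop)       # run stop() at interpreter exit

listener = ExitListener()
listener.start_exit_listener()  # registers stop() with atexit
listener.stop()                 # safe to call early; atexit may call it again
```

Calling `ray.shutdown()` in-process avoids spawning a `ray stop` subprocess and lets the driver tear down its own Ray session cleanly.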
2 changes: 2 additions & 0 deletions docs/zh/index.rst
@@ -28,6 +28,8 @@ ChatLearn Documentation
 tutorial/tutorial_bloom
 tutorial/continue_train
 tutorial/custom_model_flow
+tutorial/offload
+tutorial/profile

|
|
21 changes: 21 additions & 0 deletions docs/zh/tutorial/offload.md
@@ -0,0 +1,21 @@
# Offload

As models grow larger, offloading can reduce GPU memory usage during training, making the most of limited resources: the freed memory allows a larger batch size and thus better overall training efficiency.
ChatLearn currently supports Optimizer State Offload; offloading of more parameters will be supported in the future.

## Optimizer State Offload
Users can set a model's `offload_optimizer_states` parameter (bool, default False) to control whether Optimizer State Offload is enabled.
If `offload_optimizer_states == True`, the optimizer state is onloaded to the GPU before the model executes and offloaded back to the CPU once execution completes.

In the following example, we enable Optimizer State Offload for the `ppo_policy` model.

```yaml
ppo_policy:
model_config_file: ppo_policy.yaml
num_device: 8
trainable: True
offload_optimizer_states: True
```

For a complete example, see the [llama2 configuration](../../../examples/megatron/step3_rlhf/configs/llama2/rlhf.yaml).
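Conceptually, the scheduler onloads a trainable model's optimizer state right before it runs and offloads it right after. A schematic, framework-free sketch of that lifecycle, with a `state_device` string standing in for actual tensor transfers (hypothetical names, not ChatLearn's implementation):

```python
class OptimizerStateOffloader:
    """Schematic of the onload/run/offload cycle gated by offload_optimizer_states."""

    def __init__(self, offload_optimizer_states: bool = False):
        self.offload_optimizer_states = offload_optimizer_states
        # Pretend the optimizer state starts resident on the CPU.
        self.state_device = "cpu"

    def onload(self):
        self.state_device = "gpu"   # copy optimizer state host -> device

    def offload(self):
        self.state_device = "cpu"   # copy optimizer state device -> host

    def train_step(self):
        if self.offload_optimizer_states:
            self.onload()           # bring state to GPU before the model executes
        # ... forward/backward pass and the optimizer update would run here ...
        if self.offload_optimizer_states:
            self.offload()          # free GPU memory once the step is done

trainer = OptimizerStateOffloader(offload_optimizer_states=True)
trainer.train_step()
print(trainer.state_device)  # "cpu": state occupies the GPU only while the step runs
```

The GPU memory freed between steps is what makes room for a larger batch size.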
32 changes: 32 additions & 0 deletions docs/zh/tutorial/profile.md
@@ -0,0 +1,32 @@
# Profile

ChatLearn provides two ways to profile:
1. torch profiler
2. nsys

Note: for large models the profiling results can be very large, so we recommend reducing the model size when profiling.

## Torch Profiler

Users can enable the Torch profiler by setting `profiler_dir: path_to_profile_dir` in the rlhf section of the system's main configuration file.

```yaml
rlhf:
  profiler_dir: path_to_profile_dir
```

## nsys

Users can enable nsys profiling by setting `nsys: True` in the rlhf section of the system's main configuration file.

```yaml
rlhf:
nsys: True
```

When launching the program, prepend the nsys launch options to the execution command, for example:

```bash
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s none --capture-range=cudaProfilerApi --capture-range-end=stop-shutdown --cudabacktrace=true -x true --force-overwrite true -o my_profile \
python train_rlhf.py XXX
```
