Commit: Added documentation
Showing 2 changed files with 209 additions and 11 deletions.
# Fine-tuning Qwen2-0.5B-Instruct as a WeChat AI

This tutorial comes with a [notebook](./train.ipynb) file to make it easier to follow along.
## Model download

Use the `snapshot_download` function from `modelscope` to download the model: the first argument is the model name, and the `cache_dir` parameter is the local download path.
```python
import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer
import os

# Download Qwen2-0.5B-Instruct into /root/autodl-tmp
model_dir = snapshot_download('qwen/Qwen2-0.5B-Instruct', cache_dir='/root/autodl-tmp', revision='master')
```
## Environment setup

With the basic environment configured and the model downloaded locally, you still need a few third-party libraries, which you can install with the following commands:
```bash
python -m pip install --upgrade pip
# Switch to a domestic PyPI mirror to speed up installation
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

pip install modelscope==1.9.5
pip install "transformers>=4.39.0"
pip install streamlit==1.24.0
pip install sentencepiece==0.1.99
pip install accelerate==0.27
pip install transformers_stream_generator==0.0.4
pip install datasets==2.18.0
pip install peft==0.10.0
```
Fine-tuning an LLM generally refers to instruction tuning. Instruction tuning means the fine-tuning data we use looks like this:
```json
{
    "instruction": "以下是你的好友在和你聊天,你需要和他聊天",
    "input": "吃了吗?",
    "output": "还在食堂"
}
```
Here, `instruction` is the user instruction that tells the model which task to perform; `input` is the user input required to carry out that instruction; and `output` is the response the model is expected to produce.
## Data formatting

Data for `LoRA` training has to be formatted and tokenized before it is fed to the model. If you are familiar with the `PyTorch` training loop, you will know that we usually encode the input text into `input_ids` and the output text into `labels`, and that the encoded results are vectors. We first define a preprocessing function that, for each sample, encodes its input and output text and returns a dictionary of encodings:
```python
def process_func(example):
    MAX_LENGTH = 384    # the tokenizer can split one Chinese character into several tokens, so leave headroom in the max length to keep samples intact
    input_ids, attention_mask, labels = [], [], []
    # Build the prompt in Qwen2 chat format; the system prompt mirrors the chat instruction used in the training data
    instruction = tokenizer(f"<|im_start|>system\n以下是你的好友在和你聊天,你需要和他聊天<|im_end|>\n<|im_start|>user\n{example['instruction'] + example['input']}<|im_end|>\n<|im_start|>assistant\n", add_special_tokens=False)  # add_special_tokens=False: do not prepend special tokens
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]  # the eos token also needs to be attended to, so append a 1
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]
    if len(input_ids) > MAX_LENGTH:  # truncate
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels
    }
```
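The write-up does not show how this function is applied to the raw data. Here is a minimal sketch, assuming the samples live in a local `train.json` file with the fields shown earlier (the path is an assumption); it produces the `tokenized_id` dataset used by the `Trainer` below:

```python
import pandas as pd
from datasets import Dataset

# Load the raw samples (path and format are assumptions based on the sample above)
df = pd.read_json('./train.json')
ds = Dataset.from_pandas(df)

# Tokenize every sample and drop the original text columns
tokenized_id = ds.map(process_func, remove_columns=ds.column_names)
```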
`Qwen2` uses the following `Prompt Template` format:
```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
你是谁?<|im_end|>
<|im_start|>assistant
我是一个有用的助手。<|im_end|>
```
## Load the tokenizer and half-precision model

The model is loaded in half precision; if your GPU is fairly recent, you can load it in `torch.bfloat16`. For custom models you must set the `trust_remote_code` parameter to `True`.
```python
from transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('./qwen2-0.5b/qwen/Qwen2-0___5B-Instruct/', use_fast=False, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained('./qwen2-0.5b/qwen/Qwen2-0___5B-Instruct/', device_map="auto", torch_dtype=torch.bfloat16)
```
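As a quick sanity check (not part of the original write-up), you can render the template from the previous section with the tokenizer you just loaded:

```python
# Render the Qwen2 chat template for a toy conversation
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "你是谁?"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# The printed text should match the <|im_start|>...<|im_end|> template shown above
```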
## Define the LoraConfig

The `LoraConfig` class exposes many parameters, but only a few of them matter in practice. Here is a quick rundown; if you are curious, you can read the source directly.
- `task_type`: the model/task type
- `target_modules`: the names of the model layers to train, mainly the `attention` layers; the layer names differ between models, and you can pass a list, a string, or a regular expression.
- `r`: the rank of `LoRA`; see the `LoRA` paper for details
- `lora_alpha`: the `LoRA alpha`; see the `LoRA` paper for what it does

So what is the `LoRA` scaling factor? It is not `r` (the rank): the scaling is `lora_alpha / r`, which comes out to 4 for the `LoraConfig` below.
```python
from peft import LoraConfig, TaskType

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,  # training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha; see the LoRA paper for details
    lora_dropout=0.1  # dropout rate
)
```
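The original text does not show the step that attaches this config to the model; a minimal sketch of the usual `peft` calls:

```python
from peft import get_peft_model

model.enable_input_require_grads()  # required because gradient_checkpointing is enabled below
model = get_peft_model(model, config)  # wrap the base model with LoRA adapters
model.print_trainable_parameters()  # prints how few parameters LoRA actually trains
```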
## Customize the TrainingArguments parameters

The source of the `TrainingArguments` class documents what each parameter does, and you are welcome to explore it yourself; here are just a few of the common ones:
- `output_dir`: the output path for the model
- `per_device_train_batch_size`: as the name suggests, the `batch_size`
- `gradient_accumulation_steps`: gradient accumulation; if your GPU memory is small, set `batch_size` lower and increase the accumulation steps.
- `logging_steps`: how many steps between `log` outputs
- `num_train_epochs`: as the name suggests, the number of `epochs`
- `gradient_checkpointing`: gradient checkpointing; once this is enabled, the model must call `model.enable_input_require_grads()` (as in the sketch above). You can explore why on your own; we won't go into it here.
```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=10,
    num_train_epochs=3,
    save_steps=100,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True
)
```
## Train with the Trainer
```python
from transformers import Trainer, DataCollatorForSeq2Seq

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_id,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
trainer.train()
```
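With `save_steps=100` above, LoRA adapter checkpoints land under `./output/checkpoint-*` during training. If you also want to save the final adapter explicitly, a small sketch (the target path is just an example):

```python
# Save the final LoRA adapter and tokenizer (path is illustrative)
trainer.model.save_pretrained("./output/final_adapter")
tokenizer.save_pretrained("./output/final_adapter")
```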
## Load the LoRA weights for inference

Once training is done, you can load the `LoRA` weights and run inference like this:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from peft import PeftModel

model_path = './qwen2-0.5b/qwen/Qwen2-0___5B-Instruct/'
lora_path = 'lora_path'  # path to your trained LoRA checkpoint, e.g. a checkpoint folder under ./output

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the base model
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16)

# Load the LoRA weights (the adapter config is read from lora_path)
model = PeftModel.from_pretrained(model, model_id=lora_path)

prompt = "你是谁?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

model_inputs = tokenizer([text], return_tensors="pt").to('cuda')

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```
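For deployment (for example to the ModelScope Studio described in the next file), it can be convenient to merge the adapter into the base weights so the model loads with plain `transformers` and no `peft` dependency; a minimal sketch, with an illustrative output path:

```python
# Merge the LoRA weights into the base model and save a standalone copy
merged_model = model.merge_and_unload()  # `model` is the PeftModel loaded above
merged_model.save_pretrained('./qwen2-0.5b-merged')  # output directory is just an example
tokenizer.save_pretrained('./qwen2-0.5b-merged')
```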
***
**Given that** this repo's original training model, `Chatllm3-6b`, is difficult to deploy on low-spec machines, I instead used the tiny `Qwen2-0.5b-Instruct` model (still a good fit for a WeChat chat AI) and took it from training all the way to deployment in a free ModelScope Studio. The process is fairly simple, `and can be done entirely for free`. The workflow is as follows:
***
# Step 1: [create a free ModelScope GPU instance](https://www.modelscope.cn/my/mynotebook/preset)
![GPU](/MemoAI/img/img3.png)

# Start training
**You can follow the training [template](/MemoAI/qwen2-0.5b/train.md)**
<br>
Upload `train.json` and just click through step by step<br>
Finally, upload the trained model to `ModelScope`
<br />
![Training](/MemoAI/img/img4.png)
# Deploy to a Studio
Edit the following two fields in `di_config.json`:
`model_space: YOUR-NAME-SPACE`
`model_name: YOUR-MODEL-NAME`

**Then upload the three files under MemoAI/qwen2-0.5b (`di_config.json`, `app.py`, `requirements.txt`) to the Studio and click deploy!**

**Finally, take a look at the finished product:** [demo](https://www.modelscope.cn/studios/sanbei101/qwen-haoran/summary)