Merge pull request #9 from airaria/v0.1.9
V0.1.9
airaria authored Apr 21, 2020
2 parents 24783d1 + dcc116c commit 9b172a4
Showing 18 changed files with 1,383 additions and 73 deletions.
66 changes: 51 additions & 15 deletions README.md
@@ -28,6 +28,22 @@ Paper: [https://arxiv.org/abs/2002.12620](https://arxiv.org/abs/2002.12620)

## Update

**Apr 22, 2020**

* Updated to 0.1.9 (added a cache option that speeds up distillation; fixed some bugs). See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.9).

* Added experimental results for distilling Electra-base to Electra-small on Chinese tasks.

**Apr 16, 2020**

* Fixed an incorrect call to `zero_grad()`.
* Added a new argument `max_grad_norm` to distillers' `train` method. It sets the strength of gradient clipping; the default is -1, i.e., no gradient clipping.
* Added new arguments `scheduler_class` and `scheduler_args` to distillers' `train` method. The old `scheduler` may cause convergence problems and is deprecated in favor of `scheduler_class` and `scheduler_args`. See the documentation for details and the sketch below.
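
As a rough illustration of the new `train` arguments (a sketch, not code from this commit — `distiller`, `optimizer`, and `dataloader` are assumed to be set up as in the usage examples, and the scheduler constructor comes from the `transformers` library):

```python
from transformers import get_linear_schedule_with_warmup  # scheduler constructor passed by class

num_epochs = 3
num_training_steps = len(dataloader) * num_epochs  # dataloader assumed to be defined

with distiller:
    distiller.train(
        optimizer=optimizer,
        dataloader=dataloader,
        num_epochs=num_epochs,
        # preferred over the deprecated `scheduler`: the distiller builds the
        # scheduler itself from the class and its keyword arguments
        scheduler_class=get_linear_schedule_with_warmup,
        scheduler_args={"num_warmup_steps": int(0.1 * num_training_steps),
                        "num_training_steps": num_training_steps},
        max_grad_norm=1.0,  # clip gradient norm to 1.0; the default -1 disables clipping
    )
```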

**Apr 7, 2020**

* Added an option `is_caching_logits` to `DistillationConfig`. If `is_caching_logits` is True, the distiller caches the batches and the output logits of the teacher model, so that those logits are only calculated once. This speeds up the distillation process. The feature is **only available** for `BasicDistiller` and `MultiTeacherDistiller`. **Be cautious about setting it to True on large datasets, since the batches and logits are stored in memory** (see the sketch below).
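
A minimal sketch of enabling the cache (the other configuration values are illustrative, not taken from this commit):

```python
from textbrewer import DistillationConfig

distill_config = DistillationConfig(
    temperature=8,
    is_caching_logits=True,  # cache batches and teacher logits in memory; avoid on large datasets
)
```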

**Mar 17, 2020**

* Added CoNLL-2003 English NER distillation example, see [examples/conll2003_example](examples/conll2003_example).
@@ -197,20 +213,32 @@ We have performed distillation experiments on several typical English and Chinese datasets.
### Models

* For English tasks, the teacher model is [**BERT-base-cased**](https://github.com/google-research/bert).
* For Chinese tasks, the teacher model is [**RoBERTa-wwm-ext**](https://github.com/ymcui/Chinese-BERT-wwm) released by the Joint Laboratory of HIT and iFLYTEK Research.
* For Chinese tasks, the teacher models are [**RoBERTa-wwm-ext**](https://github.com/ymcui/Chinese-BERT-wwm) and [**Electra-base**](https://github.com/ymcui/Chinese-ELECTRA) released by the Joint Laboratory of HIT and iFLYTEK Research.

We have tested different student models. To compare with published results, the student models are built with standard transformer blocks, except for BiGRU, which is a single-layer bidirectional GRU. The architectures are listed below (a configuration sketch follows the tables). Note that the number of parameters includes the embedding layer but does not include the output layer of each specific task.

| Model | \#Layers | Hidden_size | Feed-forward size | \#Params | Relative size |
#### English models

| Model | \#Layers | Hidden size | Feed-forward size | \#Params | Relative size |
| :--------------------- | --------- | ----------- | ----------------- | -------- | ------------- |
| BERT-base-cased (teacher) | 12 | 768 | 3072 | 108M | 100% |
| RoBERTa-wwm-ext (teacher) | 12 | 768 | 3072 | 108M | 100% |
| BERT-base-cased (teacher) | 12 | 768 | 3072 | 108M | 100% |
| T6 (student) | 6 | 768 | 3072 | 65M | 60% |
| T3 (student) | 3 | 768 | 3072 | 44M | 41% |
| T3-small (student) | 3 | 384 | 1536 | 17M | 16% |
| T4-Tiny (student) | 4 | 312 | 1200 | 14M | 13% |
| BiGRU (student) | - | 768 | - | 31M | 29% |

#### Chinese models

| Model | \#Layers | Hidden size | Feed-forward size | \#Params | Relative size |
| :--------------------- | --------- | ----------- | ----------------- | -------- | ------------- |
| RoBERTa-wwm-ext (teacher) | 12 | 768 | 3072 | 102M | 100% |
| Electra-base (teacher) | 12 | 768 | 3072 | 102M | 100% |
| T3 (student) | 3 | 768 | 3072 | 38M | 37% |
| T3-small (student) | 3 | 384 | 1536 | 14M | 14% |
| T4-Tiny (student) | 4 | 312 | 1200 | 11M | 11% |
| Electra-small (student) | 12 | 256 | 1024 | 12M | 12% |
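
Purely as an illustration of the sizes in the tables above (not code from this repository), a T4-Tiny-shaped student could be instantiated with the Hugging Face `transformers` library, which is assumed to be available:

```python
from transformers import BertConfig, BertForSequenceClassification

# Sizes mirror the T4-Tiny row: 4 layers, hidden size 312, feed-forward size 1200
t4_tiny_config = BertConfig(
    num_hidden_layers=4,
    hidden_size=312,
    intermediate_size=1200,
    num_attention_heads=12,
)
student_model = BertForSequenceClassification(t4_tiny_config)  # randomly initialized student
```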

* T6 architecture is the same as [DistilBERT<sup>[1]</sup>](https://arxiv.org/abs/1910.01108), [BERT<sub>6</sub>-PKD<sup>[2]</sup>](https://arxiv.org/abs/1908.09355), and [BERT-of-Theseus<sup>[3]</sup>](https://arxiv.org/abs/2002.02925).
* T4-tiny architecture is the same as [TinyBERT<sup>[4]</sup>](https://arxiv.org/abs/1909.10351).
* T3 architecture is the same as [BERT<sub>3</sub>-PKD<sup>[2]</sup>](https://arxiv.org/abs/1908.09355).
@@ -224,13 +252,14 @@ distill_config = DistillationConfig(temperature = 8, intermediate_matches = matches)

The `matches` configurations are different for different models:

| Model | matches |
| :-------- | ------------------------------------------------------------ |
| BiGRU | None |
| T6 | L6_hidden_mse + L6_hidden_smmd |
| T3 | L3_hidden_mse + L3_hidden_smmd |
| T3-small | L3n_hidden_mse + L3_hidden_smmd |
| T4-Tiny | L4t_hidden_mse + L4_hidden_smmd |
| Model | matches |
| :-------- | --------------------------------------------------- |
| BiGRU | None |
| T6 | L6_hidden_mse + L6_hidden_smmd |
| T3 | L3_hidden_mse + L3_hidden_smmd |
| T3-small | L3n_hidden_mse + L3_hidden_smmd |
| T4-Tiny | L4t_hidden_mse + L4_hidden_smmd |
| Electra-small | small_hidden_mse + small_hidden_smmd |

The definitions of matches are at [examples/matches/matches.py](examples/matches/matches.py).
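
For instance (a sketch assuming the `matches` dictionary from `examples/matches/matches.py` is importable and that the teacher/student models and adaptors have been defined as in the usage examples), the T4-Tiny configuration could be assembled as:

```python
from textbrewer import DistillationConfig, TrainingConfig, GeneralDistiller
from matches import matches  # examples/matches/matches.py

# T4-Tiny: combine the hidden-state MSE and SMMD match lists
intermediate_matches = matches["L4t_hidden_mse"] + matches["L4_hidden_smmd"]

distill_config = DistillationConfig(temperature=8,
                                    intermediate_matches=intermediate_matches)
train_config = TrainingConfig()

distiller = GeneralDistiller(
    train_config=train_config, distill_config=distill_config,
    model_T=teacher_model, model_S=student_model,  # assumed to be defined
    adaptor_T=adaptor_T, adaptor_S=adaptor_S)      # assumed to be defined
```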

@@ -267,7 +296,7 @@ Our results:

| Model (ours) | MNLI | SQuAD | CoNLL-2003 |
| :------------- | --------------- | ------------- | --------------- |
| **BERT-base-cased** | 83.7 / 84.0 | 81.5 / 88.6 | 91.1 |
| **BERT-base-cased** (teacher) | 83.7 / 84.0 | 81.5 / 88.6 | 91.1 |
| BiGRU | - | - | 85.3 |
| T6 | 83.5 / 84.0 | 80.8 / 88.1 | 90.7 |
| T3 | 81.8 / 82.7 | 76.4 / 84.9 | 87.5 |
@@ -297,16 +326,23 @@ The results are listed below.

| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :--------------- | ---------- | ----------- | ---------------- | ------------ |
| **RoBERTa-wwm-ext** | 79.9 | 89.4 | 68.8 / 86.4 | 86.5 / 92.5 |
| **RoBERTa-wwm-ext** (teacher) | 79.9 | 89.4 | 68.8 / 86.4 | 86.5 / 92.5 |
| T3 | 78.4 | 89.0 | 66.4 / 84.2 | 78.2 / 86.4 |
| T3-small | 76.0 | 88.1 | 58.0 / 79.3 | 65.5 / 78.6 |
| T4-tiny | 76.2 | 88.4 | 61.8 / 81.8 | 73.3 / 83.5 |

| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :--------------- | ---------- | ----------- | ---------------- | ------------ |
| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 |
| Electra-small | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 |


**Note**:

1. On CMRC2018 and DRCD, learning rates are 1.5e-4 and 7e-5 respectively and there is no learning rate decay.
2. CMRC2018 and DRCD take each other as the augmentation dataset In the experiments.
1. When distilling from RoBERTa-wwm-ext, the learning rates on CMRC 2018 and DRCD are 1.5e-4 and 7e-5 respectively, and there is no learning rate decay.
2. CMRC2018 and DRCD take each other as the augmentation dataset in the distillation.
3. The settings of training Electra-base teacher model can be found at [**Chinese-ELECTRA**](https://github.com/ymcui/Chinese-ELECTRA).
4. The Electra-small student model is initialized with the [pretrained weights](https://github.com/ymcui/Chinese-ELECTRA).

## Core Concepts

69 changes: 52 additions & 17 deletions README_ZH.md
@@ -27,6 +27,22 @@ Paper: [https://arxiv.org/abs/2002.12620](https://arxiv.org/abs/2002.12620)

## Update

**Apr 22, 2020**

* **Updated to version 0.1.9**: added a cache option that speeds up distillation and fixed several bugs. See details in [releases](https://github.com/airaria/TextBrewer/releases/tag/v0.1.9).

* Added experimental results for distilling Electra-base to Electra-small on Chinese tasks.

**Apr 16, 2020**

* Fixed an incorrect call to `zero_grad()`.
* Added a `max_grad_norm` argument to distillers' `train` method to control gradient clipping. The default is -1, i.e., no gradient clipping.
* Added `scheduler_class` and `scheduler_args` arguments to distillers' `train` method. When adjusting the learning rate, it is recommended to pass these two arguments instead of `scheduler`; using `scheduler` may cause convergence problems. See the API documentation for details.

**Apr 7, 2020**

* Added an `is_caching_logits` option to the distillation configuration (`DistillationConfig`). If `is_caching_logits` is set to True, the distiller caches every batch of the dataset and the teacher model's logits on each batch so that they are not recomputed, which speeds up distillation. **Only applicable to** `BasicDistiller` and `MultiTaskDistiller`. Since the logits and batches are stored in memory, avoid enabling this option on large datasets.

**Mar 17, 2020**

* Added example code for distillation on the CoNLL-2003 English NER task; see [examples/conll2003_example](examples/conll2003_example).
@@ -195,20 +211,32 @@ with distiller:
### Models

* For English tasks, the teacher model is [**BERT-base-cased**](https://github.com/google-research/bert).
* For Chinese tasks, the teacher model is [**RoBERTa-wwm-ext**](https://github.com/ymcui/Chinese-BERT-wwm) released by HFL.
* For Chinese tasks, the teacher models are [**RoBERTa-wwm-ext**](https://github.com/ymcui/Chinese-BERT-wwm) and [**Electra-base**](https://github.com/ymcui/Chinese-ELECTRA) released by HFL.

We have tested different student models. To compare with published results, all of them except BiGRU use the same multi-layer Transformer structure as BERT. The model parameters are listed in the tables below. Note that the parameter counts include the embedding layer but do not include the task-specific output layer.

#### English models

| Model | \#Layers | Hidden size | Feed-forward size | \#Params | Relative size |
| :--------------------- | --------- | ----------- | ----------------- | -------- | ------------- |
| BERT-base-cased (teacher) | 12 | 768 | 3072 | 108M | 100% |
| RoBERTa-wwm-ext (teacher) | 12 | 768 | 3072 | 108M | 100% |
| T6 (student) | 6 | 768 | 3072 | 65M | 60% |
| T3 (student) | 3 | 768 | 3072 | 44M | 41% |
| T6 (student) | 6 | 768 | 3072 | 65M | 60% |
| T3 (student) | 3 | 768 | 3072 | 44M | 41% |
| T3-small (student) | 3 | 384 | 1536 | 17M | 16% |
| T4-Tiny (student) | 4 | 312 | 1200 | 14M | 13% |
| T4-Tiny (student) | 4 | 312 | 1200 | 14M | 13% |
| BiGRU (student) | - | 768 | - | 31M | 29% |

#### Chinese models

| Model | \#Layers | Hidden size | Feed-forward size | \#Params | Relative size |
| :--------------------- | --------- | ----------- | ----------------- | -------- | ------------- |
| RoBERTa-wwm-ext (teacher) | 12 | 768 | 3072 | 102M | 100% |
| Electra-base (teacher) | 12 | 768 | 3072 | 102M | 100% |
| T3 (student) | 3 | 768 | 3072 | 38M | 37% |
| T3-small (student) | 3 | 384 | 1536 | 14M | 14% |
| T4-Tiny (student) | 4 | 312 | 1200 | 11M | 11% |
| Electra-small (student) | 12 | 256 | 1024 | 12M | 12% |

* The T6 architecture is the same as [DistilBERT<sup>[1]</sup>](https://arxiv.org/abs/1910.01108), [BERT<sub>6</sub>-PKD<sup>[2]</sup>](https://arxiv.org/abs/1908.09355), and [BERT-of-Theseus<sup>[3]</sup>](https://arxiv.org/abs/2002.02925).
* The T4-tiny architecture is the same as [TinyBERT<sup>[4]</sup>](https://arxiv.org/abs/1909.10351).
* The T3 architecture is the same as [BERT<sub>3</sub>-PKD<sup>[2]</sup>](https://arxiv.org/abs/1908.09355).
@@ -222,13 +250,14 @@ distill_config = DistillationConfig(temperature = 8, intermediate_matches = matches)

We use the following `matches` configurations for the different models:

| Model | matches |
| :-------- | ------------------------------------------------------------ |
| BiGRU | None |
| T6 | L6_hidden_mse + L6_hidden_smmd |
| T3 | L3_hidden_mse + L3_hidden_smmd |
| T3-small | L3n_hidden_mse + L3_hidden_smmd |
| T4-Tiny | L4t_hidden_mse + L4_hidden_smmd |
| Model | matches |
| :-------- | --------------------------------------------------- |
| BiGRU | None |
| T6 | L6_hidden_mse + L6_hidden_smmd |
| T3 | L3_hidden_mse + L3_hidden_smmd |
| T3-small | L3n_hidden_mse + L3_hidden_smmd |
| T4-Tiny | L4t_hidden_mse + L4_hidden_smmd |
| Electra-small | small_hidden_mse + small_hidden_smmd |

The definitions of the various matches are in [examples/matches/matches.py](examples/matches/matches.py). All distillation uses `GeneralDistiller`.

@@ -262,7 +291,7 @@ Our results:

| Model (ours) | MNLI | SQuAD | CoNLL-2003 |
| :------------- | --------------- | ------------- | --------------- |
| **BERT-base-cased** | 83.7 / 84.0 | 81.5 / 88.6 | 91.1 |
| **BERT-base-cased** (teacher) | 83.7 / 84.0 | 81.5 / 88.6 | 91.1 |
| BiGRU | - | - | 85.3 |
| T6 | 83.5 / 84.0 | 80.8 / 88.1 | 90.7 |
| T3 | 81.8 / 82.7 | 76.4 / 84.9 | 87.5 |
@@ -290,16 +319,22 @@ Our results:

| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :--------------- | ---------- | ----------- | ---------------- | ------------ |
| **RoBERTa-wwm-ext** | 79.9 | 89.4 | 68.8 / 86.4 | 86.5 / 92.5 |
| **RoBERTa-wwm-ext** (teacher) | 79.9 | 89.4 | 68.8 / 86.4 | 86.5 / 92.5 |
| T3 | 78.4 | 89.0 | 66.4 / 84.2 | 78.2 / 86.4 |
| T3-small | 76.0 | 88.1 | 58.0 / 79.3 | 65.5 / 78.6 |
| T4-tiny | 76.2 | 88.4 | 61.8 / 81.8 | 73.3 / 83.5 |

Notes:
| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :--------------- | ---------- | ----------- | ---------------- | ------------ |
| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 |
| Electra-small | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 |

1. When distilling on CMRC 2018 and DRCD, the learning rates are 1.5e-4 and 7e-5 respectively, and no learning rate decay is used.
2. CMRC 2018 and DRCD serve as each other's augmentation data.
Notes:

1. When distilling from the RoBERTa-wwm-ext teacher on CMRC 2018 and DRCD, the learning rates are 1.5e-4 and 7e-5 respectively, and no learning rate decay is used.
2. During distillation, CMRC 2018 and DRCD serve as each other's augmentation data.
3. The training settings of the Electra-base teacher model follow [**Chinese-ELECTRA**](https://github.com/ymcui/Chinese-ELECTRA).
4. The Electra-small student model is initialized with the [pretrained weights](https://github.com/ymcui/Chinese-ELECTRA).


## Core Concepts
25 changes: 22 additions & 3 deletions docs/source/ExperimentResults.md
@@ -1,7 +1,7 @@
# Experimental Results


## Results on English Datasets
## English Datasets

### MNLI

@@ -91,7 +91,7 @@

**Note**: HotpotQA is used for data augmentation on CoNLL-2003.

## Results on Chinese Datasets
## Chinese Datasets (RoBERTa-wwm-ext as the teacher)

### XNLI

@@ -123,4 +123,23 @@
| T4-tiny (student) | - | - |
| &nbsp;&nbsp;+ DA | 61.8 / 81.8 | 73.3 / 83.5 |

**Note**: CMRC2018 and DRCD take each other as the augmentation dataset In the experiments.
**Note**: CMRC2018 and DRCD take each other as the augmentation dataset in the experiments.

## Chinese Datasets (Electra-base as the teacher)

* Training without Distillation:

| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :------------------------------------|------------|-------------| ----------------| -------------|
| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 |
| Electra-small (initialized with pretrained weights) | 72.5 | 86.3 | 62.9 / 80.2 | 79.4 / 86.4 |

* Single-teacher distillation with `GeneralDistiller`:

| Model | XNLI | LCQMC | CMRC 2018 | DRCD |
| :------------------------------------|------------|-------------|-----------------| -------------|
| **Electra-base** (teacher) | 77.8 | 89.8 | 65.6 / 84.7 | 86.9 / 92.3 |
| Electra-small (randomly initialized) | 77.2 | 89.0 | 66.5 / 84.9 | 84.8 / 91.0 |
| Electra-small (initialized with pretrained weights) | 77.7 | 89.3 | 66.5 / 84.9 | 85.5 / 91.3 |

**Note**: A good initialization of the student (Electra-small) improves the performance.
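
As a hedged illustration of such an initialization (the checkpoint name below is an assumption for illustration, not taken from this commit), the student could be loaded from released weights with the `transformers` library:

```python
from transformers import ElectraForSequenceClassification

# Hypothetical checkpoint identifier; substitute the actual pretrained Electra-small weights
student_model = ElectraForSequenceClassification.from_pretrained(
    "hfl/chinese-electra-small-discriminator",
    num_labels=3)  # e.g. 3 labels for XNLI
```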
