KLUE RoBERTa special_token_id issue

Issue Description

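In the first release, the special_token_id values of klue/roberta differed from the ones used when klue/roberta was pretrained, so every special_token_id was mismatched during fine-tuning.

Related discussion: https://github.com/KLUE-benchmark/KLUE/issues/17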

Why did it happen?

1. Unless the `fairseq` code is told otherwise, both the `special_token` strings and their `special_token_id` values are fixed:

bos: 0
pad: 1
eos: 2
unk: 3
mask: 4

2. The initial `klue/roberta tokenizer` was based on `BertTokenizer` and used the `bert` `special_token_index` values unchanged (see below):

pad: 0
unk: 1
cls: 2
sep: 3
mask: 4
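The mismatch between the two layouts above can be checked mechanically. The following is a minimal sketch, not the project's actual tooling: the dictionary names and the `find_mismatches` helper are hypothetical, and it assumes `cls` plays the role of `bos` and `sep` the role of `eos`, as described in the resolution later on this page.

```python
# Special-token ids as listed above.
FAIRSEQ_IDS = {"bos": 0, "pad": 1, "eos": 2, "unk": 3, "mask": 4}
BERT_IDS = {"pad": 0, "unk": 1, "cls": 2, "sep": 3, "mask": 4}

# Assumption: cls corresponds to bos and sep corresponds to eos.
ROLE_ALIASES = {"cls": "bos", "sep": "eos"}


def find_mismatches(pretrain_ids, tokenizer_ids):
    """Return {token: (tokenizer_id, pretraining_id)} for tokens whose ids differ."""
    mismatches = {}
    for name, tokenizer_id in tokenizer_ids.items():
        role = ROLE_ALIASES.get(name, name)
        pretrain_id = pretrain_ids[role]
        if pretrain_id != tokenizer_id:
            mismatches[name] = (tokenizer_id, pretrain_id)
    return mismatches


mismatches = find_mismatches(FAIRSEQ_IDS, BERT_IDS)
for name, (tokenizer_id, pretrain_id) in mismatches.items():
    print(f"{name}: tokenizer id {tokenizer_id}, pretraining id {pretrain_id}")
```

Under these assumptions only `mask` keeps the same id in both layouts; the other four special tokens point at the wrong embedding rows.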

How we resolved it

1. Reorder `vocab.txt` to match the `special_token` order used during pretraining

pad: 0 -> 1
unk: 1 -> 3
cls: 2 -> 0
sep: 3 -> 2
mask: 4 -> 4
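The reordering above could be sketched as follows. This is an illustration only: the `reorder_special_tokens` helper and the placeholder entries `token_a` and `token_b` are hypothetical, and the sketch assumes the five special tokens occupy the first five lines of `vocab.txt` while every other token keeps its position.

```python
def reorder_special_tokens(vocab):
    """Return a copy of vocab with the special tokens moved to pretraining order."""
    # Target ids taken from the mapping above (cls -> 0, pad -> 1, sep -> 2, ...).
    target = {"[CLS]": 0, "[PAD]": 1, "[SEP]": 2, "[UNK]": 3, "[MASK]": 4}
    head = vocab[: len(target)]
    if set(head) != set(target):
        raise ValueError("expected the five special tokens at the top of the vocab")
    # Sort only the special tokens; the rest of the vocabulary is left untouched.
    new_head = sorted(head, key=target.__getitem__)
    return new_head + vocab[len(target):]


# Toy vocabulary in the initial BERT-style order; real files hold many more lines.
old_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "token_a", "token_b"]
new_vocab = reorder_special_tokens(old_vocab)
print(new_vocab)
```

Because a line's position in `vocab.txt` is its token id, rewriting the first five lines is enough to realign the special tokens without shifting any other entry.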

2. Map the `bos` and `eos` tokens used during pretraining in `fairseq` to `cls` and `sep`

Add bos_token and eos_token to tokenizer_config.json and special_tokens_map.json
- tokenizer_config.json

{
  "do_lower_case": false,
  "do_basic_tokenize": true,
  "never_split": null,
  "unk_token": "[UNK]",
  "sep_token": "[SEP]",
  "pad_token": "[PAD]",
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "bos_token": "[CLS]",
  "eos_token": "[SEP]",
  "tokenize_chinese_chars": true,
  "strip_accents": null,
  "model_max_length": 512
}

- special_tokens_map.json

{
  "unk_token": "[UNK]",
  "sep_token": "[SEP]",
  "pad_token": "[PAD]",
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "bos_token": "[CLS]",
  "eos_token": "[SEP]"
}
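To see that the two changes together restore the pretraining ids, the special token map can be resolved against the reordered vocabulary. The sketch below is a rough check under stated assumptions, not the maintainers' verification procedure: `vocab_head` is an assumed copy of the first five lines of the corrected `vocab.txt`, and the variable names are hypothetical.

```python
import json

# Contents of special_tokens_map.json as shown above.
special_tokens_map = json.loads("""
{
  "unk_token": "[UNK]",
  "sep_token": "[SEP]",
  "pad_token": "[PAD]",
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "bos_token": "[CLS]",
  "eos_token": "[SEP]"
}
""")

# Assumed first five lines of the reordered vocab.txt (list index = token id).
vocab_head = ["[CLS]", "[PAD]", "[SEP]", "[UNK]", "[MASK]"]
token_to_id = {token: index for index, token in enumerate(vocab_head)}

# fairseq special-token ids listed earlier on this page.
FAIRSEQ_IDS = {"bos": 0, "pad": 1, "eos": 2, "unk": 3, "mask": 4}

# Resolve each role ("bos", "cls", ...) to the id of the token it points at.
resolved_ids = {
    key.replace("_token", ""): token_to_id[token]
    for key, token in special_tokens_map.items()
}

all_match = all(
    resolved_ids[role] == expected for role, expected in FAIRSEQ_IDS.items()
)
print(resolved_ids)
print("matches pretraining ids:", all_match)
```

With the `transformers` library, a comparable check would read attributes such as `tokenizer.bos_token_id` and `tokenizer.cls_token_id` from the loaded tokenizer instead of parsing the files by hand.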
