
# Language Modeling Benchmark

Prepare the language modeling benchmark datasets. To help reproduce results from the papers, we use the pre-tokenized corpora as the training/validation/test splits. A sketch of how the prepared files might be loaded follows the commands below.

```bash
# WikiText-2
nlp_data prepare_lm --dataset wikitext2

# WikiText-103
nlp_data prepare_lm --dataset wikitext103

# enwik8
nlp_data prepare_lm --dataset enwik8

# Text8
nlp_data prepare_lm --dataset text8

# Google One Billion Words (GBW)
nlp_data prepare_lm --dataset gbw
```
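
Since the splits are pre-tokenized plain text, they can be consumed directly by a training loop. Below is a minimal sketch of one way to load such a split in Python; the path `wikitext2/train.txt` and the whitespace-tokenized file format are illustrative assumptions, not the tool's documented output layout.

```python
# A minimal sketch of loading a prepared split, assuming the command
# writes whitespace-tokenized plain text such as "wikitext2/train.txt"
# (the actual output paths may differ).
from collections import Counter

def load_tokens(path):
    """Read a whitespace-tokenized text file into a flat list of tokens."""
    with open(path, encoding="utf-8") as f:
        return f.read().split()

def build_vocab(tokens, min_freq=1):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tokens)
    itos = ["<unk>"] + [t for t, c in counts.most_common() if c >= min_freq]
    return {t: i for i, t in enumerate(itos)}

train = load_tokens("wikitext2/train.txt")  # hypothetical output path
vocab = build_vocab(train)
ids = [vocab.get(t, vocab["<unk>"]) for t in train]
print(f"{len(train)} tokens, vocab size {len(vocab)}")
```

Building the vocabulary from the training split only, and mapping unseen validation/test tokens to `<unk>`, matches the usual language modeling convention.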

Happy language modeling :)