The SQuAD datasets are distributed under the CC BY-SA 4.0 license.
Run the following commands to download SQuAD:
python3 prepare_squad.py --version 1.1 # SQuAD 1.1
python3 prepare_squad.py --version 2.0 # SQuAD 2.0
For all supported datasets, we also provide a command-line toolkit for downloading them:
nlp_data prepare_squad --version 1.1
nlp_data prepare_squad --version 2.0
The directory structure of the SQuAD dataset will be as follows, where version can be 1.1 or 2.0:
squad
├── train-v{version}.json
├── dev-v{version}.json
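As a quick sanity check after downloading, you can inspect the JSON files with the Python standard library. The snippet below is a minimal sketch: the squad/ path and the version follow the layout above, and the data/paragraphs/qas keys are the standard SQuAD schema.
import json

version = "1.1"  # or "2.0"
with open(f"squad/train-v{version}.json", encoding="utf-8") as f:
    squad = json.load(f)

# SQuAD stores articles under "data"; each article contains paragraphs,
# and each paragraph carries a context plus its question-answer pairs.
article = squad["data"][0]
paragraph = article["paragraphs"][0]
qa = paragraph["qas"][0]
print(article["title"])
print(qa["question"], qa["answers"][:1])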
Following the BSD-3-Clause license, we uploaded SearchQA to our S3 bucket and provide a link to download the processed txt files. Please check out the Google Drive link to download the raw and split files, which were collected through web search using the scraper from the GitHub repository.
Download the SearchQA dataset with the Python script or the command-line toolkit:
python3 prepare_searchqa.py
# Or download with command-line toolkits
nlp_data prepare_searchqa
The directory structure of the SearchQA dataset will be as follows:
searchqa
├── train.txt
├── val.txt
├── test.txt
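To verify the download, a minimal sketch is to stream each split and count its examples; the per-line format of the processed txt files should be checked against the original SearchQA repository, so only line counting is shown here.
# Count the number of lines (examples) in each processed split.
for split in ("train", "val", "test"):
    with open(f"searchqa/{split}.txt", encoding="utf-8") as f:
        num_lines = sum(1 for _ in f)
    print(split, num_lines)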
TriviaQA is an open-domain QA dataset. See more useful scripts in the official GitHub repository.
Run the following commands to download TriviaQA:
python3 prepare_triviaqa.py --type rc # Download TriviaQA version 1.0 for RC (2.5G)
python3 prepare_triviaqa.py --type unfiltered # Download unfiltered TriviaQA version 1.0 (604M)
# Or download with command-line toolkits
nlp_data prepare_triviaqa --type rc
nlp_data prepare_triviaqa --type unfiltered
The directory structure of the TriviaQA dataset (rc and unfiltered) will be as follows:
triviaqa
├── triviaqa-rc
│   ├── qa
│   │   ├── verified-web-dev.json
│   │   ├── web-dev.json
│   │   ├── web-train.json
│   │   ├── web-test-without-answers.json
│   │   ├── verified-wikipedia-dev.json
│   │   ├── wikipedia-test-without-answers.json
│   │   ├── wikipedia-dev.json
│   │   ├── wikipedia-train.json
│   ├── evidence
│   │   ├── web
│   │   ├── wikipedia
├── triviaqa-unfiltered
│   ├── unfiltered-web-train.json
│   ├── unfiltered-web-dev.json
│   ├── unfiltered-web-test-without-answers.json
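To peek at the downloaded RC data, the sketch below reads one QA file. The path follows the layout above, and the top-level "Data" list with "Question"/"Answer" fields is the released TriviaQA format (double-check against the official repository if the schema has changed).
import json

# Load the web-domain training questions from the RC setting.
with open("triviaqa/triviaqa-rc/qa/web-train.json", encoding="utf-8") as f:
    data = json.load(f)["Data"]

example = data[0]
print(example["Question"])
print(example["Answer"]["Value"])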
HotpotQA is distributed under a CC BY-SA 4.0 license. We only provide download scripts (run with the following commands); please check out the GitHub repository for details on preprocessing and evaluation.
python3 prepare_hotpotqa.py
# Or download with command-line toolkits
nlp_data prepare_hotpotqa
The directory structure of the HotpotQA dataset will be as follows:
hotpotqa
├── hotpot_train_v1.1.json
├── hotpot_dev_fullwiki_v1.json
├── hotpot_dev_distractor_v1.json
├── hotpot_test_fullwiki_v1.json
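The HotpotQA files are plain JSON lists, so a minimal sketch for inspecting one distractor-setting dev example looks like the following (field names are from the released HotpotQA format).
import json

with open("hotpotqa/hotpot_dev_distractor_v1.json", encoding="utf-8") as f:
    examples = json.load(f)

ex = examples[0]
print(ex["question"])
print(ex["answer"])
# Each context entry is a [title, list-of-sentences] pair.
title, sentences = ex["context"][0]
print(title, sentences[0])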
NaturalQuestions is an open-domain QA dataset containing questions from real users. For more details about this dataset, check out https://ai.google.com/research/NaturalQuestions
Run the following command to download NaturalQuestions and extract the gz files:
python3 prepare_naturalquestions.py --extract # Download NaturalQuestions simplified version 1.0 (5.4G)
# Or download with command-line toolkits
nlp_data prepare_naturalquestions --extract
If you do not want to extract the gz files, just run:
python3 prepare_naturalquestions.py
# Or download with command-line toolkits
nlp_data prepare_naturalquestions
The directory structure of the NaturalQuestions dataset will be as follows:
NaturalQuestions
├── v1.0-simplified_simplified-nq-train.jsonl
├── nq-dev-all.jsonl
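Since the extracted files are JSON Lines, you can stream them without loading everything into memory. The sketch below reads the first simplified training example; "question_text" is part of the simplified NQ schema, and the file name follows the layout above.
import json

path = "NaturalQuestions/v1.0-simplified_simplified-nq-train.jsonl"
with open(path, encoding="utf-8") as f:
    first = json.loads(next(f))  # each line is one JSON example

print(first["question_text"])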