Merge pull request #39 from p-lambda/dev
V1.1 updates
ssagawa authored Mar 10, 2021
2 parents 28ef873 + 9c84fec commit b38304b
Showing 48 changed files with 2,517 additions and 522 deletions.
63 changes: 47 additions & 16 deletions README.md
@@ -29,7 +29,7 @@ pip install wilds
If you have already installed it, please check that you have the latest version:
```bash
python -c "import wilds; print(wilds.__version__)"
# This should print "1.0.0". If it doesn't, update by running:
# This should print "1.1.0". If it doesn't, update by running:
pip install -U wilds
```

@@ -42,15 +42,15 @@ pip install -e .

### Requirements
  - numpy>=1.19.1
+ - ogb>=1.2.6
+ - outdated>=0.2.0
  - pandas>=1.1.0
  - pillow>=7.2.0
- - torch>=1.7.0
- - tqdm>=4.53.0
  - pytz>=2020.4
- - outdated>=0.2.0
- - ogb>=1.2.3
+ - torch>=1.7.0
  - torch-scatter>=2.0.5
  - torch-geometric>=1.6.1
+ - tqdm>=4.53.0

Running `pip install wilds` or `pip install -e .` will automatically check for and install all of these requirements
except for the `torch-scatter` and `torch-geometric` packages, which require a [quick manual install](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html#installation-via-binaries).
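
Once those two packages are installed, a quick sanity check is to import them and print their versions; this is just a sketch, not part of the official instructions:

```python
# Verify that the manually-installed graph packages import cleanly.
import torch_scatter
import torch_geometric

print(torch_scatter.__version__, torch_geometric.__version__)
```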
@@ -70,39 +70,69 @@ To run these scripts, you will need to install these additional dependencies:

All baseline experiments in the paper were run on Python 3.8.5 and CUDA 10.1.

-## Usage
-### Default models
-In the `examples/` folder, we provide a set of scripts that we used to train models on the WILDS package. These scripts are configured with the default models and hyperparameters that we used for all of the baselines described in our paper. All baseline results in the paper can be easily replicated with commands like:
+## Using the example scripts
+
+In the `examples/` folder, we provide a set of scripts that can be used to download WILDS datasets and train models on them.
+These scripts are configured with the default models and hyperparameters that we used for all of the baselines described in our paper. All baseline results in the paper can be easily replicated with commands like:

```bash
-cd examples
-python run_expt.py --dataset iwildcam --algorithm ERM --root_dir data
-python run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data
+python examples/run_expt.py --dataset iwildcam --algorithm ERM --root_dir data
+python examples/run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data
```

The scripts are set up to facilitate general-purpose algorithm development: new algorithms can be added to `examples/algorithms` and then run on all of the WILDS datasets using the default models.
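
As a rough illustration of what this looks like, here is a hypothetical skeleton of a new algorithm; the base class name, constructor signature, and registration hook are illustrative assumptions, and the actual ones live in `examples/algorithms`:

```python
# Hypothetical sketch only: names and signatures are assumptions, not the
# repository's confirmed API.
from algorithms.single_model_algorithm import SingleModelAlgorithm

class MyAlgorithm(SingleModelAlgorithm):
    def objective(self, results):
        # `results` holds the model outputs and labels for the current batch;
        # an ERM-style algorithm would simply average the per-example losses.
        return self.loss.compute(
            results['y_pred'], results['y_true'], return_dict=False)
```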

The first time you run these scripts, you might need to download the datasets. You can do so with the `--download` argument, for example:
```
-python run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data --download
+python examples/run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data --download
```

Alternatively, you can use the standalone `wilds/download_datasets.py` script to download the datasets, for example:

```bash
python wilds/download_datasets.py --root_dir data
```

This will download all datasets to the specified `data` folder. You can also use the `--datasets` argument to download particular datasets.
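
The same downloads can also be triggered from Python via the `get_dataset` interface shown later in this README; a minimal sketch, assuming `root_dir` and `download` mirror the command-line flags:

```python
from wilds import get_dataset

# Download (if necessary) and initialize a few specific datasets.
for name in ['camelyon17', 'civilcomments']:
    get_dataset(dataset=name, root_dir='data', download=True)
```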

The table below lists the size of each of our datasets, as well as the approximate time it takes to train and evaluate the default model for a single ERM run on an NVIDIA V100 GPU.

| Dataset command | Modality | Download size (GB) | Size on disk (GB) | Train+eval time (Hours) |
|-----------------|----------|--------------------|-------------------|-------------------------|
| iwildcam | Image | 11 | 25 | 7 |
| camelyon17 | Image | 10 | 15 | 2 |
| ogb-molpcba | Graph | 0.04 | 2 | 15 |
| civilcomments | Text | 0.1 | 0.3 | 4.5 |
| fmow | Image | 50 | 55 | 6 |
| poverty | Image | 12 | 14 | 5 |
| amazon | Text | 6.6 | 7 | 5 |
| py150 | Text | 0.1 | 0.8 | 9.5 |

While the `camelyon17` dataset is small and fast to train on, we advise against using it as the only dataset to prototype methods on, as the test performance of models trained on it tends to vary substantially across random seeds.

The image datasets (`iwildcam`, `camelyon17`, `fmow`, and `poverty`) tend to have high disk I/O usage. If training time is much slower for you than the approximate times listed above, consider checking if I/O is a bottleneck (e.g., by moving to a local disk if you are using a network drive, or by increasing the number of data loader workers). To speed up training, you could also disable evaluation at each epoch or for all splits by toggling `--evaluate_all_splits` and related arguments.
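
As a sketch of the data-loader tuning mentioned above, assuming (as in the current examples) that `get_train_loader` forwards extra keyword arguments to `torch.utils.data.DataLoader`:

```python
from wilds.common.data_loaders import get_train_loader

# More workers and pinned memory can reduce I/O stalls on image datasets;
# `train_data` is the subset object from the data-loading snippet below.
train_loader = get_train_loader('standard', train_data, batch_size=16,
                                num_workers=8, pin_memory=True)
```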

We have an [executable version](https://wilds.stanford.edu/codalab) of our paper on CodaLab that contains the exact commands, code, and data used for the experiments reported in our paper. Trained model weights for all datasets can also be found there.


## Using the WILDS package
### Data loading

The WILDS package provides a simple, standardized interface for all datasets in the benchmark.
This short Python snippet covers all of the steps needed to get started with a WILDS dataset: downloading and initializing the dataset, accessing the various splits, and preparing a user-customizable data loader.

```py
->>> from wilds.datasets.iwildcam_dataset import IWildCamDataset
+>>> from wilds import get_dataset
>>> from wilds.common.data_loaders import get_train_loader
>>> import torchvision.transforms as transforms

# Load the full dataset, and download it if necessary
->>> dataset = IWildCamDataset(download=True)
+>>> dataset = get_dataset(dataset='iwildcam', download=True)

# Get the training set
>>> train_data = dataset.get_subset('train',
-... transform=transforms.Compose([transforms.Resize((224,224)),
+... transform=transforms.Compose([transforms.Resize((448,448)),
... transforms.ToTensor()]))

# Prepare the standard data loader
@@ -171,11 +201,12 @@ Invoking the `eval` method of each dataset yields all metrics reported in the paper:
>>> dataset.eval(all_y_pred, all_y_true, all_metadata)
{'recall_macro_all': 0.66, ...}
```
Most `eval` methods take in predicted labels for `all_y_pred` by default, but the default inputs vary across datasets and are documented in the `eval` docstrings of the corresponding dataset class.
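
Putting these pieces together, a minimal evaluation loop might look like the following; this is a sketch that assumes a trained classifier `model`, the `dataset` object from the data-loading snippet above, and that `get_eval_loader` is the evaluation counterpart of `get_train_loader`:

```python
import torch
import torchvision.transforms as transforms
from wilds.common.data_loaders import get_eval_loader

# Get the OOD test split with the same transforms used at training time
test_data = dataset.get_subset(
    'test',
    transform=transforms.Compose([transforms.Resize((448, 448)),
                                  transforms.ToTensor()]))
test_loader = get_eval_loader('standard', test_data, batch_size=16)

# Accumulate predicted labels, ground truth, and metadata over the split
model.eval()
y_pred, y_true, meta = [], [], []
with torch.no_grad():
    for x, y, metadata in test_loader:
        y_pred.append(model(x).argmax(dim=-1))  # predicted labels, per the eval docstring
        y_true.append(y)
        meta.append(metadata)

dataset.eval(torch.cat(y_pred), torch.cat(y_true), torch.cat(meta))
```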

## Citing WILDS
If you use WILDS datasets in your work, please cite [our paper](https://arxiv.org/abs/2012.07421) ([Bibtex](https://wilds.stanford.edu/assets/files/bibtex.md)):

-- **WILDS: A Benchmark of in-the-Wild Distribution Shifts** (2020). Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang.
+- **WILDS: A Benchmark of in-the-Wild Distribution Shifts** (2021). Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang.

Please also cite the original papers that introduce the datasets, as listed on the [datasets page](https://wilds.stanford.edu/datasets/).

157 changes: 157 additions & 0 deletions dataset_preprocessing/amazon_yelp/subsample_amazon.py
@@ -0,0 +1,157 @@
import argparse
import csv
import os

import pandas as pd
import numpy as np

# Fix the seed for reproducibility
np.random.seed(0)

"""
Subsample the Amazon dataset.
Usage:
python dataset_preprocessing/amazon_yelp/subsample_amazon.py <path> <frac>
"""

NOT_IN_DATASET = -1
# Split: {'train': 0, 'val': 1, 'id_val': 2, 'test': 3, 'id_test': 4}
TRAIN, OOD_VAL, ID_VAL, OOD_TEST, ID_TEST = range(5)


def main(dataset_path, frac=0.25):
    def output_dataset_sizes(split_df):
        print("-" * 50)
        print(f'Train size: {len(split_df[split_df["split"] == TRAIN])}')
        print(f'Val size: {len(split_df[split_df["split"] == OOD_VAL])}')
        print(f'ID Val size: {len(split_df[split_df["split"] == ID_VAL])}')
        print(f'Test size: {len(split_df[split_df["split"] == OOD_TEST])}')
        print(f'ID Test size: {len(split_df[split_df["split"] == ID_TEST])}')
        print(
            f'Number of examples not included: {len(split_df[split_df["split"] == NOT_IN_DATASET])}'
        )
        print("-" * 50)
        print("\n")

    data_df = pd.read_csv(
        os.path.join(dataset_path, "reviews.csv"),
        dtype={
            "reviewerID": str,
            "asin": str,
            "reviewTime": str,
            "unixReviewTime": int,
            "reviewText": str,
            "summary": str,
            "verified": bool,
            "category": str,
            "reviewYear": int,
        },
        keep_default_na=False,
        na_values=[],
        quoting=csv.QUOTE_NONNUMERIC,
    )

    user_csv_path = os.path.join(dataset_path, "splits", "user.csv")
    split_df = pd.read_csv(user_csv_path)
    output_dataset_sizes(split_df)

    train_data_df = data_df[split_df["split"] == TRAIN]
    train_reviewer_ids = train_data_df.reviewerID.unique()
    print(f"Number of unique reviewers in train set: {len(train_reviewer_ids)}")

    # Randomly sample (1 - frac) x number of reviewers
    # Blackout all the reviews belonging to the randomly sampled reviewers
    subsampled_reviewers_count = int((1 - frac) * len(train_reviewer_ids))
    subsampled_reviewers = np.random.choice(
        train_reviewer_ids, subsampled_reviewers_count, replace=False
    )
    print(subsampled_reviewers)

    blackout_indices = train_data_df[
        train_data_df["reviewerID"].isin(subsampled_reviewers)
    ].index

    # Mark all the corresponding reviews of blackout_indices as -1
    split_df.loc[blackout_indices, "split"] = NOT_IN_DATASET
    output_dataset_sizes(split_df)

    # Mark duplicates
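    # A review counts as a duplicate if the same user posted the same text more
    # than once, or if the same text (compared case-insensitively) appears
    # under multiple users.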
    duplicated_within_user = data_df[["reviewerID", "reviewText"]].duplicated()
    df_deduplicated_within_user = data_df[~duplicated_within_user]
    duplicated_text = df_deduplicated_within_user[
        df_deduplicated_within_user["reviewText"]
        .apply(lambda x: x.lower())
        .duplicated(keep=False)
    ]["reviewText"]
    duplicated_text = set(duplicated_text.values)
    data_df["duplicate"] = (
        data_df["reviewText"].isin(duplicated_text)
    ) | duplicated_within_user

    # Mark html candidates
    data_df["contains_html"] = data_df["reviewText"].apply(
        lambda x: "<" in x and ">" in x
    )

    # Mark clean ones
    data_df["clean"] = ~data_df["duplicate"] & ~data_df["contains_html"]

    # Clear ID val and ID test since we're regenerating
    split_df.loc[split_df["split"] == ID_VAL, "split"] = NOT_IN_DATASET
    split_df.loc[split_df["split"] == ID_TEST, "split"] = NOT_IN_DATASET

    # Regenerate ID val and ID test
    train_reviewer_ids = data_df[split_df["split"] == TRAIN]["reviewerID"].unique()
    np.random.shuffle(train_reviewer_ids)
    cutoff = int(len(train_reviewer_ids) / 2)
    id_val_reviewer_ids = train_reviewer_ids[:cutoff]
    id_test_reviewer_ids = train_reviewer_ids[cutoff:]
    split_df.loc[
        (split_df["split"] == NOT_IN_DATASET)
        & data_df["clean"]
        & data_df["reviewerID"].isin(id_val_reviewer_ids),
        "split",
    ] = ID_VAL
    split_df.loc[
        (split_df["split"] == NOT_IN_DATASET)
        & data_df["clean"]
        & data_df["reviewerID"].isin(id_test_reviewer_ids),
        "split",
    ] = ID_TEST

    # Sanity check
    assert (
        data_df[(split_df["split"] == ID_VAL)]["reviewerID"].value_counts().min() == 75
    )
    assert (
        data_df[(split_df["split"] == ID_VAL)]["reviewerID"].value_counts().max() == 75
    )
    assert (
        data_df[(split_df["split"] == ID_TEST)]["reviewerID"].value_counts().min() == 75
    )
    assert (
        data_df[(split_df["split"] == ID_TEST)]["reviewerID"].value_counts().max() == 75
    )

    # Write out the new splits to user.csv
    output_dataset_sizes(split_df)
    split_df.to_csv(user_csv_path, index=False)
    print("Done.")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Subsample the Amazon dataset.")
    parser.add_argument(
        "path",
        type=str,
        help="Path to the Amazon dataset",
    )
    parser.add_argument(
        "frac",
        type=float,
        help="Subsample fraction",
    )

    args = parser.parse_args()
    main(args.path, args.frac)
28 changes: 28 additions & 0 deletions dataset_preprocessing/fmow/convert_npy_to_jpg.py
@@ -0,0 +1,28 @@
import os
import argparse
import numpy as np
from PIL import Image
from pathlib import Path
from tqdm import tqdm

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--root_dir', required=True,
                        help='The directory where [dataset]/data can be found (or should be downloaded to, if it does not exist).')
    config = parser.parse_args()
    data_dir = Path(config.root_dir) / 'fmow_v1.0'
    image_dir = Path(config.root_dir) / 'fmow_v1.0_images_jpg'
    os.makedirs(image_dir, exist_ok=True)

    img_counter = 0
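    # The images ship as 101 .npy chunks; mmap_mode='r' memory-maps each chunk
    # so that only the rows currently being converted are read into memory.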
    for chunk in tqdm(range(101)):
        npy_chunk = np.load(data_dir / f'rgb_all_imgs_{chunk}.npy', mmap_mode='r')
        for i in range(len(npy_chunk)):
            npy_image = npy_chunk[i]
            img = Image.fromarray(npy_image, mode='RGB')
            img.save(image_dir / f'rgb_img_{img_counter}.jpg')
            img_counter += 1

if __name__ == '__main__':
    main()
