Merge pull request #39 from p-lambda/dev
V1.1 updates
ssagawa authored Mar 10, 2021
2 parents 28ef873 + 9c84fec commit b38304b
Showing 48 changed files with 2,517 additions and 522 deletions.
63 changes: 47 additions & 16 deletions README.md
@@ -29,7 +29,7 @@ pip install wilds
If you have already installed it, please check that you have the latest version:
```bash
python -c "import wilds; print(wilds.__version__)"
# This should print "1.0.0". If it doesn't, update by running:
# This should print "1.1.0". If it doesn't, update by running:
pip install -U wilds
```

@@ -42,15 +42,15 @@ pip install -e .

### Requirements
  - numpy>=1.19.1
+ - ogb>=1.2.6
+ - outdated>=0.2.0
  - pandas>=1.1.0
  - pillow>=7.2.0
- - torch>=1.7.0
- - tqdm>=4.53.0
  - pytz>=2020.4
- - outdated>=0.2.0
- - ogb>=1.2.3
+ - torch>=1.7.0
  - torch-scatter>=2.0.5
  - torch-geometric>=1.6.1
+ - tqdm>=4.53.0

Running `pip install wilds` or `pip install -e .` will automatically check for and install all of these requirements
except for the `torch-scatter` and `torch-geometric` packages, which require a [quick manual install](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html#installation-via-binaries).
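
Once those two packages are installed, a quick sanity check is to import them and print their versions; this is just a sketch, not part of the official instructions:

```python
# Verify that the manually-installed graph packages import cleanly.
import torch_scatter
import torch_geometric

print(torch_scatter.__version__, torch_geometric.__version__)
```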
@@ -70,39 +70,69 @@ To run these scripts, you will need to install these additional dependencies:

All baseline experiments in the paper were run on Python 3.8.5 and CUDA 10.1.

-## Usage
-### Default models
-In the `examples/` folder, we provide a set of scripts that we used to train models on the WILDS package. These scripts are configured with the default models and hyperparameters that we used for all of the baselines described in our paper. All baseline results in the paper can be easily replicated with commands like:
+## Using the example scripts
+
+In the `examples/` folder, we provide a set of scripts that can be used to download WILDS datasets and train models on them.
+These scripts are configured with the default models and hyperparameters that we used for all of the baselines described in our paper. All baseline results in the paper can be easily replicated with commands like:

```bash
-cd examples
-python run_expt.py --dataset iwildcam --algorithm ERM --root_dir data
-python run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data
+python examples/run_expt.py --dataset iwildcam --algorithm ERM --root_dir data
+python examples/run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data
```

The scripts are set up to facilitate general-purpose algorithm development: new algorithms can be added to `examples/algorithms` and then run on all of the WILDS datasets using the default models.
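
As a rough illustration of what this looks like, here is a hypothetical skeleton of a new algorithm; the base class name, constructor signature, and registration hook are illustrative assumptions, and the actual ones live in `examples/algorithms`:

```python
# Hypothetical sketch only: names and signatures are assumptions, not the
# repository's confirmed API.
from algorithms.single_model_algorithm import SingleModelAlgorithm

class MyAlgorithm(SingleModelAlgorithm):
    def objective(self, results):
        # `results` holds the model outputs and labels for the current batch;
        # an ERM-style algorithm would simply average the per-example losses.
        return self.loss.compute(
            results['y_pred'], results['y_true'], return_dict=False)
```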

The first time you run these scripts, you might need to download the datasets. You can do so with the `--download` argument, for example:
```
-python run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data --download
+python examples/run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data --download
```

Alternatively, you can use the standalone `wilds/download_datasets.py` script to download the datasets, for example:

```bash
python wilds/download_datasets.py --root_dir data
```

This will download all datasets to the specified `data` folder. You can also use the `--datasets` argument to download particular datasets.
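
The same downloads can also be triggered from Python via the `get_dataset` interface shown later in this README; a minimal sketch, assuming `root_dir` and `download` mirror the command-line flags:

```python
from wilds import get_dataset

# Download (if necessary) and initialize a few specific datasets.
for name in ['camelyon17', 'civilcomments']:
    get_dataset(dataset=name, root_dir='data', download=True)
```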

The table below lists the size of each of our datasets, as well as the approximate time it takes to train and evaluate the default model for a single ERM run on an NVIDIA V100 GPU.

| Dataset command | Modality | Download size (GB) | Size on disk (GB) | Train+eval time (Hours) |
|-----------------|----------|--------------------|-------------------|-------------------------|
| iwildcam | Image | 11 | 25 | 7 |
| camelyon17 | Image | 10 | 15 | 2 |
| ogb-molpcba | Graph | 0.04 | 2 | 15 |
| civilcomments | Text | 0.1 | 0.3 | 4.5 |
| fmow | Image | 50 | 55 | 6 |
| poverty | Image | 12 | 14 | 5 |
| amazon | Text | 6.6 | 7 | 5 |
| py150 | Text | 0.1 | 0.8 | 9.5 |

While the `camelyon17` dataset is small and fast to train on, we advise against using it as the only dataset to prototype methods on, as the test performance of models trained on it tends to vary substantially across random seeds.

The image datasets (`iwildcam`, `camelyon17`, `fmow`, and `poverty`) tend to have high disk I/O usage. If training time is much slower for you than the approximate times listed above, consider checking if I/O is a bottleneck (e.g., by moving to a local disk if you are using a network drive, or by increasing the number of data loader workers). To speed up training, you could also disable evaluation at each epoch or for all splits by toggling `--evaluate_all_splits` and related arguments.
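
As a sketch of the data-loader tuning mentioned above, assuming (as in the current examples) that `get_train_loader` forwards extra keyword arguments to `torch.utils.data.DataLoader`:

```python
from wilds.common.data_loaders import get_train_loader

# More workers and pinned memory can reduce I/O stalls on image datasets;
# `train_data` is the subset object from the data-loading snippet below.
train_loader = get_train_loader('standard', train_data, batch_size=16,
                                num_workers=8, pin_memory=True)
```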

We have an [executable version](https://wilds.stanford.edu/codalab) of our paper on CodaLab that contains the exact commands, code, and data used for the experiments reported in our paper. Trained model weights for all datasets can also be found there.


## Using the WILDS package
### Data loading

The WILDS package provides a simple, standardized interface for all datasets in the benchmark.
This short Python snippet covers all of the steps needed to get started with a WILDS dataset: downloading and initializing the dataset, accessing the various splits, and preparing a user-customizable data loader.

```py
->>> from wilds.datasets.iwildcam_dataset import IWildCamDataset
+>>> from wilds import get_dataset
>>> from wilds.common.data_loaders import get_train_loader
>>> import torchvision.transforms as transforms

# Load the full dataset, and download it if necessary
->>> dataset = IWildCamDataset(download=True)
+>>> dataset = get_dataset(dataset='iwildcam', download=True)

# Get the training set
>>> train_data = dataset.get_subset('train',
-... transform=transforms.Compose([transforms.Resize((224,224)),
+... transform=transforms.Compose([transforms.Resize((448,448)),
... transforms.ToTensor()]))

# Prepare the standard data loader
@@ -171,11 +201,12 @@ Invoking the `eval` method of each dataset yields all metrics reported in the paper:
>>> dataset.eval(all_y_pred, all_y_true, all_metadata)
{'recall_macro_all': 0.66, ...}
```
Most `eval` methods take in predicted labels for `all_y_pred` by default, but the default inputs vary across datasets and are documented in the `eval` docstrings of the corresponding dataset class.
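
Putting these pieces together, a minimal evaluation loop might look like the following; this is a sketch that assumes a trained classifier `model`, the `dataset` object from the data-loading snippet above, and that `get_eval_loader` is the evaluation counterpart of `get_train_loader`:

```python
import torch
import torchvision.transforms as transforms
from wilds.common.data_loaders import get_eval_loader

# Get the OOD test split with the same transforms used at training time
test_data = dataset.get_subset(
    'test',
    transform=transforms.Compose([transforms.Resize((448, 448)),
                                  transforms.ToTensor()]))
test_loader = get_eval_loader('standard', test_data, batch_size=16)

# Accumulate predicted labels, ground truth, and metadata over the split
model.eval()
y_pred, y_true, meta = [], [], []
with torch.no_grad():
    for x, y, metadata in test_loader:
        y_pred.append(model(x).argmax(dim=-1))  # predicted labels, per the eval docstring
        y_true.append(y)
        meta.append(metadata)

dataset.eval(torch.cat(y_pred), torch.cat(y_true), torch.cat(meta))
```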

## Citing WILDS
If you use WILDS datasets in your work, please cite [our paper](https://arxiv.org/abs/2012.07421) ([Bibtex](https://wilds.stanford.edu/assets/files/bibtex.md)):

-- **WILDS: A Benchmark of in-the-Wild Distribution Shifts** (2020). Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang.
+- **WILDS: A Benchmark of in-the-Wild Distribution Shifts** (2021). Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang.

Please also cite the original papers that introduce the datasets, as listed on the [datasets page](https://wilds.stanford.edu/datasets/).

157 changes: 157 additions & 0 deletions dataset_preprocessing/amazon_yelp/subsample_amazon.py
@@ -0,0 +1,157 @@
import argparse
import csv
import os

import pandas as pd
import numpy as np

# Fix the seed for reproducibility
np.random.seed(0)

"""
Subsample the Amazon dataset.
Usage:
python dataset_preprocessing/amazon_yelp/subsample_amazon.py <path> <frac>
"""

NOT_IN_DATASET = -1
# Split: {'train': 0, 'val': 1, 'id_val': 2, 'test': 3, 'id_test': 4}
TRAIN, OOD_VAL, ID_VAL, OOD_TEST, ID_TEST = range(5)


def main(dataset_path, frac=0.25):
    def output_dataset_sizes(split_df):
        print("-" * 50)
        print(f'Train size: {len(split_df[split_df["split"] == TRAIN])}')
        print(f'Val size: {len(split_df[split_df["split"] == OOD_VAL])}')
        print(f'ID Val size: {len(split_df[split_df["split"] == ID_VAL])}')
        print(f'Test size: {len(split_df[split_df["split"] == OOD_TEST])}')
        print(f'ID Test size: {len(split_df[split_df["split"] == ID_TEST])}')
        print(
            f'Number of examples not included: {len(split_df[split_df["split"] == NOT_IN_DATASET])}'
        )
        print("-" * 50)
        print("\n")

    data_df = pd.read_csv(
        os.path.join(dataset_path, "reviews.csv"),
        dtype={
            "reviewerID": str,
            "asin": str,
            "reviewTime": str,
            "unixReviewTime": int,
            "reviewText": str,
            "summary": str,
            "verified": bool,
            "category": str,
            "reviewYear": int,
        },
        keep_default_na=False,
        na_values=[],
        quoting=csv.QUOTE_NONNUMERIC,
    )

    user_csv_path = os.path.join(dataset_path, "splits", "user.csv")
    split_df = pd.read_csv(user_csv_path)
    output_dataset_sizes(split_df)

    train_data_df = data_df[split_df["split"] == TRAIN]
    train_reviewer_ids = train_data_df.reviewerID.unique()
    print(f"Number of unique reviewers in train set: {len(train_reviewer_ids)}")

    # Randomly sample (1 - frac) x number of reviewers
    # Blackout all the reviews belonging to the randomly sampled reviewers
    subsampled_reviewers_count = int((1 - frac) * len(train_reviewer_ids))
    subsampled_reviewers = np.random.choice(
        train_reviewer_ids, subsampled_reviewers_count, replace=False
    )
    print(subsampled_reviewers)

    blackout_indices = train_data_df[
        train_data_df["reviewerID"].isin(subsampled_reviewers)
    ].index

    # Mark all the corresponding reviews of blackout_indices as -1
    split_df.loc[blackout_indices, "split"] = NOT_IN_DATASET
    output_dataset_sizes(split_df)

    # Mark duplicates
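    # A review counts as a duplicate if the same user posted the same text more
    # than once, or if the same text (compared case-insensitively) appears
    # under multiple users.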
    duplicated_within_user = data_df[["reviewerID", "reviewText"]].duplicated()
    df_deduplicated_within_user = data_df[~duplicated_within_user]
    duplicated_text = df_deduplicated_within_user[
        df_deduplicated_within_user["reviewText"]
        .apply(lambda x: x.lower())
        .duplicated(keep=False)
    ]["reviewText"]
    duplicated_text = set(duplicated_text.values)
    data_df["duplicate"] = (
        data_df["reviewText"].isin(duplicated_text)
    ) | duplicated_within_user

    # Mark html candidates
    data_df["contains_html"] = data_df["reviewText"].apply(
        lambda x: "<" in x and ">" in x
    )

    # Mark clean ones
    data_df["clean"] = ~data_df["duplicate"] & ~data_df["contains_html"]

    # Clear ID val and ID test since we're regenerating
    split_df.loc[split_df["split"] == ID_VAL, "split"] = NOT_IN_DATASET
    split_df.loc[split_df["split"] == ID_TEST, "split"] = NOT_IN_DATASET

    # Regenerate ID val and ID test
    train_reviewer_ids = data_df[split_df["split"] == TRAIN]["reviewerID"].unique()
    np.random.shuffle(train_reviewer_ids)
    cutoff = int(len(train_reviewer_ids) / 2)
    id_val_reviewer_ids = train_reviewer_ids[:cutoff]
    id_test_reviewer_ids = train_reviewer_ids[cutoff:]
    split_df.loc[
        (split_df["split"] == NOT_IN_DATASET)
        & data_df["clean"]
        & data_df["reviewerID"].isin(id_val_reviewer_ids),
        "split",
    ] = ID_VAL
    split_df.loc[
        (split_df["split"] == NOT_IN_DATASET)
        & data_df["clean"]
        & data_df["reviewerID"].isin(id_test_reviewer_ids),
        "split",
    ] = ID_TEST

    # Sanity check
    assert (
        data_df[(split_df["split"] == ID_VAL)]["reviewerID"].value_counts().min() == 75
    )
    assert (
        data_df[(split_df["split"] == ID_VAL)]["reviewerID"].value_counts().max() == 75
    )
    assert (
        data_df[(split_df["split"] == ID_TEST)]["reviewerID"].value_counts().min() == 75
    )
    assert (
        data_df[(split_df["split"] == ID_TEST)]["reviewerID"].value_counts().max() == 75
    )

    # Write out the new splits to user.csv
    output_dataset_sizes(split_df)
    split_df.to_csv(user_csv_path, index=False)
    print("Done.")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Subsample the Amazon dataset.")
    parser.add_argument(
        "path",
        type=str,
        help="Path to the Amazon dataset",
    )
    parser.add_argument(
        "frac",
        type=float,
        help="Subsample fraction",
    )

    args = parser.parse_args()
    main(args.path, args.frac)
28 changes: 28 additions & 0 deletions dataset_preprocessing/fmow/convert_npy_to_jpg.py
@@ -0,0 +1,28 @@
import os
import argparse
import numpy as np
from PIL import Image
from pathlib import Path
from tqdm import tqdm

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--root_dir', required=True,
                        help='The directory where [dataset]/data can be found (or should be downloaded to, if it does not exist).')
    config = parser.parse_args()
    data_dir = Path(config.root_dir) / 'fmow_v1.0'
    image_dir = Path(config.root_dir) / 'fmow_v1.0_images_jpg'
    os.makedirs(image_dir, exist_ok=True)

    img_counter = 0
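    # The images ship as 101 .npy chunks; mmap_mode='r' memory-maps each chunk
    # so that only the rows currently being converted are read into memory.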
    for chunk in tqdm(range(101)):
        npy_chunk = np.load(data_dir / f'rgb_all_imgs_{chunk}.npy', mmap_mode='r')
        for i in range(len(npy_chunk)):
            npy_image = npy_chunk[i]
            img = Image.fromarray(npy_image, mode='RGB')
            img.save(image_dir / f'rgb_img_{img_counter}.jpg')
            img_counter += 1

if __name__ == '__main__':
    main()
