Release v1.1.0 · p-lambda/wilds

The v1.1.0 release contains a new Py150 benchmark dataset for code completion, as well as updates to several existing datasets and default models to make them significantly faster and easier to use.

Some of these changes are breaking changes that will impact users who are currently running experiments with WILDS. We sincerely apologize for the inconvenience. We ask all users to update their package to v1.1.0, which will automatically update your datasets. In addition, please update your default models, for example by using the latest example scripts in this repo. These changes were primarily made to accelerate model training, which was a bottleneck for many users; at this time, we do not expect to have to make further changes to the existing datasets or default models.

New datasets

New benchmark dataset: Py150

The Py150-WILDS dataset is a code completion dataset, where the distribution shift is over code from different GitHub repositories.
We focus on accuracy on the subpopulation of class and method tokens, as prior work has shown that those are the most frequent queries in real-world code completion settings.
It is a variant of the Py150 dataset from Raychev et al., 2016.
See our paper for more details.
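As a rough illustration of what "accuracy on the subpopulation of class and method tokens" means, the sketch below restricts an accuracy computation to tokens of selected types. The function name, the `token_types` labels, and the data layout are hypothetical and are not taken from the WILDS code.

```python
def subpopulation_accuracy(y_true, y_pred, token_types, keep=("class", "method")):
    """Accuracy restricted to tokens whose type is in `keep`.

    `token_types` holds one (hypothetical) type label per token.
    """
    pairs = [
        (t, p)
        for t, p, kind in zip(y_true, y_pred, token_types)
        if kind in keep
    ]
    if not pairs:
        # No token of the requested types: accuracy is undefined.
        return float("nan")
    return sum(t == p for t, p in pairs) / len(pairs)
```

Tokens of any other type are ignored entirely, so a model that only predicts common tokens well gets no credit under this metric.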

Additional dataset: SQF

The SQF dataset is based on the stop-question-and-frisk dataset released by the New York Police Department. We adapt the version processed by Goel et al., 2016. The task is to predict criminal possession of a weapon.
We use this dataset to study distribution shifts in an algorithmic fairness context. Specifically, we consider subpopulation shifts across locations and race groups. However, while there are large performance gaps, we did not find that they were caused by the distribution shift. We therefore did not include this dataset as part of the official benchmark.

Major updates to existing datasets

Note that datasets are versioned separately from the main WILDS version. We have two major updates (i.e., breaking, non-backwards-compatible changes) to datasets.

Amazon v1.0 -> v2.0

To speed up model training, we have subsampled the number of reviewers in this dataset to 25% of its original size, while keeping the same number of reviews per reviewer.
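A minimal sketch of reviewer-level subsampling of this kind, assuming reviews arrive as `(reviewer_id, review)` pairs. The function name, data layout, and seed are illustrative assumptions; this is not the script used to build Amazon v2.0.

```python
import random
from collections import defaultdict


def subsample_reviewers(reviews, fraction=0.25, seed=0):
    """Keep all reviews from a random `fraction` of the reviewers.

    `reviews` is a list of (reviewer_id, review) pairs (hypothetical layout).
    """
    by_reviewer = defaultdict(list)
    for reviewer_id, review in reviews:
        by_reviewer[reviewer_id].append(review)

    reviewer_ids = sorted(by_reviewer)  # sorted so the sample is reproducible
    if not reviewer_ids:
        return []
    n_keep = max(1, round(len(reviewer_ids) * fraction))
    kept = set(random.Random(seed).sample(reviewer_ids, n_keep))

    # Reviews from kept reviewers are retained in full; the rest are dropped.
    return [(r, review) for r, review in reviews if r in kept]
```

Sampling at the reviewer level, rather than the review level, is what keeps the number of reviews per reviewer unchanged.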

iWildCam v1.0 -> v2.0

Previously, the ID split was done uniformly at random, meaning that images from the same sequence (i.e., taken within a few seconds of each other by the same camera) could be found across all of the training / validation (ID) / test (ID) sets.
In v2.0, we have redone the ID split so that all images taken on the same day by the same camera are in only one of the training, validation (ID), or test (ID) sets. In other words, these sets still comprise images from the same cameras, but taken on different days.
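The idea of splitting by camera and day can be sketched as follows. The record keys (`id`, `camera`, `date`), the split names, and the split fractions are assumptions made for illustration, not the actual iWildCam v2.0 splitting code.

```python
import random
from collections import defaultdict


def split_by_camera_day(images, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign every (camera, date) group to exactly one split.

    `images` is a list of dicts with hypothetical keys
    'id', 'camera', and 'date'.
    """
    groups = defaultdict(list)
    for img in images:
        groups[(img["camera"], img["date"])].append(img["id"])

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    names = ("train", "val_id", "test_id")
    splits = {name: [] for name in names}
    cut1 = int(len(keys) * fractions[0])
    cut2 = cut1 + int(len(keys) * fractions[1])
    for i, key in enumerate(keys):
        name = names[0] if i < cut1 else names[1] if i < cut2 else names[2]
        # All images from one camera-day land in the same split.
        splits[name].extend(groups[key])
    return splits
```

Because whole camera-day groups are assigned at once, images from the same sequence can no longer straddle two splits.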
In line with the new iWildCam 2021 challenge on Kaggle, we have also removed the following images:
- images that include humans or pictures taken indoors.
- images with non-animal categories such as start and unidentifiable.
- images in categories such as unknown, unknown raptor and unknown rat.
We have restored location 537, which was previously removed because we mistakenly believed its images were corrupted.
We have re-split the data into training, validation (ID), test (ID), validation (OOD), and test (OOD) sets. This is a different random split from v1.0.
Since we remove any classes that do not end up in the train split, removing those images and redoing the split gave us a different set of species. There are now 182 classes instead of 186.
Specifically, the following classes have been removed: ['unknown', 'macaca fascicularis', 'proechimys sp', 'unidentifiable', 'turtur calcospilos', 'streptopilia senegalensis', 'equus africanus', 'macaca nemestrina', 'start', 'paleosuchus sp', 'unknown raptor', 'unknown rat', 'misfire', 'mustela lutreolina', 'canis latrans', 'myoprocta pratti', 'xerus rutilus', 'end', 'psophia crepitans', 'ictonyx striatus'].
The following classes have been added: ['praomys tullbergi', 'polyplectron chalcurum', 'ardeotis kori', 'phaetornis sp', 'mus minutoides', 'raphicerus campestris', 'tigrisoma mexicanum', 'leptailurus serval', 'malacomys longipes', 'oenomys hypoxanthus', 'turdus olivaceus', 'macaca sp', 'leiothrix argentauris', 'lophura sp', 'mazama temama', 'hippopotamus amphibius'].
For convenience, we have also added a categories.csv that maps from label IDs to species names.
To speed up downloading and model training (by reducing the I/O bottleneck), we have also resized all images to have a height of 448px while keeping the original aspect ratio. Since all images are wider than they are tall, every image now has a minimum dimension of 448px. Note that because JPEG compression is lossy, this procedure produces different images from resizing the full-sized image in code after loading it.
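For reference, the target size for an aspect-preserving resize to a fixed height can be computed as in the sketch below. The function name and the rounding choice are assumptions; the exact rounding used to produce the v2.0 images is not stated here.

```python
def resized_dimensions(width, height, target_height=448):
    """Return (new_width, new_height) that preserves the aspect ratio."""
    scale = target_height / height
    # Rounding to the nearest pixel is an assumption of this sketch.
    return max(1, round(width * scale)), target_height


# A 2048x1536 image (4:3) maps to a width of about 597px at height 448px.
print(resized_dimensions(2048, 1536))
```

The resulting tuple is in (width, height) order, which is the order that image libraries such as Pillow expect for `Image.resize`.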

Minor updates to existing datasets

We made two backwards-compatible changes to existing datasets. We encourage all users to update these datasets; these updates should leave results unchanged (modulo training randomness). In future versions of the WILDS package, we will deprecate the older versions of these datasets.

FMoW v1.0 -> v1.1

Previously, the images were stored as chunks in .npy files and read in using NumPy memmapping.
Now, we have converted them (losslessly) into individual PNG images. This should help with disk I/O and memory usage, and make them more convenient to visualize and use in other pipelines.

PovertyMap v1.0 -> v1.1

Previously, the images were stored in a single .npy file and read in using NumPy memmapping.
Now, we have converted them (losslessly) into individual compressed .npz files. This should help with disk I/O and memory usage.
We have correspondingly updated the default number of workers for the data loader from 1 to 4.

Default model updates

We have updated the default models for several datasets. Please take note of these changes if you are currently running experiments with these datasets.

Amazon and CivilComments

To speed up model training, we have switched from BERT-base-uncased to DistilBERT-base-uncased. This obtains roughly similar accuracy but at twice the speed.
For CivilComments, we have also increased the number of replicates from 3 to 5, to reduce variability in the reported performance.

Camelyon17

Previously, we were upsizing each image to 224x224 before passing it into the model.
We now leave the images at their original resolution of 96x96, which significantly speeds up model training.

iWildCam

Previously, we were resizing each image to 224x224 before passing it into the model. However, this limited model accuracy, as the animals in the images can sometimes be quite small.
We now resize each image to 448x448 before passing it into the model, which improves accuracy and macro F1 across the board.

FMoW

For consistency with the other datasets, we have changed the early stopping validation criterion (val_metric) from acc_avg to acc_worst_region.
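To make the difference between the two criteria concrete, the sketch below computes an average accuracy and a worst-group accuracy from the same predictions. It assumes, as the name suggests, that acc_worst_region is the lowest per-region accuracy; the function and argument names are hypothetical and this is not the WILDS eval code.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return (average accuracy, worst per-group accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)

    per_group = {g: correct[g] / total[g] for g in total}
    acc_avg = sum(correct.values()) / sum(total.values())
    return acc_avg, min(per_group.values())
```

A model can score well on average while failing on a small region, so early stopping on the worst-group value can select a different checkpoint than early stopping on the average.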

PovertyMap

For consistency with the other datasets, we have changed the early stopping validation criterion (val_metric) from r_all to r_wg.

Other changes

We have uploaded an executable version of our paper to CodaLab. This contains the exact commands, code, and data used for each experiment reported in our paper. The trained model weights for every experiment can also be found there.
To ease downloading, we have added wilds/download_datasets.py, which allows users to download all (or a subset of) datasets at once. Please see the README for instructions.
We have added a convenience function for getting the appropriate constructor for each dataset in wilds/get_dataset.py. This function allows you to specify a version argument. If this is not specified, it defaults to the latest available version for that dataset. If that version is not downloaded and the download argument is also set, then it will automatically download that version.
The example script examples/run_expt.py now also takes in a version argument.
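The default-to-latest behavior of the version argument can be pictured with the following sketch. `resolve_version` and its arguments are hypothetical names used for illustration; this is not the logic in wilds/get_dataset.py, and it leaves out the download step.

```python
def resolve_version(requested, available):
    """Pick the dataset version to load.

    `available` is a list of version strings such as ['1.0', '2.0'].
    Falls back to the highest version when `requested` is None.
    """
    def as_tuple(version):
        # Compare numerically so that '1.10' ranks above '1.9'.
        return tuple(int(part) for part in version.split("."))

    if requested is None:
        return max(available, key=as_tuple)
    if requested not in available:
        raise ValueError(f"Version {requested} is not one of {available}")
    return requested
```

Pinning an explicit version in experiment scripts keeps results comparable if a newer dataset version is released later.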
We have added download sizes and expected training times to the README.
We have updated the default inputs for WILDSDatasets.eval methods for various datasets. For example, eval for most classification datasets now takes in predicted labels by default, whereas predictions were previously passed in as logits. The default inputs vary across datasets, and we document this in the docstring of each eval method.
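Code that previously passed logits to eval would need to convert them to predicted labels first. A minimal pure-Python sketch of that conversion is shown below; the helper name is hypothetical, and the docstring of each eval method remains the reference for what that dataset expects.

```python
def logits_to_labels(logits):
    """Convert per-example class scores into predicted class indices."""
    # For each row, pick the index of the largest score (first one on ties).
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


print(logits_to_labels([[0.1, 2.3, -1.0], [4.0, 0.0, 0.5]]))  # [1, 0]
```

With PyTorch tensors, the equivalent operation is `logits.argmax(dim=-1)`.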
We made a few updates to the code in examples/ to interface better with language modeling tasks (for Py150). None of these changes affect the results or the interface with algorithms.
We updated the code in examples/ to save model predictions in an appropriate format for submissions to the leaderboard.
Finally, we have also updated our paper to streamline the writing and include these new numbers and datasets.
