This repository contains the datasets and code for the paper A Benchmark for Neural Network Robustness in Skin Cancer Classification. The study was heavily inspired by and adapted from Benchmarking Neural Network Robustness to Common Corruptions and Perturbations (ICLR 2019) by Dan Hendrycks and Thomas Dietterich (GitHub).
This benchmark contains previously unpublished dermoscopic melanoma and nevus images and is designed to test skin lesion classifiers on out-of-distribution (OOD) data. Images are available for download, albeit without ground truth labels in order to maintain the OOD nature of the benchmark. The benchmark consists of three datasets:
- SAM: contains unmodified dermoscopic images which can be used to establish a classifier's performance on clean images.
- SAM-C: contains corrupted dermoscopic images derived from SAM. This dataset can be used to establish a classifier's robustness to low-quality inputs (e.g. blurry, obstructed or distorted images).
- SAM-P: contains perturbed dermoscopic images derived from SAM. This dataset can be used to establish a classifier's robustness to subtle, repeated image changes (e.g. continuous change of brightness).
All images have a resolution of 1000x667 pixels. For this benchmark, reference classifiers were trained and evaluated with rescaled images (i.e. 448x448 pixels). Therefore, any direct comparisons should also use a resolution of 448x448 pixels. Classifier training should also be carried out without any of the data augmentations that were used to create SAM-C and SAM-P (i.e. no rotations, tilts, Gaussian noise, etc.). If there is an augmentation overlap, it should be explicitly stated.
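As an illustration, a minimal evaluation-time preprocessing sketch with PyTorch/torchvision could look as follows; the interpolation mode and ImageNet normalization are assumptions made for illustration, not requirements of the benchmark.

```python
# Minimal preprocessing sketch (assumptions: a PyTorch/torchvision pipeline;
# ImageNet normalization statistics are illustrative, not prescribed).
from torchvision import transforms

eval_transform = transforms.Compose([
    transforms.Resize((448, 448)),  # rescale the 1000x667 images to 448x448
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumption)
                         std=[0.229, 0.224, 0.225]),
])
```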
SAM (Skin Archive Munich) contains 85 unique lesions. Of these, 37 are classified as melanoma and 48 as nevus. As each lesion was photographed multiple times with different camera angles and magnifications, the total number of images in the dataset increases to 319, of which 194 are melanoma and 125 are nevus.
SAM-C is derived from SAM by corrupting its images using 14 corruption types drawn from four main categories - noise, blur, dermoscopy and digital - and five different severity levels. This results in an effective dataset size of 22,330 images.
SAM-P is derived from SAM by perturbing its images using 10 perturbation types drawn from four main categories - noise, blur, dermoscopy and digital. Each image from SAM is perturbed 31 times, resulting in perturbation sequences similar to the ones shown below. This results in an effective dataset size of 98,890 images. In addition to the PNGs, we also provide an MP4 version of SAM-P for illustration, where each perturbation sequence is converted into a short video of 31 frames.
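For reference, both effective dataset sizes follow directly from these factors:

```python
# Effective dataset sizes derived from the numbers above.
n_images = 319                 # images in SAM
sam_c = n_images * 14 * 5      # 14 corruption types x 5 severity levels = 22,330
sam_p = n_images * 10 * 31     # 10 perturbation types x 31 frames = 98,890
print(sam_c, sam_p)            # 22330 98890
```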
Download the MP4 version of SAM-P here [17 GB].
We provide the full corruption and perturbation results for all architectures, broken down by transformation type and averaged across all five repetitions. For SAM-C, we show both the mBCE (sheet 1) and the relative mBCE (sheet 2). For SAM-P, we show the mFR. In each spreadsheet, Table 1 shows the unadjusted results, while Table 2 shows the results adjusted by our AlexNet baseline.
Researchers who decide to use this benchmark should submit their results via email to [email protected] and reference 'Robustness Benchmark Submission' in the subject line. Submission file templates are available here. The name for each submission file must adhere to the following pattern:
`[CustomText]_DatasetName.csv`

where `CustomText` can be anything except an underscore. `DatasetName` is fixed to either SAM, SAM-C, SAM-P, SAM-C-Extra or SAM-P-Extra and should not be changed. For example, a researcher could submit a set of files named `MyApproachA_SAM.csv`, `MyApproachA_SAM-C.csv` and `MyApproachA_SAM-P.csv` to obtain results for SAM, SAM-C and SAM-P for their approach A.
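As a purely hypothetical illustration (this helper is not part of the repository), the naming convention could be checked like this:

```python
import re

# Hypothetical helper: checks that a filename follows [CustomText]_DatasetName.csv,
# where CustomText must not contain underscores and DatasetName is one of the
# five fixed dataset names.
VALID_DATASETS = {"SAM", "SAM-C", "SAM-P", "SAM-C-Extra", "SAM-P-Extra"}

def is_valid_submission_name(filename: str) -> bool:
    match = re.fullmatch(r"([^_]+)_(.+)\.csv", filename)
    return bool(match) and match.group(2) in VALID_DATASETS

assert is_valid_submission_name("MyApproachA_SAM-C.csv")
assert not is_valid_submission_name("My_Approach_SAM.csv")  # underscore in CustomText
```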
Each submission file is a single CSV (comma-separated values) file with three columns: `image_path`, `nevus` and `melanoma`. Column `image_path` is already prefilled and should not be changed. Columns `nevus` and `melanoma` require the researcher to fill in the respective diagnostic confidence scores, which are expressed as floating-point values in the closed interval [0.0, 1.0], where 0.5 is used as the binary classification threshold.
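As a sketch, a template could be filled in with pandas along the following lines. `predict_melanoma_probability` is a placeholder for your own classifier, and the assumption that the two scores sum to 1 is illustrative rather than required.

```python
import pandas as pd

# Hypothetical sketch for filling in a submission template.
# `predict_melanoma_probability` is a placeholder; replace it with your
# classifier's inference code (here it simply returns a constant).
def predict_melanoma_probability(image_path: str) -> float:
    return 0.5

template = pd.read_csv("MyApproachA_SAM.csv")              # image_path is prefilled
template["melanoma"] = template["image_path"].apply(predict_melanoma_probability)
template["nevus"] = 1.0 - template["melanoma"]             # assumption: scores sum to 1
template.to_csv("MyApproachA_SAM.csv", index=False)
```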
Evaluation of the submitted files is carried out by us using the metrics for SAM, SAM-C and SAM-P that were outlined in the paper. For each submitted file, we will provide a corresponding CSV file listing the unadjusted metrics for each transformation. Researchers can then calculate the unadjusted mBCE, relative mBCE and mFR, as well as the baseline-adjusted mBCE, relative mBCE and mFR. We have provided an example notebook and the necessary code to do so in this repo.
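As a rough sketch, and assuming the baseline adjustment follows the same scheme as Hendrycks & Dietterich's mCE (per-corruption errors summed over severities and divided by the corresponding AlexNet errors, then averaged over corruptions), the adjusted mBCE could be computed as follows; the notebook provided in this repository is the authoritative reference.

```python
import numpy as np

# Sketch of a baseline-adjusted mean error, assuming the mCE-style weighting
# scheme of Hendrycks & Dietterich: for each corruption, sum the balanced
# classification errors over the five severities, divide by the AlexNet
# baseline's summed errors, then average over all corruptions.
def adjusted_mbce(errors: dict, baseline_errors: dict) -> float:
    """errors, baseline_errors: {corruption_name: [error at severity 1..5]}"""
    ratios = [np.sum(errors[c]) / np.sum(baseline_errors[c]) for c in errors]
    return float(np.mean(ratios))
```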
- Download and extract SAM, SAM-C and SAM-P.
- Download submission file templates and rename `CustomText` to anything except an underscore.
- Evaluate your classifier and fill in the templates with your obtained diagnostic confidence scores.
- Send submission templates via email to [email protected], referencing 'Robustness Benchmark Submission' in the subject line.
- Receive results and calculate the unadjusted and adjusted metrics using our AlexNet baseline and provided code.
Maron R.C. et al., A benchmark for neural network robustness in skin cancer classification, European Journal of Cancer