Audio Classification on a Noisy Dataset with Multi-Stage Semi-Supervised Learning
Given a wav file (of variable length), predict its corresponding label(s); each wav can belong to multiple classes.
The original dataset can be found on Kaggle: https://www.kaggle.com/c/freesound-audio-tagging-2019/data.
To save time on data preprocessing, we also use the preprocessed dataset (raw wav data converted to numpy matrices via log-mel transformation): https://www.kaggle.com/daisukelab/fat2019_prep_mels1
The dataset consists of both curated data (with accurate labels) and noisy data (labeled, but with no guarantee that the labels are correct). The noisy set is much larger than the curated set.
Our code implements multiple models (CNN, CNN+LSTM, ResNet); for simplicity, experiments are run with the CNN model by default.
Since CNN-type models only accept fixed-length input, while the audio in our dataset has variable length, we cut long input audio into fixed-length segments (padding where necessary) and use the average of the per-segment predictions as the final prediction for the original audio clip, as sketched below.
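A minimal sketch of this segment-and-average inference, assuming a PyTorch model that maps a batch of `(1, n_mels, seg_len)` inputs to per-class logits (the function name and segment length are illustrative, not the repo's actual code):

```python
import torch
import torch.nn.functional as F

def predict_clip(model, logmel, seg_len=128):
    """Average fixed-length segment predictions over a variable-length clip.

    logmel: tensor of shape (n_mels, n_frames) for one audio clip.
    seg_len: number of frames per segment (illustrative value).
    """
    n_mels, n_frames = logmel.shape
    # pad the clip so its length is a multiple of seg_len
    pad = (-n_frames) % seg_len
    logmel = F.pad(logmel, (0, pad))
    # split into (n_segments, 1, n_mels, seg_len) CNN inputs
    segments = logmel.unfold(1, seg_len, seg_len).permute(1, 0, 2).unsqueeze(1)
    with torch.no_grad():
        probs = torch.sigmoid(model(segments))   # (n_segments, n_classes)
    return probs.mean(dim=0)                     # final clip-level prediction
```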
Stage 0: Train the model on roughly selected noisy data (i.e. mels_trn_noisy_best50s.pkl in https://www.kaggle.com/daisukelab/fat2019_prep_mels1). Details on how the noisy data is roughly selected can be found in https://www.kaggle.com/daisukelab/creating-fat2019-preprocessed-data
Stage 1: Starting from Model 0, which was trained in Stage 0, we train the model again on the curated dataset.
Stage 2 (data filtering): Using Model 1, which was trained in Stage 1, we select the part of the noisy data whose labels we are confident are correct. At the end of this operation we get: 1. labeled data (the curated data plus the noisy data whose labels we are confident in) and 2. unlabeled data (the noisy data whose labels we are not confident in).
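One plausible way to implement this filtering (the exact confidence rule and the threshold are assumptions, not the repo's method; `predict_clip` is the segment-averaging sketch from above):

```python
import numpy as np

def split_noisy_by_confidence(model, noisy_x, noisy_y, threshold=0.7):
    """Split noisy samples into 'trusted' (kept with labels) and 'unlabeled'.

    noisy_x: list of log-mel clips; noisy_y: (n, n_classes) binary labels.
    A clip is trusted when Model 1 assigns every annotated class a
    probability above `threshold` (illustrative rule).
    """
    trusted, unlabeled = [], []
    for x, y in zip(noisy_x, noisy_y):
        probs = predict_clip(model, x).numpy()   # segment-averaged inference
        pos = y > 0
        # confident iff all annotated classes get a high predicted probability
        if pos.any() and probs[pos].min() >= threshold:
            trusted.append((x, y))
        else:
            unlabeled.append(x)
    return trusted, unlabeled
```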
Stage 2 (training): Both the labeled data {x_l, y_l} and the unlabeled data {x_u} are used in this stage. Before the input data is fed into the classifier, a stochastic data augmentation is applied. Here we use SpecAugment as the augmentation.
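A minimal SpecAugment-style sketch using frequency and time masking only (no time warping); the mask sizes and counts are illustrative:

```python
import torch

def spec_augment(logmel, max_freq_mask=16, max_time_mask=24, n_masks=2):
    """Randomly zero out frequency bands and time spans of a (n_mels, n_frames) clip."""
    x = logmel.clone()
    n_mels, n_frames = x.shape
    for _ in range(n_masks):
        # frequency mask: zero a random band of up to max_freq_mask mel bins
        f = torch.randint(0, max_freq_mask + 1, (1,)).item()
        f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
        x[f0:f0 + f, :] = 0.0
        # time mask: zero a random span of up to max_time_mask frames
        t = torch.randint(0, max_time_mask + 1, (1,)).item()
        t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
        x[:, t0:t0 + t] = 0.0
    return x
```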
The loss function consists of two parts:
1. For {x_l, y_l}, the BCE loss is calculated.
2. For both {xl} and {xu} will do stochastic augmentation by 2 times: Take xl for example
, where fθ refers to the classifier and g refers to data augmentation function. Then the squared difference loss will be calculated on the model outputs: . The main idea of this loss is to regularize the network such that it generates about the same outputs for the same data input that undergoes data augmentation.
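Putting the two parts together, a hedged sketch of the Stage 2 objective, assuming `model` outputs logits and `g` is a batch-wise stochastic augmentation such as SpecAugment (the weight `lam` on the consistency term is an assumption):

```python
import torch
import torch.nn.functional as F

def stage2_loss(model, x_l, y_l, x_u, g, lam=1.0):
    """Stage 2 objective: supervised BCE + consistency regularization.

    x_l, y_l: labeled batch; x_u: unlabeled batch (no targets used).
    g: batch-wise stochastic augmentation; lam: illustrative weight
    on the consistency term.
    """
    # part 1: BCE loss on the labeled data only
    sup = F.binary_cross_entropy_with_logits(model(g(x_l)), y_l)

    # part 2: two independent augmentations of every input are passed
    # through f_theta; the squared difference of the outputs is penalized
    x_all = torch.cat([x_l, x_u], dim=0)
    p1 = torch.sigmoid(model(g(x_all)))
    p2 = torch.sigmoid(model(g(x_all)))
    cons = ((p1 - p2) ** 2).mean()

    return sup + lam * cons
```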
Since only the curated data (i.e. mels_train_curated.pkl in the FAT2019 dataset) has reliable labels, evaluation is done on this data. We split mels_train_curated.pkl into three parts: curated training data, curated validation data, and curated testing data, in an 8:1:1 ratio.
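For reference, the 8:1:1 split can be produced with two calls to scikit-learn's `train_test_split` (the variable names and seed are illustrative; `x_curated`/`y_curated` are assumed to hold the clips and labels loaded from mels_train_curated.pkl):

```python
from sklearn.model_selection import train_test_split

# 80% train, then split the remaining 20% evenly into validation and test
x_train, x_rest, y_train, y_rest = train_test_split(
    x_curated, y_curated, test_size=0.2, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=42)
```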
The evaluation metric we use is label-weighted label-ranking average precision (lwlrap), the official metric of the competition.
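For clarity, a self-contained numpy sketch of lwlrap (written for illustration; the repo may use a different implementation):

```python
import numpy as np

def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision.

    truth:  (n_samples, n_classes) binary ground-truth label matrix
    scores: (n_samples, n_classes) predicted scores
    """
    n_samples, n_classes = truth.shape
    precisions = np.zeros_like(scores, dtype=float)
    for i in range(n_samples):
        pos = np.flatnonzero(truth[i] > 0)
        if len(pos) == 0:
            continue
        # rank of each class when sorted by descending score (0 = top)
        ranks = np.empty(n_classes, dtype=int)
        ranks[np.argsort(-scores[i])] = np.arange(n_classes)
        hit = np.zeros(n_classes, dtype=bool)
        hit[ranks[pos]] = True       # ranking positions holding true labels
        cum_hits = np.cumsum(hit)    # true labels retrieved at or above each rank
        precisions[i, pos] = cum_hits[ranks[pos]] / (ranks[pos] + 1.0)
    # each class is weighted by its share of all positive labels
    labels_per_class = truth.sum(axis=0)
    weights = labels_per_class / labels_per_class.sum()
    per_class_lrap = precisions.sum(axis=0) / np.maximum(labels_per_class, 1)
    return float((per_class_lrap * weights).sum())
```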
Here are the results for each stage:
Stage | Validation | Testing |
---|---|---|
Stage 0 | 0.285 | 0.282 |
Stage 1 | 0.828 | 0.791 |
Stage 2 | 0.836 | 0.816 |
- Make sure to set the correct data paths in config.ini. All the data we use can be found at https://www.kaggle.com/c/freesound-audio-tagging-2019/data and https://www.kaggle.com/daisukelab/fat2019_prep_mels1
- One-click run: `python3 runme.py`. The order in which all the code is executed can be found in this script.