Extract audio descriptors and learn to generate them with transformers.
Install Python >= 3.6 and PyTorch (with GPU support if desired), then install the dependencies and the package:

pip install -r requirements.txt
pip install -e .
Demiurge is a tri-modal Generative Adversarial Network (Goodfellow et al. 2014) architecture devised to generate and sequence musical sounds in the waveform domain (Donahue et al. 2019). The architecture combines a sound-generating UnaGAN plus MelGAN model with a custom GAN sequencer. The diagram below shows how the different elements relate.
The project's purpose is to generate predicted audio that follows the given input audio files. In the diagram, the input audio files are stored in the RECORDED AUDIO DB.
- The audio files are first processed by MelGAN and UnaGAN to generate many more similar audio files, which form the RAW GENERATED AUDIO DB. This database is the audio source for the predicted output, on the assumption that the audio following the input should sound similar to the input itself.
- The audio in the RECORDED AUDIO DB is processed into descriptors such as MFCCs, and the SEQUENCER GAN, the time-series prediction model in this repository, predicts upcoming descriptors based on the input audio descriptors.
- Because the predicted descriptors are statistical values that cannot easily be converted back to audio, they are matched against descriptors extracted from the wav files in the RAW GENERATED AUDIO DB. The audio segments referenced by the matched descriptors then stand in for the predicted descriptors and are merged into the output prediction audio file (see the extraction sketch after this list).
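To make the descriptor step concrete, here is a minimal sketch of extracting one MFCC descriptor per fixed-length audio segment. It uses librosa for illustration; the sample rate, segment length, and number of MFCC coefficients are assumptions, not the repository's exact settings.

```python
import numpy as np
import librosa

SR = 22050          # assumed sample rate
SEG_LEN = SR // 2   # assumed segment length: 0.5 s

def extract_descriptors(wav_path):
    """Cut a wav file into fixed-length segments and compute one MFCC
    descriptor vector per segment, keeping a reference to each segment."""
    audio, _ = librosa.load(wav_path, sr=SR, mono=True)
    descriptors, segments = [], []
    for start in range(0, len(audio) - SEG_LEN + 1, SEG_LEN):
        seg = audio[start:start + SEG_LEN]
        mfcc = librosa.feature.mfcc(y=seg, sr=SR, n_mfcc=13)
        descriptors.append(mfcc.mean(axis=1))  # average over time -> one 13-dim vector
        segments.append(seg)
    return np.array(descriptors), segments
```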
As described above, the descriptor model (SEQUENCER GAN) is required for the prediction workflow. Users can either select a pretrained descriptor model by its wandb run id in the prediction notebook, or train their own model following the instructions in the training section below.
For the descriptor model, there are four models to choose from: "LSTM", "LSTMEncoderDecoderModel", "TransformerEncoderOnlyModel", or "TransformerModel". "LSTM" and "TransformerEncoderOnlyModel" are one-step prediction models, while "LSTMEncoderDecoderModel" and "TransformerModel" predict a descriptor sequence of a specified length.
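The practical difference at prediction time is how a multi-step forecast is produced. The sketch below (plain PyTorch, not the repository's classes; the model call signatures are assumptions) contrasts rolling out a one-step model autoregressively with calling a sequence model once:

```python
import torch

def rollout_one_step(model, window, forecast_size):
    """One-step model (e.g. LSTM, TransformerEncoderOnlyModel): predict one
    descriptor, append it to the window, slide, and repeat."""
    preds = []
    for _ in range(forecast_size):
        next_desc = model(window)                              # (batch, desc_dim)
        preds.append(next_desc)
        window = torch.cat([window[:, 1:], next_desc.unsqueeze(1)], dim=1)
    return torch.stack(preds, dim=1)                           # (batch, forecast_size, desc_dim)

def forecast_sequence(model, window, forecast_size):
    """Sequence model (e.g. LSTMEncoderDecoderModel, TransformerModel):
    predict the whole forecast in a single call."""
    return model(window, forecast_size)                        # (batch, forecast_size, desc_dim)
```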
After training the model, record the wandb run id and paste it into the prediction notebook. Then provide paths to the RAW GENERATED AUDIO DB and the Prediction DB, and run the notebook. The notebook generates new descriptors from the descriptor model and converts them back into audio.
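For reference, a checkpoint can be fetched by run id with the wandb API along these lines; the entity/project path and checkpoint filename below are placeholders, not the notebook's actual values:

```python
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/abc123")  # "<entity>/<project>/<run_id>" (placeholders)
run.file("model.pt").download(replace=True)   # checkpoint filename is an assumption
```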
The training notebook for the descriptor model is located in the folder train_notebook/.
Follow the instructions in the training notebook to train the descriptor model.
Alternatively, train the descriptor model from the command line:
python desc/train_function.py --selected_model <1 of 4 models above> --audio_db_dir <path to database> --window_size <input sequence length> --forecast_size <output sequence length>
The audio database should consist of audio files in ".wav" format.
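For example, a run with hypothetical paths and sequence lengths might look like:

python desc/train_function.py --selected_model TransformerModel --audio_db_dir ./recorded_audio_db --window_size 32 --forecast_size 8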
The prediction workflow is illustrated in the diagram below:
- The prediction database is processed into descriptor input (descriptor database II) for the descriptor model, and the descriptor model predicts the subsequent descriptors based on that input.
- The audio database is processed into descriptor database I, in which each descriptor carries an ID referencing its source audio segment.
- The query function replaces each new descriptor predicted by the descriptor model with its closest match in descriptor database I, according to the distance function.
- The audio segments referenced by the matched descriptors from the query function are combined and merged into a new audio file (a sketch of this query-and-merge step follows the list).
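Here is a minimal sketch of the query-and-merge step, reusing extract_descriptors from the earlier sketch. scikit-learn's nearest-neighbor search (Euclidean distance) and soundfile are illustrative choices; the repository's actual distance function may differ.

```python
import numpy as np
import soundfile as sf
from sklearn.neighbors import NearestNeighbors

# Descriptor database I: descriptors plus the audio segments they reference.
db_desc, db_segments = extract_descriptors("raw_generated/example.wav")
index = NearestNeighbors(n_neighbors=1).fit(db_desc)

predicted = np.random.randn(8, 13)    # stand-in for the descriptor model's output
_, idx = index.kneighbors(predicted)  # closest match for each predicted descriptor

# Merge the referenced segments into the output prediction audio file.
output = np.concatenate([db_segments[i] for i in idx[:, 0]])
sf.write("predicted_output.wav", output, 22050)
```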
The prediction notebook for the descriptor model is located in predict_notebook/.
Follow the instructions in the prediction notebook to generate new descriptors and convert them back to audio.
MelGAN and UnaGAN are used to generate many more audio files similar to those in the RECORDED AUDIO DB. This step is optional if the RECORDED AUDIO DB is already large enough for the descriptor-matching process in the query function.
For MelGAN/UnaGAN training, please use the notebooks in the folder train_notebook/. The audio database for MelGAN and UnaGAN should be the same, and please record the wandb run id of each run for sound generation.
After MelGAN and UnaGAN are trained, go to the UnaGAN generate notebook and set melgan_run_id and unagan_run_id. The output wav files will be saved to the output_dir specified in the notebook.
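A typical configuration cell in the generate notebook would look something like this (the variable names come from the notebook; the values are placeholders):

```python
melgan_run_id = "abc123"          # wandb run id recorded from MelGAN training
unagan_run_id = "def456"          # wandb run id recorded from UnaGAN training
output_dir = "generated_audio/"   # where the generated wav files will be written
```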