Speech Emotion Diarization is a technique that focuses on predicting emotions and their corresponding time boundaries within a speech recording.
The model has been trained on audio samples that include one non-neutral emotional event, which belongs to one of the following four transitional sequences:

- neutral-emotional
- neutral-emotional-neutral
- emotional-neutral
- emotional
The model's output takes the form of a dictionary comprising emotion components (neutral, happy, angry, and sad) along with their respective start and end boundaries, as exemplified below:
```python
{
    'example.wav': [
        {'start': 0.0, 'end': 1.94, 'emotion': 'n'},  # 'n' denotes neutral
        {'start': 1.94, 'end': 4.48, 'emotion': 'h'}  # 'h' denotes happy
    ]
}
```
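Downstream code can walk this dictionary directly. In the sketch below, the `diarization` variable and the `'a'`/`'s'` codes for angry/sad are hypothetical, mirroring the format above:

```python
# Hypothetical output in the format shown above.
diarization = {
    'example.wav': [
        {'start': 0.0, 'end': 1.94, 'emotion': 'n'},
        {'start': 1.94, 'end': 4.48, 'emotion': 'h'},
    ]
}

# 'a' and 's' are assumed codes for angry and sad.
EMOTIONS = {'n': 'neutral', 'h': 'happy', 'a': 'angry', 's': 'sad'}

for wav, segments in diarization.items():
    for seg in segments:
        print(f"{wav}: {seg['start']:.2f}-{seg['end']:.2f}s -> {EMOTIONS[seg['emotion']]}")
```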
The implementation is based on the popular speech toolkit SpeechBrain.
Another implementation of this project can be found here as a SpeechBrain recipe.
To install the dependencies, run:

```
pip install -r requirements.txt
```
The test set is based on the Zaion Emotion Dataset (ZED), which can be downloaded via this link.
- RAVDESS: https://zenodo.org/record/1188976/files/Audio_Speech_Actors_01-24.zip?download=1
  Unzip and rename the folder to "RAVDESS".
- ESD: Download the ESD dataset via this link. Note that the `prepare_ESD.py` script works only with this old version of the dataset. Unzip and rename the folder to "ESD".
- IEMOCAP: https://sail.usc.edu/iemocap/iemocap_release.htm
  Unzip.
- JL-CORPUS: https://www.kaggle.com/datasets/tli725/jl-corpus?resource=download
  Unzip, keep only `archive/Raw JL corpus (unchecked and unannotated)/JL(wav+txt)`, and rename the folder to "JL_corpus".
- EmoV-DB: https://openslr.org/115/
  Download [bea_Amused.tar.gz, bea_Angry.tar.gz, bea_Neutral.tar.gz, jenie_Amused.tar.gz, jenie_Angry.tar.gz, jenie_Neutral.tar.gz, josh_Amused.tar.gz, josh_Neutral.tar.gz, sam_Amused.tar.gz, sam_Angry.tar.gz, sam_Neutral.tar.gz], unzip, and move all the folders into another folder named "EmoV-DB".
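Assuming each prepared dataset keeps the folder name given above and everything is grouped under one parent directory (a convenience for illustration, not a requirement, since each path is passed separately to `train.py` below), the layout could look like:

```
datasets/
├── ZED/
├── RAVDESS/
├── ESD/
├── IEMOCAP/
├── JL_corpus/
└── EmoV-DB/
```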
A proposed Emotion Diarization Error Rate (EDER) is used to evaluate the baselines. Its four components are:

- False Alarm (FA): length of non-emotional segments that are predicted as emotional.
- Missed Emotion (ME): length of emotional segments that are predicted as non-emotional.
- Emotion Confusion (CF): length of emotional segments that are assigned to one or more incorrect emotions.
- Emotion Overlap (OL): length of non-overlapped emotional segments that are predicted to contain other, overlapped emotions apart from the correct one.
Even though frame-wise classification accuracy can also reflect a system's capability, it is not always convincing because it depends on the frame length (resolution): a higher frame-wise classification accuracy does not imply that the model diarizes better. Hence, EDER is a more suitable metric for the task.
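For illustration, here is a minimal frame-level sketch of the metric, assuming (as in the paper) that EDER is the total duration of the four error types divided by the utterance duration. The per-frame label sets used to represent overlapped predictions are an assumption of this sketch, not the repository's implementation:

```python
# Illustrative frame-level EDER, assuming
# EDER = (FA + ME + CF + OL) / utterance duration.
# With uniform frames, the duration ratio equals the frame-count ratio.
def eder(ref, hyp):
    """ref: per-frame reference labels ('n' = neutral, else one emotion code).
    hyp: per-frame predicted label sets, e.g. {'h'} or {'h', 's'}."""
    assert len(ref) == len(hyp), "ref and hyp must cover the same frames"
    fa = me = cf = ol = 0
    for r, h in zip(ref, hyp):
        h_emo = h - {'n'}          # predicted non-neutral labels
        if r == 'n':
            fa += bool(h_emo)      # FA: neutral frame predicted emotional
        elif not h_emo:
            me += 1                # ME: emotional frame predicted neutral
        elif r not in h_emo:
            cf += 1                # CF: only incorrect emotion(s) predicted
        elif len(h_emo) > 1:
            ol += 1                # OL: correct plus extra overlapped emotions
    return (fa + me + cf + ol) / len(ref)

# Example: reference is neutral then happy; one FA frame and one CF frame.
print(eder(['n', 'n', 'h', 'h'], [{'n'}, {'h'}, {'h'}, {'s'}]))  # 0.5
```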
Model configs and experiment settings can be modified in `hparams/train.yaml`.
To run the code, execute:

```
python train.py hparams/train.yaml --zed_folder /path/to/ZED --emovdb_folder /path/to/EmoV-DB --esd_folder /path/to/ESD --iemocap_folder /path/to/IEMOCAP --jlcorpus_folder /path/to/JL_corpus --ravdess_folder /path/to/RAVDESS
```
The data preparation may take a while.
A `results` directory will be generated that contains checkpoints, logs, etc. The frame-wise classification result for each utterance can be found in `eder.txt`.
The EDER (Emotion Diarization Error Rate) reported here was averaged over 5 different seeds; results for other models (wav2vec 2.0, HuBERT) can be found in the paper. You can find our training results (model, logs, etc.) here.
| Model | EDER (%) |
| --- | --- |
| WavLM-large | 30.2 ± 1.60 |
Training takes about 40 minutes per epoch on 1x RTX 8000 (40 GB); reduce the batch size if you run out of memory (OOM).
The pretrained models and an easy-to-use inference interface can be found on HuggingFace.
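As a quick-start sketch, SpeechBrain's pretrained interface can be used roughly as follows; the model ID `speechbrain/emotion-diarization-wavlm-large` and the example file name are assumptions based on the HuggingFace hub, so check the model card for the exact usage:

```python
from speechbrain.inference.diarization import Speech_Emotion_Diarization

# Assumed model ID on HuggingFace; see the model card for exact usage.
classifier = Speech_Emotion_Diarization.from_hparams(
    source="speechbrain/emotion-diarization-wavlm-large",
    savedir="pretrained_models/emotion-diarization-wavlm-large",
)

# Returns a dictionary in the format shown earlier.
diary = classifier.diarize_file("example.wav")
print(diary)
```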
```bibtex
@inproceedings{wang2023speech,
  title={Speech emotion diarization: Which emotion appears when?},
  author={Wang, Yingzhi and Ravanelli, Mirco and Yacoubi, Alya},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--7},
  year={2023},
  organization={IEEE}
}
```